Open cmrivers opened 10 years ago
I don't think there is any good, dependable OCR software that can do this task, and a dedicated team just for this is not feasible.
I think crowdsourcing is the best way if we are expecting more such PDF documents. In the long run, one could also decouple data collection and curation (tasks which can be crowdsourced to non-math, non-coders) from the modeling and analysis work. To help the crowd, we could migrate to a less geeky, Google Docs kind of alternative.
Just giving it a try: I collated only the table pages from the Guinea dataset PDF and created a Google spreadsheet with the pivot column/row information (from 26th Aug onwards, the format is the same). It is accessible at bit.ly/ebola_guinea. I have also added the 16th Sept and 1st Oct .csv information. A few moderators could proofread and 'freeze' cells which are confirmed (revision histories help too).
We've seen it on Wiki. We've seen it on reddit. Can we expect the Internet to do its magic here again?
@cmrivers Have you looked into Tabula (http://tabula.nerdpower.org/)? If your PDFs aren't scanned images, it may be able to help you parse the data into tables faster. I'm familiarizing myself with your process -- I am a journalist in NY and am hoping to build some visualizations and an open API for this data. Do you have a list of your data sources? I may build a scraper to fetch new data every 15 minutes to power the API for this.
Yes, I use Tabula for most of my digitization efforts. Some of the Guinea data are images, though, not data-embedded PDFs, and the tables are irregular from day to day. Data sources are linked in the top-level README.
Cool, thanks. I was clicking through those -- just wanted to make sure that list was exhaustive. Ugh government data makes me nuts lol -- no consistency. Thanks again!
Agree completely. I should add that efforts to build an API are underway. You can email me if you need more details.
Make sure you first look at this file, which has tons of detailed Guinea sub-national records: https://data.hdx.rwlabs.org/dataset/rowca-ebola-cases#
Figure out a way to make clear which Guinea situation reports have already been converted from PDF (digitized), and which still need to be done. I want to keep the PDFs in the repo even after they are digitized, since they are not easily available online like the SL and Liberia sitreps.
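
One possible way to track this, offered only as a rough sketch: compare the PDF file names against the CSV file names and report which sitreps have no digitized counterpart yet. It assumes each digitized sitrep is saved as a CSV with the same base name as its PDF; that naming convention, the function names, and the directory names in the usage note are all hypothetical and not something the repo is known to follow.

```python
from pathlib import Path


def digitization_status(pdf_names, csv_names):
    """Split PDF file names into (done, todo).

    A PDF counts as done when a CSV with the same base name exists.
    The comparison is case-insensitive (assumed naming convention).
    """
    csv_stems = {Path(name).stem.lower() for name in csv_names}
    done, todo = [], []
    for name in sorted(pdf_names):
        if Path(name).stem.lower() in csv_stems:
            done.append(name)
        else:
            todo.append(name)
    return done, todo


def scan_repo(pdf_dir, csv_dir):
    """Apply digitization_status to two directories (hypothetical layout)."""
    pdfs = [p.name for p in Path(pdf_dir).glob("*.pdf")]
    csvs = [p.name for p in Path(csv_dir).glob("*.csv")]
    return digitization_status(pdfs, csvs)
```

Calling `scan_repo("guinea_pdfs", "guinea_data")` (placeholder directory names) would return the two lists, and the `todo` list could be pasted into the README or into this issue as a checklist so contributors can see what is left.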