OCHA-DAP / Data-Team

A place for tracking data team issues

Bring ReliefWeb data into HDX repository #26

Closed JavierTeran closed 10 years ago

JavierTeran commented 10 years ago

- Create a scraper to access ReliefWeb data
- Define indicators based on ReliefWeb data
- Upload ReliefWeb data into HDX repository

luiscape commented 10 years ago

The API's 'job' and 'training' entities are not returning all the data. Apparently they only return entries that are currently 'open'; entries that have been 'closed' in the system are not returned at all. Effectively, this means that the API isn't returning historical data.

I've submitted a ticket to ReliefWeb and am trying to debug this issue with them. For the time being, jobs and trainings data will not be included in the first dump. All the following will:

luiscape commented 10 years ago

The data in the form of year vs. country has been collected and standardized. The summaries of the data can be found here: https://github.com/luiscape/ocha-rw-creating-indicators/tree/master/data-summary
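For readers unfamiliar with the year vs. country layout, here is a minimal Python sketch of how such a table could be built from individual records. It is not the code in the linked repository; the record fields (`country`, `year`) and the sample values are assumptions made purely for illustration.

```python
from collections import Counter


def count_by_country_year(records):
    """Count how many records fall on each (country, year) pair."""
    return Counter((r["country"], r["year"]) for r in records)


def to_table(counts):
    """Pivot the counts into {country: {year: n}}, filling gaps with 0."""
    countries = sorted({country for country, _ in counts})
    years = sorted({year for _, year in counts})
    return {
        country: {year: counts.get((country, year), 0) for year in years}
        for country in countries
    }


# Hypothetical sample records; real records would come from the scraper.
records = [
    {"country": "Kenya", "year": 2012},
    {"country": "Kenya", "year": 2012},
    {"country": "Kenya", "year": 2013},
    {"country": "Haiti", "year": 2013},
]

table = to_table(count_by_country_year(records))
for country, per_year in table.items():
    print(country, per_year)
```

Zero-filling the missing country/year combinations keeps every row the same width, which makes the per-country files easier to compare.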

I am working to have a file per country that contains all the indicators from ReliefWeb in it.

luiscape commented 10 years ago

Here you can find all the data organized so far: https://github.com/luiscape/ocha-rw-creating-indicators

The indicators from ReliefWeb can be found in the following folders:

There were problems extracting data from the following ReliefWeb "entities":

There isn't much I can do about jobs and trainings. I've reported the issue to Shuichi and am waiting for a response. As for the sources issue, I am looking for a way to hack the API.

The missing part of this task is the METADATA file. I've created a simple metadata file here: https://github.com/luiscape/ocha-rw-creating-indicators/blob/master/METADATA.csv

With that, I am closing the issue and opening another one for sources only. Please open it again if you think otherwise. @JavierTeran @takavarasha

JavierTeran commented 10 years ago

@luiscape Thanks Luis, this is great. Let's wait until next Wednesday (April 30) for an answer from Shuichi to complete the other two indicators (jobs and training). Otherwise we close this.

@Aidan (have to add him) Two indicators have been defined and completed: Number of disasters and Number of reports per country per year. The organization is indicator-centric and country-centric.

luiscape commented 10 years ago

@Aidan and @JavierTeran I've put the two indicators extracted (Number of Reports and Number of Disasters) in the CPS format: https://github.com/luiscape/ocha-rw-creating-indicators/tree/master/cps-export

Notice that:

Please advise on these so I can implement the code.

luiscape commented 10 years ago

Also @rosnfeld I assumed that the is_number column was a validation test that checked if there was a number in the value column. If true I add a 1, if false I add a 0. Is that right?

Finally, what is the correct value for the units column? I added integer, but have no idea if that is correct ...

rosnfeld commented 10 years ago

Yes, 1 for numeric and 0 for non-numeric. It's actually not from the validation work, ScraperWiki always had that column in there. Some of their indicators are just strings, e.g. they have an "indicator" for the name of a country.

The units column seems to be fairly ad-hoc in the ScraperWiki data, so I wouldn't worry about it too much. Lots of scraperwiki indicators use "count" as the units, which I think might work for you?

I'm not sure if the CPS import actually looks at either "is_number" or "units".
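The `is_number` rule described above (1 for numeric, 0 for non-numeric) can be sketched in Python as follows. This is only an illustration under stated assumptions: the column names `period`, `value`, `units` and `is_number` come from this thread, while `indicator` and `region`, the helper names, and the sample values are hypothetical and may not match the real CPS format.

```python
import math


def is_number(value):
    """Return 1 if value parses as a finite number, otherwise 0."""
    try:
        return 1 if math.isfinite(float(value)) else 0
    except (TypeError, ValueError, OverflowError):
        return 0


def make_row(indicator, region, period, value, units="count"):
    """Build one export row; `indicator` and `region` are assumed column names."""
    return {
        "indicator": indicator,
        "region": region,
        "period": period,
        "value": value,
        "units": units,
        "is_number": is_number(value),
    }


# Hypothetical values, for illustration only.
numeric_row = make_row("Number of Reports", "KEN", 2013, "128")
string_row = make_row("Country name", "KEN", 2013, "Kenya", units="")
print(numeric_row)
print(string_row)
```

Treating `nan` and `inf` as non-numeric is a design choice of this sketch, not something confirmed for the ScraperWiki data.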

luiscape commented 10 years ago

Thank you so much for the quick answer @rosnfeld . Most things cleared now. I'll add count instead of integer.

@JavierTeran I just added data from ReliefWeb's sources. The API doesn't serve data on when each organization joined ReliefWeb. So, on the period tab I simply added 2014. Let me know if you would like another approach.

Here is the latest data: https://github.com/luiscape/ocha-rw-creating-indicators/tree/master/cps-export

luiscape commented 10 years ago

@JavierTeran here you can find the code I wrote that runs on ScraperWiki's server: https://github.com/luiscape/reliefweb-scraperwiki-collector

It needs validation and some polishing, but it should be working fine on ScraperWiki's platform -- which is wonderful. Next week I will check with Dragon about further integrating it with our other scrapers.

It's not totally done, but I think we can rest assured that ReliefWeb data will be in our platform in a recurrent and sustainable way. :+1:

JavierTeran commented 10 years ago

@luiscape Thanks, awesome! @Aidan @David Are we discussing the next steps tomorrow?

luiscape commented 10 years ago

Hi @JavierTeran and @takavarasha I believe most of the tasks in this issue have been completed by now:

From the above, only the last needs some work. (I am unsure who is responsible for that, though.)

JavierTeran commented 10 years ago

@luiscape As discussed, can you manually create an indicator for RW number of reports? I know it is not a sustainable solution, but we will do this while we find a way to handle the duplicate name problem in CKAN (or is it in CPS?)

luiscape commented 10 years ago

Hi @JavierTeran, yes I can without problems. I'll leave it ready today, and upload it to the HDX Repo once the problem has been fixed. I am also having a conversation with @cjhendrix here about finding a more sustainable solution. The bigger issue is there will be other instances in which indicators have the same name. For example:

A simple solution would be to add the source name to the indicator name, giving something like "Number of Disasters by ReliefWeb" or similar. That solution isn't elegant, though, and I would prefer something a little better. But let's see how that discussion evolves.
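A minimal Python sketch of that naming workaround, assuming indicators arrive as (name, source) pairs. The function name and the second source ("Other Source") are hypothetical placeholders; this is not how CKAN or CPS handle names.

```python
def disambiguate(indicators):
    """Append ' by <source>' to any indicator name shared by several sources."""
    sources_per_name = {}
    for name, source in indicators:
        sources_per_name.setdefault(name, set()).add(source)

    return [
        f"{name} by {source}" if len(sources_per_name[name]) > 1 else name
        for name, source in indicators
    ]


# "Other Source" is a made-up placeholder for a second data provider.
indicators = [
    ("Number of Disasters", "ReliefWeb"),
    ("Number of Disasters", "Other Source"),
    ("Number of Reports", "ReliefWeb"),
]

names = disambiguate(indicators)
for name in names:
    print(name)
```

Only names that actually collide are renamed, so indicators that are already unique keep their original name.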

luiscape commented 10 years ago

Closing, as the CHD coding scheme has changed.