OCHA-DAP / Data-Team

A place for tracking data team issues

Run a QA for completeness of the XLS collections in CKAN #40

Closed JavierTeran closed 10 years ago

JavierTeran commented 10 years ago
  1. Simply open each collection and verify there is data in there.
  2. For a 'small' subset (a sample of at least 1/4 of all collections), compare figures from the collection against the original sources. Note that each collection at this stage has passed the CPS test, so the collections are curated.
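
The two steps above could be sketched roughly as follows. This is only an illustration: `pick_sample` and `has_data` are hypothetical helper names, and each collection is assumed to have already been loaded as a list of rows with the header in the first row (reading the XLS files themselves is left out).

```python
import math
import random


def pick_sample(collection_ids, fraction=0.25, seed=0):
    """Return a random sample covering at least `fraction` of the collections."""
    ids = sorted(collection_ids)
    size = math.ceil(len(ids) * fraction)  # round up so we never go below 1/4
    return random.Random(seed).sample(ids, size)


def has_data(rows):
    """True if any row after the header holds a non-blank cell."""
    return any(
        any(str(cell).strip() for cell in row if cell is not None)
        for row in rows[1:]
    )
```

Step 1 would call `has_data` on every collection; step 2 would compare figures by hand only for the ids returned by `pick_sample`.
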
JavierTeran commented 10 years ago

@takavarasha Let's discuss. There are some collections in CKAN with no data (I haven't yet identified the extent of the problem). For instance, PVX040 is empty because ACLED changed the way they disseminate data and our scraper can't get the data anymore. We need to repoint our scraper to the right location. @luiscape any suggestion on where to administer this?

luiscape commented 10 years ago

@JavierTeran My hunch is that we are getting empty datasets every time the scrapers fail, for whatever reason. The fact that empty data travels from the point of failure all the way to the end user is probably a system error. If a scraper fails, the data scraped last should be displayed instead. On that point, should we ask Andrew / the dev-team to build another validation test to check whether there is any data in the final files?
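
A minimal sketch of that fallback, assuming a scrape result is a list of rows with the header first. The names `has_rows` and `choose_published` are made up for illustration and are not part of the actual system.

```python
def has_rows(rows):
    """True if there is at least one data row after the header."""
    return len(rows) > 1


def choose_published(new_rows, last_good_rows):
    """Publish the new scrape only if it contains data.

    Returns (rows_to_publish, scrape_ok). When the new scrape is empty,
    the last good scrape is kept and scrape_ok is False so the failure
    can be reported.
    """
    if has_rows(new_rows):
        return new_rows, True
    return last_good_rows, False
```

Note that this only catches completely empty results; a scrape that returns some rows but not all of them would still pass.
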

We can properly assess the issue by doing what you suggested (opening files and comparing data). That should give us a clearer picture. Is this something you need my help with as well?

rosnfeld commented 10 years ago

Yes, this is definitely an issue I've raised before. There can be more complicated versions of it - for example World Bank scraping can often fail "half-way through", leaving you with partial results. I am quite sure the dev team is aware of it, though I'm not sure what github issues exist. (and unfortunately it's not really something I can help with at the moment given my lack of knowledge of the newly developed system)

JavierTeran commented 10 years ago

Thanks @luiscape @rosnfeld. The incomplete scraping issue is definitely known to @cjhendrix and the dev team. However, at the moment there is no way to catch it other than opening the collection and checking that all the data is there. One alternative I suggested was to ask Scraperwiki to provide a count of records per year every time the scraper runs. That way, at least we know that the data is consistent day after day. It may be wrong and inaccurate, but at least it is complete. Something like this: World Bank PVX040 235 records in 2010, 233 records in 2011, ... and so on. If we see that the count varies significantly (I can define the acceptable range empirically), a notification should pop up.
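
The per-year count check could look roughly like the sketch below. It assumes each record is a dict with a `year` key and uses a placeholder 10% tolerance; both are assumptions for illustration, since the real range would be set empirically and Scraperwiki's output format is not specified here.

```python
from collections import Counter


def count_by_year(records):
    """Count records per year; each record is assumed to be a dict with a 'year' key."""
    return Counter(r["year"] for r in records)


def flag_changes(previous, current, tolerance=0.10):
    """Return {year: (before, after)} for years whose count moved by more
    than `tolerance` (as a fraction of the previous count) between two runs."""
    flagged = {}
    for year in sorted(set(previous) | set(current)):
        before = previous.get(year, 0)
        after = current.get(year, 0)
        if before == 0:
            changed = after != 0  # a year that appears from nowhere is suspicious too
        else:
            changed = abs(after - before) / before > tolerance
        if changed:
            flagged[year] = (before, after)
    return flagged
```

Any non-empty result from `flag_changes` would be the trigger for the notification.
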

JavierTeran commented 10 years ago

@takavarasha How is the process going? Please let @luiscape know if you need any help. Let me know if we can have an estimate of the extent of the problem by tomorrow at 10 am, when we have the meeting with Management. Thanks,

takavarasha commented 10 years ago

I expect to finish this task by COB today (15 March). I will share a Google spreadsheet with all the indicators that I find to have missing data.

takavarasha commented 10 years ago

I have checked the collections and found 18 with problems. In brief: 6 collections do not have a resource (XLS file) at all; for 2 collections, clicking the "Go to resource" link results in an error; and 10 collections (UNDP (7), UNDESA (2) and ACLED (1)) have XLS files without any data in them.

I have recorded my detailed findings here: https://docs.google.com/spreadsheets/d/1uxl3IM9mbXKA3MmA2x4M9-YnjDDhjZmyl2eCf9nNN4Q/edit#gid=0

JavierTeran commented 10 years ago

Thanks @takavarasha for this work. We need to find the current/correct location of those collections to feed CPS and re-run the scraper. Could you include the dataset URL for each of the problematic collections? Please let me know if the request is not clear. Thanks, Javier

takavarasha commented 10 years ago

I have added the URLs for all except two (ACLED and HDR).

JavierTeran commented 10 years ago

Thanks @takavarasha

JavierTeran commented 10 years ago

@amcguire62 Let me know when you want to discuss. Thanks

takavarasha commented 10 years ago

Closing this, as the original issue was resolved. Some of the problems found with the data are being resolved here: https://github.com/OCHA-DAP/DAP-System/issues/222