OCHA-DAP / DAP-System

Webapp to manage DAP (workflow, data extraction)

Empty CSV and XLSX when no RW data for a country #213

Open cjhendrix opened 10 years ago

cjhendrix commented 10 years ago

http://data.hdx.rwlabs.org/dataset/gbr_rw_indicators/resource/GBR_RW.csv

The CSV produced is empty. Is this the expected behavior for CPS? It results in an unclear situation for the user (on the CKAN preview, or as a downloaded CSV).

luiscape commented 10 years ago

@cjhendrix The problem is that the ReliefWeb scraper isn't scraping data for GBR. I configured it that way. The scraper only collects data for the 73 "focus countries" defined earlier. I can adjust it to scrape data on GBR, but I think that data isn't useful.

cjhendrix commented 10 years ago

Ok. We had decided for the baseline to have all countries, so we assumed the same for the other script-generated datasets (RW and FTS). It's up to the data team, we can:

  1. modify Luis' script to pull data for all countries
  2. modify the ckan_setup script to delete all the RW country centric datasets and recreate only the ones from the focus countries list. @luiscape is the list you used the first version (based on COD focus countries) or the second (and current) version, based on the Hum. Data and Trends List?

bmichiels commented 10 years ago

If you want a special handling of the reports if there is no data to display, please specify.

luiscape commented 10 years ago

@cjhendrix I will modify the ReliefWeb scraper to scrape data from all countries that they have data about. Note that ReliefWeb covers 244 countries and territories, but I am not sure for how many of them they will actually have reports and disasters (the two indicators we are currently scraping).

But I agree with @bmichiels's comment above. We already have a number of datasets that do not have observations on developed countries. For instance, ACLED is a dataset that focuses on developing countries only, if I am not wrong. Also, there are datasets that span a historical period and contain old countries and territories such as the USSR, Yugoslavia, etc. UNHCR data, for example, identify those as countries / territories.

I have a feeling that this will be a recurring issue. How can we find more of a sustainable solution for it?

cjhendrix commented 10 years ago

We made a decision early on to ignore historic countries. I kind of wanted to include them, but the overall feeling was to focus on current countries and add the historic stuff if users demanded it.

So, without opening that Pandora's box, we could have several possibilities:

  1. change CPS to produce empty exports that simply have a "no data available" text in the first cell (probably something a little more descriptive than that). It will be irritating for users to download an empty dataset, but at least it is definitive that the data is not available.
  2. have CPS produce a "do not create dataset" list for each country that has no data and modify ckan_setup to ignore the ones on that list. The datasets would simply not exist, but users wouldn't understand why.

I'm sure we could brainstorm more ideas. I think I prefer the first for its clarity and maintainability, but ultimately it's a Sarah/Data Team call as to how this should look.
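
For illustration, the two options above could be sketched roughly as follows. This is a minimal Python sketch and not the CPS or ckan_setup code: the function names, the message wording, and the shape of the input (rows grouped by country code) are all assumptions.

```python
import csv
import io

# Hypothetical wording; the thread only asks for something descriptive.
NO_DATA_MESSAGE = "No data available for this country from this source"


def export_csv(header, rows):
    """Option 1: always produce a file, but mark empty exports explicitly."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if rows:
        writer.writerow(header)
        writer.writerows(rows)
    else:
        # Put the message in the first cell so previews and downloads are unambiguous.
        writer.writerow([NO_DATA_MESSAGE])
    return buffer.getvalue()


def do_not_create_list(rows_by_country):
    """Option 2: list the country codes whose export would be empty."""
    return sorted(code for code, rows in rows_by_country.items() if not rows)
```

With option 1 the file always exists and explains itself; with option 2 the setup script would need to read the returned list and skip those countries.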

luiscape commented 10 years ago

Hi @cjhendrix, I agree with not including historical countries. I would even focus further and only have data on our "focus countries". That list could be assembled with a simple method, e.g. the countries that score more than 8 on InfoRM. From an analytical standpoint, having data from the UK seems almost irrelevant. But we can debate that at another time.

As for the options above, I would prefer the second. In my opinion there isn't anything more frustrating than finding a data file, downloading it (maybe programmatically!), and then being faced with an empty file. Too much leg work to realize that. I would much rather not have CPS generate those files in the first place.

The main reason for me is scalability: if CPS is configured to generate files for every single one of the 243 countries and territories (is that what it is configured to do?), then we will end up with quite a large number of files with no data in CKAN. So many that it will be hard to distinguish what has data and what hasn't (I think we are suffering from it right now). That will happen because most sources do not have data on a full list of 243 countries / territories. They have data on most of those, or a handful of those, or only a small subset.

If CPS is configured to generate an empty file for those datasets that do not have data on certain countries then I am inferring that around 10% of all datasets will be empty.
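
An estimate like this could be checked rather than inferred once row counts per generated file are available. A minimal Python sketch, assuming a hypothetical mapping from file name to number of data rows (CPS does not necessarily expose such a mapping):

```python
def empty_share(row_counts):
    """Return the fraction of generated files that contain no data rows.

    row_counts maps a file or dataset name to its number of data rows;
    the mapping itself is a hypothetical input for this sketch.
    """
    total = len(row_counts)
    if total == 0:
        return 0.0
    empty = sum(1 for count in row_counts.values() if count == 0)
    return empty / total
```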

cjhendrix commented 10 years ago

In beta 0.1 we only had focus countries, but Sarah wanted to add the rest, which made sense because it beefs up our number quite a lot. But if you can convince her, we can remove them. :-) Sarah wanted the list of focus countries (which only appears in the drop down on the home page) to be drawn from the Hum. Data and Trends List.

CPS does indeed generate a file for any country in its list, regardless of whether the file is empty or not.

I totally agree with you on it being a concern. Originally in Warp Coil, we were only going to add datasets that could be done manually. The only ones from CPS were the baseline ones, for which there is data for all countries. We didn't consider what would happen with empty countries when we added RW/FTS to CPS.

I still prefer us to produce something since that makes it clear that it's not an oversight on our part, but that there is no data for that country. Perhaps we could use a CSV of "empty outputs" from CPS to create datasets on CKAN that have "No data available" in the description, and therefore no files? Again, I think it's a decision for Sarah/Data Team.
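
As a rough illustration of that last idea, a CSV of empty outputs could be turned into file-less dataset descriptions like this. It is only a Python sketch under assumptions: the CSV columns (`country_code`, `source`) are hypothetical, the dataset name pattern is guessed from the `gbr_rw_indicators` URL at the top of this issue, and the dictionaries use CKAN's `name`, `title`, `notes` and `resources` dataset fields without actually calling CKAN.

```python
import csv
import io


def placeholder_datasets(empty_outputs_csv):
    """Build file-less dataset descriptions from a CSV of empty outputs.

    The columns 'country_code' and 'source' are assumed for this sketch;
    the real CPS output may look different.
    """
    reader = csv.DictReader(io.StringIO(empty_outputs_csv))
    datasets = []
    for row in reader:
        code = row["country_code"].strip()
        source = row["source"].strip()
        datasets.append({
            # Name pattern guessed from the gbr_rw_indicators example.
            "name": f"{code.lower()}_{source.lower()}_indicators",
            "title": f"{code} {source} indicators",
            "notes": "No data available",  # shown as the dataset description
            "resources": [],               # no files attached
        })
    return datasets
```

Each dictionary could then be handed to whatever script creates the datasets on CKAN.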

luiscape commented 10 years ago

Thanks for the explanation. What is Hum. Data and Trends List?

I'll bring in Sarah @ochadataproject and @JavierTeran here. It seems that we need a decision point sometime soon. @cjhendrix when do you think we have to make this decision?

By my estimate, if CPS generates files for every single one of the 243 countries we have, around 10% of the files generated will have no data. For an analyst I think that would be deceiving and annoying.

cjhendrix commented 10 years ago

Looks like we haven't moved on this and no one is complaining. When we start on the task of giving CPS control of the indicator datasets on CKAN, we will need to think of this. I'm flagging the issue as part of the metadata epic.

luiscape commented 10 years ago

@cjhendrix FYI, the change is made by adjusting a parameter on the scraper. It should take about 15 minutes to implement and about 2 hours to properly test for deployment.

(PS: the scraper is based on a package / library I developed in R called ReliefWeb. It has a public release, but it needs a few updates here and there.)

cjhendrix commented 10 years ago

We would still have to identify and delete all the datasets on ckan as well. Or, in fact, we could just delete the datasets. Then having empty indicators on CPS wouldn't matter (for now). So it seems to me all the action here is on the data team side, so I leave it up to you guys how you want to deal with it. Let me know if you think something is needed on the dev side to support what you want to do.

amcguire62 commented 10 years ago

Not sure this is helpful, but based on my experience last week with a topic that sounds vaguely familiar, please bear in mind the route the user will take to get to the data. If we clearly flag that there is no data, and they follow the link and find no data, where is the surprise? If we find we are sending users down a dead end and they are bouncing off from that point, we know we have to fix it. So I don't think we should worry too much about random users finding pages; they will be the minority.

luiscape commented 10 years ago

Just fyi @amcguire62 and @cjhendrix, I have been tasked by @JavierTeran to work on this issue. I should be updating the ReliefWeb scraper on Thursday (16).

cjhendrix commented 10 years ago

Thanks for the update.

luiscape commented 10 years ago

Hi there folks, the ReliefWeb scraper has been refactored. To keep things clean and separate, I created a new scraper (i.e. 'Box') on ScraperWiki. Here is the link: https://scraperwiki.com/dataset/s6ahhhn/settings

Here is the link to the output (i.e. ZIP package) : https://ds-ec2.scraperwiki.com/s6ahhhn/mszpf8o7fue7jpx/http/output.zip

The new scraper contains data since 1971 and covers a total of 245 countries. I will run some tests early next week with @takavarasha to make sure everything is fine and then follow up with @cjhendrix about updating the scripts that upload that resource to CKAN, and ingesting this output into CPS.

Let me know if anyone has any questions.

cjhendrix commented 10 years ago

Cool. Just create an issue in the infrastructure repo for @teodorescuserban when it's ready. He can update the scripts that pull it to ckan.