cjhendrix opened this issue 10 years ago
@cjhendrix The problem is that the ReliefWeb scraper isn't scraping data from GBR. I configured it that way: the scraper only pulls data for the 73 "focus countries" defined earlier. I can adjust it to scrape data on GBR, but I don't think that data is useful.
Ok. We had decided that the baseline would have all countries, so we assumed the same for the other script-generated datasets (RW and FTS). It's up to the data team; we can:
If you want special handling of the reports when there is no data to display, please specify.
@cjhendrix I will modify the ReliefWeb scraper to scrape data from all countries that they have data about. Note that ReliefWeb covers 244 countries and territories, but I am not sure how many of them will actually have reports and disasters (the two indicators we are currently scraping).
But I agree with @bmichiels's comment above. We already have a number of datasets that do not have observations on developed countries. For instance, ACLED is, if I am not mistaken, a dataset that focuses on developing countries only. Also, there are datasets that span a historical period and contain former countries and territories such as the USSR, Yugoslavia, etc. UNHCR data, for example, identifies those as countries / territories.
I have a feeling that this will be a recurring issue. How can we find a more sustainable solution for it?
We made a decision early on to ignore historic countries. I kind of wanted to include them, but the overall feeling was to focus on current countries and add the historic stuff if users demanded it.
So, without opening that Pandora's box, we could have several possibilities:
I'm sure we could brainstorm more ideas. I think I prefer the first for its clarity and maintainability, but ultimately it's a Sarah/Data Team call as to how this should look.
Hi @cjhendrix, I agree with not including historical countries. I would even focus further and only have data on our "focus countries". That list could be assembled with a simple method, e.g. the countries that score more than 8 on InfoRM. From an analytical standpoint, having data from the UK seems almost irrelevant. But we can debate that at another time.
Of the options above, I would prefer the second. In my opinion there isn't anything more frustrating than finding a data file, downloading it (maybe programmatically!), and then being faced with an empty file. That is too much legwork just to find out there is nothing there. I would much rather not have CPS generate those files in the first place.
For me the main reason is scalability: if CPS is configured to generate files for every single one of the 243 countries and territories (is that what it is configured to do?), then we will end up with quite a large number of files in CKAN that have no data. So many that it will be hard to distinguish what has data and what doesn't (I think we are suffering from that right now). That will happen because most sources do not cover the full list of 243 countries / territories; they have data on most of them, a handful of them, or only a small subset.
If CPS is configured to generate an empty file whenever a dataset has no data on a given country, then I estimate that around 10% of all datasets will be empty.
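A back-of-the-envelope way to check that estimate, assuming we have a local directory of the CPS-generated per-country CSVs (the path and file pattern below are placeholders, not actual CPS paths), would be to count how many of them contain no observations:

```r
# Rough sketch: share of generated CSVs that have no rows.
is_empty <- function(f) {
  tryCatch(nrow(read.csv(f, stringsAsFactors = FALSE)) == 0,
           error = function(e) TRUE)  # unreadable / zero-byte files count as empty
}
files <- list.files("cps_output", pattern = "\\.csv$", full.names = TRUE)  # placeholder path
mean(vapply(files, is_empty, logical(1)))  # e.g. ~0.10 would back the 10% guess
```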
In beta 0.1 we only had focus countries, but Sarah wanted to add the rest, which made sense because it beefs up our number quite a lot. But if you can convince her, we can remove them. :-) Sarah wanted the list of focus countries (which only appears in the drop down on the home page) to be drawn from the Hum. Data and Trends List.
CPS does indeed generate a file for any country in its list, regardless of whether the file is empty or not.
I totally agree with you on it being a concern. Originally in Warp Coil, we were only going to add datasets that could be done manually. The only ones from CPS were the baseline ones, for which there is data for all countries. We didn't consider what would happen with empty countries when we added RW/FTS to CPS.
I still prefer us to produce something, since that makes it clear that it's not an oversight on our part but that there is no data for that country. Perhaps we could use a CSV of "empty outputs" from CPS to create datasets on CKAN that have "No data available" in the description, and therefore no files? Again, I think it's a decision for Sarah/Data Team.
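If we went that route, something along these lines could mark the empty datasets through the standard CKAN action API. This is only a sketch: the "empty_outputs.csv" file, its ckan_dataset_id column, and the API key handling are my assumptions, and older CKAN versions would need package_update instead of package_patch.

```r
# Sketch: set a "No data available" note on each dataset listed in a CSV of empty outputs.
library(httr)

ckan_url <- "http://data.hdx.rwlabs.org"                              # placeholder instance URL
api_key  <- Sys.getenv("CKAN_API_KEY")                                # assumed to be set in the environment
empty    <- read.csv("empty_outputs.csv", stringsAsFactors = FALSE)   # hypothetical file

for (id in empty$ckan_dataset_id) {                                   # hypothetical column name
  POST(paste0(ckan_url, "/api/3/action/package_patch"),
       add_headers(Authorization = api_key),
       body = list(id = id, notes = "No data available for this country."),
       encode = "json")
}
```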
Thanks for the explanation. What is the Hum. Data and Trends List?
I'll bring in Sarah @ochadataproject and @JavierTeran here. It seems that we need a decision point sometime soon. @cjhendrix when do you think we have to make this decision?
In my estimate, if CPS generates files for every single one of the 243 countries we have, around 10% of the generated files will have no data. For an analyst, I think that would be misleading and annoying.
Looks like we haven't moved on this and no one is complaining. When we start on the task of giving CPS control of the indicator datasets on CKAN, we will need to think of this. I'm flagging the issue as part of the metadata epic.
@cjhendrix FYI, the change is made by adjusting a parameter on the scraper. It should take about 15 minutes to implement and about 2 hours to properly test for deployment.
(PS: the scraper is based on a package / library I developed in R called ReliefWeb. It has a public release, but it needs a few updates here and there.)
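For anyone who wants to gauge how much data ReliefWeb actually has for a given country before we flip the switch, here is a rough sketch against the public ReliefWeb API. This is not the production scraper; the appname value and the filter field name are my assumptions:

```r
# Sketch: count ReliefWeb reports for one country via the public API.
library(httr)
library(jsonlite)

count_reports <- function(iso3) {
  resp <- GET("https://api.reliefweb.int/v1/reports",
              query = list(
                appname         = "hdx-coverage-check",      # placeholder app name
                `filter[field]` = "primary_country.iso3",    # assumed filter field
                `filter[value]` = tolower(iso3),
                limit           = 0))                        # only the total count is needed
  stop_for_status(resp)
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))$totalCount
}

# e.g. compare a focus country with GBR:
# count_reports("SSD"); count_reports("GBR")
```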
We would still have to identify and delete all the datasets on CKAN as well. Or, in fact, we could just delete the datasets. Then having empty indicators on CPS wouldn't matter (for now). So it seems to me all the action here is on the data team side, and I leave it up to you guys how you want to deal with it. Let me know if you think something is needed on the dev side to support what you want to do.
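If deleting is the preferred option, a sketch along these lines would do it through the same CKAN action API; the dataset id in the example is just the GBR one from this thread, and the URL / key handling are placeholders.

```r
# Sketch: soft-delete a dataset via CKAN's package_delete action.
library(httr)

ckan_url <- "http://data.hdx.rwlabs.org"   # placeholder instance URL
api_key  <- Sys.getenv("CKAN_API_KEY")     # assumed to be set in the environment

delete_dataset <- function(id) {
  POST(paste0(ckan_url, "/api/3/action/package_delete"),
       add_headers(Authorization = api_key),
       body = list(id = id), encode = "json")
}

# e.g. delete_dataset("gbr_rw_indicators")
```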
Not sure this is helpful, but my experience last week was on a topic that sounds vaguely familiar, so please bear in mind the route the user will take to get to the data. If we clearly flag that there is no data and they follow the link and find no data, what is the surprise? If we find we are sending users down a dead end and they are bouncing off at that point, we know we have to fix it. So I don't think we should worry too much about random users finding these pages; they will be the minority.
Just FYI, @amcguire62 and @cjhendrix, I have been tasked by @JavierTeran to work on this issue. I should be updating the ReliefWeb scraper on Thursday (the 16th).
Thanks for the update.
Hi there folks, the ReliefWeb scraper has been refactored. To keep things clean and separate, I created a new scraper (i.e. 'Box') on ScraperWiki. Here is the link: https://scraperwiki.com/dataset/s6ahhhn/settings
Here is the link to the output (i.e. ZIP package) : https://ds-ec2.scraperwiki.com/s6ahhhn/mszpf8o7fue7jpx/http/output.zip
The new scraper contains data since 1971 on a total of 245 countries. I will run some tests early next week with @takavarasha to make sure everything is fine, and then follow up with @cjhendrix about updating the scripts that upload that resource to CKAN and about ingesting this output into CPS.
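As part of those tests, a quick sanity check could be to download the ZIP, read the CSV inside, and confirm the coverage. The column names below (country_iso3, year) and the assumption of a single CSV in the package are guesses on my part:

```r
# Sketch: download the scraper output and check country / year coverage.
library(httr)

zip_url <- "https://ds-ec2.scraperwiki.com/s6ahhhn/mszpf8o7fue7jpx/http/output.zip"
tmp <- tempfile(fileext = ".zip")
GET(zip_url, write_disk(tmp, overwrite = TRUE))

csvs <- unzip(tmp, exdir = tempdir())
rw   <- read.csv(csvs[1], stringsAsFactors = FALSE)  # assuming a single CSV in the package

length(unique(rw$country_iso3))  # expect ~245 countries
range(rw$year, na.rm = TRUE)     # expect coverage back to 1971
```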
Let me know if anyone has any questions.
Cool. Just create an issue in the infrastructure repo for @teodorescuserban when it's ready. He can update the scripts that pull it into CKAN.
http://data.hdx.rwlabs.org/dataset/gbr_rw_indicators/resource/GBR_RW.csv
The CSV produced is empty. Is this the expected behavior for CPS? It results in an unclear situation for the user (in the CKAN preview, or as a downloaded CSV).
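To illustrate what the user ends up with, here is a quick check of that resource; column layout aside, a header-only or zero-byte CSV shows up as zero rows:

```r
# Sketch: confirm that the published GBR resource contains no observations.
library(httr)

url <- "http://data.hdx.rwlabs.org/dataset/gbr_rw_indicators/resource/GBR_RW.csv"
txt <- content(GET(url), as = "text", encoding = "UTF-8")
gbr <- tryCatch(read.csv(text = txt, stringsAsFactors = FALSE),
                error = function(e) data.frame())
nrow(gbr)  # 0 rows is what the user currently downloads
```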