OCHA-DAP / Data-Team

A place for tracking data team issues

Bring FTS data into HDX repository #29

Closed JavierTeran closed 9 years ago

JavierTeran commented 10 years ago

1) Obtain the data from FTS
2) Create indicators
3) Upload the data into HDX repository

luiscape commented 10 years ago

I am assigning this to myself just for exploration. However, @rosnfeld knows this data well.

rosnfeld commented 10 years ago

Yup, I am working on this right now.

JavierTeran commented 10 years ago

Thanks. I will reassign this to Andrew just for administrative purposes.

rosnfeld commented 10 years ago

One slightly thorny issue: while most SW indicators are organized by "Country", FTS data is generally organized by "Appeal" (and this is reflected in the CHD spreadsheets). My understanding is there can be a "consolidated appeal" each year for each prolonged crisis in a region, but there can be additional "flash appeals" to handle sudden events like the typhoon in the Philippines. Funding is organized around these appeals.

How would we like to make FTS data "conform" to the country-centric nature of the CKAN setup/SW data?

My current preference, though relatively uninformed, would be to combine all appeals (consolidated and flash) for a given region-year. What do others think? Were there plans for how to handle this when the CHD was drawn up? (the original CHD spreadsheet does have an "Entity type" column, and perhaps there are yet further entity variants that need to be handled)

JavierTeran commented 10 years ago

Based on what we found in our ReliefWeb research for the CHD, the consolidation by country-year will be the most 'recurrent' data need. On each country-year collection, you could combine all appeals.

rosnfeld commented 10 years ago

Great, thanks - I will run with that.

Next question - where do the Cluster definitions come from in the CHD? I see a different set of Clusters depending on where I look. For example, CHD has Agriculture but these sites do not:

http://www.unocha.org/what-we-do/coordination-tools/cluster-coordination
https://www.humanitarianresponse.info/clusters/space/page/what-cluster-approach

Those sites have Emergency Telecommunication, but the CHD does not.

If I look at FTS data, say for Kenya in 2013, I see similar names to the CHD, but then there are also clusters like "MULTI-SECTOR ASSISTANCE TO REFUGEES". Is that the same as "Camp Coordination and Camp Management" in the CHD? Is "FOOD ASSISTANCE" the same as "Food Security" in the CHD? And I don't see "Logistics" in that FTS data.

The CHD itself has a few inconsistencies (I think this is the latest version?):

https://docs.google.com/spreadsheet/ccc?key=0AoSjej3U9V6fdGd6UFZMeXZSWmp5NDZfbEJuX1hKR1E&usp=drive_web#gid=4

e.g. FA020-FA130 have slightly different names than FA150-FA270.

The bulk of the FTS indicators are by cluster, so any guidance on this would be helpful.

JavierTeran commented 10 years ago

The CHD has passed through several hands and I personally do not know the reasons behind the definitions for FTS data. However, we are taking ownership of it. Therefore, for the sake of simplicity, let's stick with the cluster names as defined here: http://www.unocha.org/what-we-do/coordination-tools/cluster-coordination. You will have 11 clusters.

JavierTeran commented 10 years ago

Pasting message from Andrew on FTS: After playing with the FTS data/indicators for a few days I have some new perspectives. Unfortunately, I think that we probably need to do some deeper requirements work before pushing hard on implementation. (though perhaps we can do something basic for the short term)

As I noted on github, FTS data is organized by appeal and thus doesn't always map perfectly to the region-year pattern that the ScraperWiki data has. I also mentioned that the cluster definitions aren't consistent - within the CHD, inside FTS data itself, etc. I can use the official-looking clusters you recommended, though that means that the "Telecom" bucket will likely be blank throughout, as I haven't seen FTS tag any data with that cluster.

In general I think the data quality isn't great - as I showed in a visualization I sent the other week, there isn't a lot of FTS data history, and it's likely incomplete even for the years that do have data. It's perhaps an interesting challenge for the proposed data quality framework, as I suspect that the data quality varies significantly by year. (with more recent data having higher quality). I'm not sure whether to fill in "zeros" for missing data (e.g. 0 for a given cluster in a given region-year) - in many cases it's likely missing rather than zero.

I think the set of CHD indicators probably need to be revisited. Some of them don't really make sense to me, for example most of the FN-series (why have country-specific indicators like FN020 and FN040?). Some of them seem to be specified at a global scope - e.g. FY560 through FY610. FN190-210 capture some of but not all the agency types present in FTS data, and I'm not sure if this is intentional. The various "Total across category X"-type indicators seem redundant to me (no matter how the pie is sliced, the total is always the same number), but perhaps I'm misunderstanding their intent.

One could also think about adjusting the USD values for inflation. This is something I've seen in OECD data and they seem to spend a fair bit of effort towards maintaining it.

There are other minor questions to be asked - FTS reports break out "carry over" of funding from previous years, though this isn't addressed in the CHD.

I think we probably need the assistance of an FTS domain expert to answer some of these questions. That may be outside of the scope of what Sarah had intended for the near-term project of getting (some) FTS data into CKAN.

Should I just "cherry pick" the easiest indicators that have the least issues and do what I can for now, even if we may have to redo the definition of these indicators later?

Thanks for any advice you can provide, Andrew.

JavierTeran commented 10 years ago

Thanks @rosnfeld for this work. Sorry for the late response, but FTS data is still not easy for me to understand.

For the time being, let's put aside the CHD FTS definitions of indicators because, as you pointed out, many of them need to be revised and probably the whole list re-defined. I am not sure they would pass the TURC test of indicators. Also, I am concerned about replicating FTS data only partially and creating misconceptions about the data, with the obvious political implications. FTS data is very important for the project, and if we can include some indicators with a specific scope it would be great. If you can show me some indicators ('cherry' ones) with the fewest caveats, that would be great. We can always be correct as long as we define what the data includes and what it doesn't.

rosnfeld commented 10 years ago

Thanks, @JavierTeran .

For what it's worth, here are SW-style CSVs containing my latest progress on CHD indicators for COL, KEN, YEM (note that COL does not have much data in FTS): https://drive.google.com/folderview?id=0BwxtRla5zLd_d1NsWC11eWNCRGc&usp=drive_web

Hopefully the indicator descriptions are clear enough - they are meant to match up with the CHD, just "translated" somewhat for the fact that we have to work across appeals.

It's still a work in progress:

As discussed, I am a bit concerned about the quality of the cluster tagging (inconsistent cluster names seen even within FTS data), so I wouldn't feel confident including cluster-based indicators at this time. Unfortunately, that's the bulk of the CHD indicators "by count". Hopefully the remaining summary-level indicators are still worthwhile.

Also, if we're going to revisit the list of indicators, we might consider removing indicators that are simple ratios of other indicators - e.g. FY030 is basically FY040 divided by FY020, and FY640 = FY620/FY630.

So, given that these are implemented, I'd be curious to know if they're the kinds of things we'd be interested in, or should we really be computing slightly different things from FTS data.
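To illustrate the point about ratio indicators, here is a minimal sketch of deriving them on demand from the base indicators instead of storing them. The two ratio relationships come from the comment above; the function name, the input layout, and the sample numbers are assumptions made for illustration and are not taken from generate_chd_indicators.py.

```python
# Ratio indicators named in the thread: code -> (numerator, denominator).
RATIO_INDICATORS = {
    "FY030": ("FY040", "FY020"),
    "FY640": ("FY620", "FY630"),
}


def derive_ratios(values):
    """Return ratio indicators computed from base values for one region-year.

    A ratio is omitted when an input is missing or the denominator is zero.
    """
    derived = {}
    for code, (num, den) in RATIO_INDICATORS.items():
        if num in values and den in values and values[den] != 0:
            derived[code] = values[num] / values[den]
    return derived


# Made-up base values for a single region-year (not real FTS figures).
base = {"FY020": 200.0, "FY040": 150.0, "FY620": 30.0, "FY630": 0.0}
ratios = derive_ratios(base)
```

With these made-up inputs only FY030 is produced, because the FY640 denominator is zero. Skipping the ratio in that case is one possible choice; reporting it as missing would be another.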

rosnfeld commented 10 years ago

Another update: I figured out the ERF issue - while most of the FTS site and the original CHD is organized by appeal, there can be substantial funding not associated with any appeal. If you instead look at funding by "emergency" it will capture all associated appeal funding as well as this other funding. Quite often there can be an emergency without an appeal - as in the case of most COL data.

As mentioned earlier, this indicates an FTS expert might be useful here. I suspect we may want to revise other indicators to be at the "Cross-Emergency" rather than "Cross-Appeal" level.

JavierTeran commented 10 years ago

Thanks @rosnfeld. Please clarify: is this effort for the three pilot countries? Could we use your methodology for the rest of the countries, or would it require a detailed revision on a country-by-country basis? Thanks

rosnfeld commented 10 years ago

I've just been using the three pilot countries as a "demo" (it takes less time to generate the data and is easier to inspect the results), but there is no manual work involved and it's trivial to extend to the full set of countries.

I did some QA on my work yesterday and found a few small discrepancies versus FTS's own reports, though some of them are reflected on the FTS website itself - one report claims a number is one value and another report claims a number that is slightly off. I've emailed Sean Foo in Geneva (the FTS API developer) about this.

JavierTeran commented 10 years ago

@rosnfeld Great news! Thanks. I am now checking with @Aidan to see how we can put it into SW scrapers for a daily run.

rosnfeld commented 10 years ago

Ok, sounds good.

By the way, here is a quick frequency-count of the different clusters present in FTS project data: https://docs.google.com/spreadsheet/ccc?key=0AgxtRla5zLd_dE10a0NoUTVRdXoyXy1NWEk5VjJvM0E&usp=sharing

I could probably collapse a chunk of those by forcing all-caps cluster names, but it seems the standardization problem is still much worse than I feared.
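For anyone who wants to try the all-caps idea, a rough sketch of the frequency count after normalization might look like this. The sample labels are made up, and collapsing whitespace is an extra step I am assuming would help; it is not something confirmed against the FTS data.

```python
from collections import Counter


def normalize_cluster(name):
    """Upper-case a cluster name and collapse runs of whitespace."""
    return " ".join(name.upper().split())


# Illustrative raw labels, not taken from the actual FTS export.
raw = [
    "Food Security",
    "FOOD SECURITY",
    "food  security ",
    "FOOD ASSISTANCE",
    "Logistics",
]

# Count how often each normalized label occurs.
counts = Counter(normalize_cluster(n) for n in raw)
```

Case folding merges the three "food security" spellings, but "FOOD ASSISTANCE" stays separate. Merging labels like that would need a hand-maintained mapping, which is the harder part of the standardization problem.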

JavierTeran commented 10 years ago

@rosnfeld I am not sure; the grouping does not seem to be too easy.

Back to the first effort and the 21 indicators that you have: would it be too hard to run your code for all countries? I would like to present all the data to @Aidan and David so they can see the impact of including this data by the end of May through CPS. Thanks

rosnfeld commented 10 years ago

Sure, here's an export of data for all countries (takes ~45 mins to generate on my machine, despite its modest size): https://drive.google.com/folderview?id=0BwxtRla5zLd_d2l1aml6eEZVNnM&usp=drive_web

Caveats:

rosnfeld commented 10 years ago

I guess another caveat is that there are generally missing rows instead of "0"s - if that were to change, the number of rows would increase dramatically.

Also - if ScraperWiki people would prefer a file-based database (I think they use SQLite in their code?) over CSV that's not hard to do.
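As a rough idea of what the SQLite option could look like, here is a minimal sketch that loads CSV rows into a SQLite table using only the standard library. The table name, column names, and sample rows are assumptions for illustration; they are not the actual ScraperWiki schema.

```python
import csv
import io
import sqlite3

# Illustrative CSV; column names and values are assumptions, not real data.
CSV_TEXT = """indicator,region,period,value
FY020,KEN,2013,200.0
FY040,KEN,2013,150.0
"""


def load_csv_into_sqlite(csv_text, conn):
    """Create the indicator_values table if needed and insert every CSV row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indicator_values "
        "(indicator TEXT, region TEXT, period INTEGER, value REAL)"
    )
    rows = [
        (r["indicator"], r["region"], int(r["period"]), float(r["value"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO indicator_values VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # pass a file path for an on-disk database
inserted = load_csv_into_sqlite(CSV_TEXT, conn)
```

Swapping ":memory:" for a file path would produce a single database file that can be handed over in place of the CSV.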

JavierTeran commented 10 years ago

@amcguire62 We will use FTS data as a first attempt to bring data into CKAN via CPS

rosnfeld commented 10 years ago

I've made a couple updates to the FTS data:

1) (minor) I removed 2 indicators (FY030, FY640) that were just simple ratios of other indicators

2) (major) Zero is now the default assumption for each indicator-region-year, rather than leaving missing values. Size of file is now larger to reflect that. I'm not sure how I feel about this change but let's try it out. I've used 1999 as the "start year", as that's currently the first year there is any recorded FTS data, and the script looks for data up until the year after the script is run, as I noticed 2014 "plans" started to show up in FTS before 2013 was over.
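For reference, the zero-default behaviour described above can be sketched roughly as follows. This is a simplified illustration and not the code from generate_chd_indicators.py; the function name, the input layout, and the sample values are assumptions.

```python
from itertools import product

START_YEAR = 1999  # first year with any recorded FTS data, per the comment above


def zero_filled(observed, indicators, regions, run_year):
    """Return a value for every indicator-region-year, defaulting to 0.0.

    Years run from START_YEAR through run_year + 1, since next-year plans can
    show up in FTS before the current year is over.
    """
    years = range(START_YEAR, run_year + 2)
    return {
        key: observed.get(key, 0.0)
        for key in product(indicators, regions, years)
    }


# Made-up sparse input keyed by (indicator, region, year).
observed = {("FY020", "KEN", 2013): 200.0}
dense = zero_filled(observed, ["FY020", "FY040"], ["KEN", "COL"], run_year=2014)
```

Even with two indicators and two regions, the dense grid has 68 entries against a single observed value, which is why the file grows so much. It also shows the trade-off: a 0.0 in the output can no longer be told apart from a value that was simply missing.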

A generation of data for all countries using this latest code is at: https://drive.google.com/folderview?id=0BwxtRla5zLd_OFV0V1NRcHlsNVE&usp=drive_web

The latest code is in my "scratch work" repository: https://github.com/rosnfeld/un/blob/master/ocha/dap/fts/ckan_loading/generate_chd_indicators.py

but I can bring it over to the official one: https://github.com/OCHA-DAP/DAP-FTSCollector/blob/master/ckan_loading/generate_chd_indicators.py

if people like these changes.

JavierTeran commented 10 years ago

@rosnfeld Thanks. @Aidan, let's discuss tomorrow.

rosnfeld commented 10 years ago

@JavierTeran did you mean to ping @amcguire62 ?

amcguire62 commented 10 years ago

Thanks guys - I got it and my github is indeed amcguire62 :)

Aidan

On Fri, May 2, 2014 at 12:47 AM, rosnfeld notifications@github.com wrote:

@JavierTeran did you mean to ping @amcguire62 ?

JavierTeran commented 10 years ago

@rosnfeld Thanks!

amcguire62 commented 10 years ago

I have appended the notes from today's meeting to the end of the Data IN document https://docs.google.com/a/scraperwiki.com/document/d/1-sbMaTMYxCHfz1Nyz0sIAx87anxZ3xu2lvSfvwaff3M/edit#

The main point for me is that we have a number of actions for pretty much everyone (apart from Sarah, and a small one for CJ) and that the ScraperWiki team will deliver data into the RAW CKAN system by close of business on the 12th. We will hopefully get data in sooner and in order of priority, but this is our drop-dead time.

Aidan

JavierTeran commented 10 years ago

@rosnfeld When you were working on FTS data, did you come across a definition of cross-appeal? I need that for some visuals we are developing. Thanks

rosnfeld commented 10 years ago

No, sadly I just made that term up to handle the concerns I raised earlier on this issue.

As I understand it, FTS basically has a couple types of appeals: consolidated appeals for protracted emergencies (civil war, drought, etc), and flash appeals for sudden-onset emergencies (earthquake, typhoon, flooding, etc). A country can have 0 or 1 consolidated appeals (CAP) per year, and 0 to several flash appeals for a given year. FTS allows you to query "what are the appeals for a given region-year" and my "cross-appeal" approach is basically summing the requirements/funding/etc across those region-year appeals.
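In case it helps with the visuals, the "cross-appeal" idea boils down to something like the sketch below. The record layout, field names, and amounts are made up for illustration; they are not the real FTS schema or real figures.

```python
from collections import defaultdict

# Hypothetical appeal records; fields and amounts are illustrative only.
appeals = [
    {"country": "PHL", "year": 2013, "type": "flash",
     "requirements": 100.0, "funding": 60.0},
    {"country": "PHL", "year": 2013, "type": "flash",
     "requirements": 50.0, "funding": 20.0},
    {"country": "KEN", "year": 2013, "type": "consolidated",
     "requirements": 200.0, "funding": 150.0},
]


def cross_appeal_totals(records):
    """Sum requirements and funding over all appeals sharing a (country, year)."""
    totals = defaultdict(lambda: {"requirements": 0.0, "funding": 0.0})
    for rec in records:
        key = (rec["country"], rec["year"])
        totals[key]["requirements"] += rec["requirements"]
        totals[key]["funding"] += rec["funding"]
    return dict(totals)


totals = cross_appeal_totals(appeals)
```

The appeal type is deliberately ignored when summing, so consolidated and flash appeals for the same country-year end up in one figure.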

Not to sound like a broken record, but an FTS subject expert might have done things differently. We're somewhat constrained by the existing DAP schema of associating data by "region-period" - looking over the data team chat it seems Luis ran into that with UNHCR data also.

luiscape commented 9 years ago

I am closing this issue as it seems that it has been completed.

rosnfeld commented 9 years ago

Yes, and please let me know if you need any help with the code going forward.

luiscape commented 9 years ago

@rosnfeld Most certainly! Thank you for everything thus far!