Change granularity of SW and other big imports

Current situation:

We import many indicator values into CPS from ScraperWiki (SW). SW publishes files on CKAN daily. We call the main one the "ScraperWiki dataset", but there are others.

The published file contains every value, for every country, for a fairly large fixed list of indicators.

Everything is republished every day; SW does not compute a diff against the previous publication.

CPS tries to import all the values every time we import a file. From the CPS point of view, there is no simple way to know what is new and what is not, so we try to write every single value to the database and silently ignore the errors.
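To make the behaviour above concrete, here is a minimal sketch of a "write everything, ignore the failures" import. It is not the actual CPS code: the `indicator_value` table, its columns, the `import_all` function and the indicator code `IND_A` are hypothetical, and SQLite stands in for whatever database CPS really uses.

```python
import sqlite3


def import_all(conn, rows):
    """Try to write every published row; rows already stored are skipped.

    Returns (attempted, inserted) so the cost of re-reading everything
    can be compared with the number of values that were actually new.
    """
    before = conn.total_changes
    attempted = 0
    for row in rows:
        attempted += 1
        # INSERT OR IGNORE drops rows that violate the primary key,
        # which mimics "silently ignoring the errors".
        conn.execute(
            "INSERT OR IGNORE INTO indicator_value "
            "(indicator, country, period, value) VALUES (?, ?, ?, ?)",
            row,
        )
    conn.commit()
    return attempted, conn.total_changes - before


# Hypothetical schema: one value per indicator, country and period.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE indicator_value ("
    "indicator TEXT, country TEXT, period INTEGER, value REAL, "
    "PRIMARY KEY (indicator, country, period))"
)

published = [("IND_A", "KEN", 2012, 1.0), ("IND_A", "KEN", 2013, 2.0)]
first = import_all(conn, published)
# The next day the whole file is republished with one extra value.
second = import_all(conn, published + [("IND_A", "KEN", 2014, 3.0)])
```

In this sketch the second run attempts 3 writes but inserts only 1 row, which is the pattern that makes the full import slow while changing very little.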

The state of CPS at the end of the process is correct: new data is there, and old data is not duplicated. But we have 2 issues:

1. The import is a long process (2 hours or more) and is likely to get longer over time.
2. No one knows exactly what happens during the import. We plan to give feedback to the user after an import, but a report covering 300,000+ updates is too complicated to be usable.

Most of the indicators are updated no more than once a year. Processing them every day makes no sense.

What I recommend:

Split the content of the file following 2 rules:

1. Group indicators with similar update frequencies, and possibly similar types of content (to make the files easier to understand).
2. Then, for each of those groups, put data up to 2013 in one file and data from 2014 onward in another. This makes a diff on the SW side unnecessary. For updates, we just reprocess the data from 2014 onward, which is cheap (in both development cost and CPU time).
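The two rules above can be sketched as follows. This is only an illustration under stated assumptions: the `GROUPS` mapping, the group names, the indicator codes and the function names are hypothetical (the real grouping has to come from the data team), and the cutoff assumes that 2013 itself belongs to the older file.

```python
from collections import defaultdict

# Assumption: data from this year onward goes into the "recent" file.
CUTOFF_YEAR = 2014

# Hypothetical mapping from indicator code to update-frequency group.
GROUPS = {"IND_A": "yearly", "IND_B": "daily"}


def split_rows(rows, groups=GROUPS, cutoff=CUTOFF_YEAR):
    """Bucket rows by (group, era), one bucket per published file."""
    files = defaultdict(list)
    for indicator, country, year, value in rows:
        group = groups.get(indicator, "ungrouped")
        era = "recent" if year >= cutoff else "historical"
        files[(group, era)].append((indicator, country, year, value))
    return dict(files)


def files_to_reprocess(files):
    """Only the 'recent' files need to be re-imported on an update."""
    return {key: rows for key, rows in files.items() if key[1] == "recent"}


rows = [
    ("IND_A", "KEN", 2012, 1.0),
    ("IND_A", "KEN", 2014, 2.0),
    ("IND_B", "KEN", 2015, 3.0),
    ("IND_C", "KEN", 2010, 4.0),  # not in GROUPS, so it lands in "ungrouped"
]
files = split_rows(rows)
recent = files_to_reprocess(files)
```

With this layout an update only touches the "recent" files, and a rarely updated group can be scheduled less often than a frequently updated one.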

If we agree on that, what do we need?

For 1 (the split into groups of indicators), we need the data team's contribution.

For 2, this is just technical coordination between SW, CKAN and CPS: namely Dragon, CJ and Samuel.

OCHA-DAP / DAP-System

Change granularity of SW and other big imports #220