OCHA-DAP / DAP-System

Webapp to manage DAP (workflow, data extraction)

Epic: Track changes to a data series and push the date of the latest change to CKAN #298

Open cjhendrix opened 9 years ago

cjhendrix commented 9 years ago

The goal is to allow all the information about a ckan indicator (which is simply a ckan dataset that is coming from CPS) to be maintained in one place: CPS. CPS would then have the ability to push changes to this information (let's call it Ancillary Indicator Information, AI2) to CKAN via CKAN's action API.

The goal of this epic is to set up the framework for this using a high value test case, described below.

Consider all the indicators returned from this search: https://data.hdx.rwlabs.org/dataset?q=fts+cross-appeal Note that the "Updated By" date for all of them is July 7, which is the date when the ckan datasets were created. However, the data on CPS has been updated at least weekly since then, and CKAN has no way of knowing this. This epic will result in these dates being updated by CPS whenever a change is made to the data series. Later we will expand this approach to allow all of the AI2 to be managed in CPS.

The list of AI2 to be managed by CPS will ultimately include:
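
For concreteness, here is a minimal sketch of what such a push could look like against CKAN's action API. The CKAN URL, API key handling, dataset name and extras passed in are placeholders for illustration, not the agreed design:

```python
# Minimal sketch: push CPS-managed metadata to a CKAN dataset via the action API.
# CKAN_URL, API_KEY and the extras passed in are placeholders.
import requests

CKAN_URL = "https://data.hdx.rwlabs.org"
API_KEY = "<cps-api-key>"

def push_metadata(dataset_name, extras):
    """Fetch the dataset, merge the CPS-managed extras, and push it back."""
    headers = {"Authorization": API_KEY}

    # package_show returns the full dataset dict
    resp = requests.post(CKAN_URL + "/api/3/action/package_show",
                         json={"id": dataset_name}, headers=headers)
    resp.raise_for_status()
    dataset = resp.json()["result"]

    # merge the CPS-side values into the existing extras (CKAN stores them
    # as a list of {"key": ..., "value": ...} pairs)
    merged = {e["key"]: e["value"] for e in dataset.get("extras", [])}
    merged.update(extras)
    dataset["extras"] = [{"key": k, "value": v} for k, v in merged.items()]

    # package_update overwrites the dataset with the merged dict
    resp = requests.post(CKAN_URL + "/api/3/action/package_update",
                         json=dataset, headers=headers)
    resp.raise_for_status()
    return resp.json()["result"]
```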

rufuspollock commented 9 years ago

I'm a bit unclear on why you wouldn't manage all of this info directly in CKAN. It already has the capability to store this kind of info, and that saves you having to reinvent the wheel by adding support for it in CPS (and then pushing it back across into CKAN).

cjhendrix commented 9 years ago

We use CPS to import and normalize data and maintain referential integrity. Since at least some of the info has to be maintained on the CPS side, our data managers feel it would be easier to manage all of it there. This is just for those datasets that we curate, not user contributed datasets which live solely on CKAN.

rufuspollock commented 9 years ago

@cjhendrix I guess the question is why you couldn't maintain all the info on the CKAN side here, based on DRY principles. Generally, I think it would be really useful (for me) to understand a bit more about the overall architecture, especially of CPS, to understand what is being done where and how, as I can then offer more useful input :-)

cjhendrix commented 9 years ago

Note to Sam. Understood that this one will likely carry over multiple sprints given your availability.

seustachi commented 9 years ago

The biggest difficulty I see here is that CPS does not know about the curated datasets.

Instead, the curated datasets know about CPS.

If we add some kind of mapping, allowing CPS to know which curated datasets to update when some data (or metadata) changes are detected, we still have 2 places to maintain. If we add a new indicator, we have to create it in CPS, create the curated dataset, and they both must know about each other.

So we don't follow the DRY principle, and I am not sure this will be simpler for the data team.

The gain here would be that once this is set up, the updates should be replicated.

I think we should have a call dedicated to this topic.

seustachi commented 9 years ago

So after discussion, here is the plan:

There is a 1-to-1 relationship between data series and ckan datasets. So if we detect a change in the data or metadata for a data series, we can push it to the dataset.

What I can do already is the following: 1) Add some fields to the dataserie table:

2) Set up a job that will search for data series where Last metadata update > Last metadata push, push the metadata to ckan, and update the Last metadata push value (sketched below).
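
A rough sketch of what step 2 could look like. The table and column names (dataserie, last_metadata_update, last_metadata_push, ckan_dataset_name) mirror the fields proposed above but are illustrative, not the actual CPS schema; push_to_ckan would wrap the action API call:

```python
# Hedged sketch of the proposed push job. Schema names are assumptions;
# push_to_ckan would wrap the CKAN action API call.
import datetime
import sqlite3  # stand-in for the real CPS database

def run_metadata_push_job(conn, push_to_ckan):
    """Push metadata for every data series changed since its last push."""
    rows = conn.execute(
        "SELECT id, ckan_dataset_name FROM dataserie "
        "WHERE last_metadata_push IS NULL "
        "   OR last_metadata_update > last_metadata_push").fetchall()
    for serie_id, dataset_name in rows:
        push_to_ckan(dataset_name, serie_id)
        conn.execute(
            "UPDATE dataserie SET last_metadata_push = ? WHERE id = ?",
            (datetime.datetime.utcnow().isoformat(), serie_id))
    conn.commit()
```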

rufuspollock commented 9 years ago

@seustachi it would be super useful to get a bit of a diagram here to understand what is going on - as mentioned, you'll want to be careful about not ending up with your authoritative metadata in 2 places (and getting stuff out of sync).

cjhendrix commented 9 years ago

@seustachi The key thing we need to urgently solve is the high value test case listed in the original issue above. If I understand your last comment above, it sounds like you are putting that one as secondary. Happy to discuss, but I think you need to focus your effort on that one.

seustachi commented 9 years ago

@cjhendrix I'm not putting it as a secondary priority.

Detecting a change related to a data series is a prerequisite. Knowing how data series and datasets are related is also a prerequisite.

Then we will be able to push information to CKAN.

cjhendrix commented 9 years ago

Ok, thanks for the clarification.

seustachi commented 9 years ago

So, we agreed that:

LastUpdateDate changes only if at least one value was added or updated.

seustachi commented 9 years ago

List of the extras keys we want to use:

"dataset_source" for the sourceName "dataset_source_code" for the source code

"indicator_type" for the IT Name
"indicator_type_code" for the IT code

"dataset_date": "11/02/2014-11/20/2014", for the date range of the data

"dataset_summary"
"methodology"
"more_info"
"terms_of_use"
"validation_notes_and_comments"
seustachi commented 9 years ago

The format of the action we want to use is documented here: https://gist.github.com/alexandru-m-g/09155dff01e8302acf47

seustachi commented 9 years ago

More info here: https://docs.google.com/document/d/1KqOQtDGgu-HE1VFDGg8te8fP9adlh1HHMBWv5muAQmg/edit

seustachi commented 9 years ago

@cjhendrix @alexandru-m-g I don't remember what we decided about the change to the dataset names.

Do we keep a human-readable title (title_with_underscore___sourceCode) or do we want (indTypeCode_SourceCode)?

I think I remember that CJ preferred the human-readable one. If we do that, we have to manage the title in CPS (to be able to push updates). Is that what we want?

teodorescuserban commented 9 years ago

Please, when in doubt about any names, favor human-readable over anything else, and URL slug over human-readable.

cjhendrix commented 9 years ago

@seustachi It's the former, for example: https://data.hdx.rwlabs.org/dataset/proportion_of_the_population_using_improved_sanitation_facilities___mdgs

Alex is making the change in sprint 46 (2 week sprint starting 5 Jan): https://github.com/OCHA-DAP/hdx-ckan/issues/1771

As for managing the title in CPS, that should be fine. The only thing we shouldn't manage is the "name", which is used for the URL.

seustachi commented 9 years ago

So to sum it up, we need to have a reference to the ckan dataset name in CPS.

I see at least 2 places where it could belong in the data model.

As discussed with Alex a month ago, we could also create the ckan dataset from CPS, if it does not exist yet.

Please note we will need a setup phase, to have all datasets initialized in CPS.
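
A hedged sketch of the create-if-missing idea, using package_show to test whether the dataset already exists; the dataset dict contents and error handling are simplified placeholders:

```python
# Sketch: create the ckan dataset from CPS if it does not exist yet,
# otherwise update it. Everything outside the action names is a placeholder.
import requests

def ensure_dataset(ckan_url, api_key, dataset_dict):
    headers = {"Authorization": api_key}
    resp = requests.post(ckan_url + "/api/3/action/package_show",
                         json={"id": dataset_dict["name"]}, headers=headers)
    if resp.status_code == 404:
        action = "package_create"   # not known to CKAN yet
    else:
        resp.raise_for_status()
        action = "package_update"   # already exists, push the CPS-side values
    resp = requests.post(ckan_url + "/api/3/action/" + action,
                         json=dataset_dict, headers=headers)
    resp.raise_for_status()
    return resp.json()["result"]
```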

seustachi commented 9 years ago

What has been decided for updating CKAN:

alexandru-m-g commented 9 years ago

@cjhendrix @seustachi I've created the spreadsheet with the mapping that we were talking about. It also contains the source code and indicator type code. Let me know if I've missed something.

https://docs.google.com/spreadsheets/d/1WSv34vNBUFKr6m12wIhyKvU5hn8vgSo6bN4eIYUx6CA/edit?usp=sharing

cjhendrix commented 9 years ago

Tagging @luiscape so he is aware.

luiscape commented 9 years ago

@alexandru-m-g I need permission to access the Gdoc.

alexandru-m-g commented 9 years ago

@luiscape you should be able to access it now, right?

seustachi commented 9 years ago

What we want now is to trigger the metadata update when a new indicator value is added or an existing one is changed, because we need to update the date range of the values.
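
For example, recomputing the date range could be as simple as the following sketch; the source of the value dates is a placeholder, and the output format matches the dataset_date example earlier in the thread:

```python
# Sketch: recompute the "dataset_date" extra from the dates of all values
# in the series, in the MM/DD/YYYY-MM/DD/YYYY format used above.
def dataset_date_range(value_dates):
    """value_dates: iterable of datetime.date objects, one per value."""
    start, end = min(value_dates), max(value_dates)
    return "{0}-{1}".format(start.strftime("%m/%d/%Y"), end.strftime("%m/%d/%Y"))
```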

seustachi commented 9 years ago

We also want to update the date of the last "update" of the dataset. Check with @alexandru-m-g whether we store it on the dataset or on the resource. This is a new metadata field, updated whenever the data is updated.

seustachi commented 9 years ago

@cjhendrix Moved to sprint 48.

Even though we started implementing this epic in sprint 46, and some work was also done in sprint 47, some sub-tasks are still pending and planned for sprint 48 or later.