ckan / ckanext-dcat

resources are re-created on reharvest #91

Closed dlax closed 6 years ago

dlax commented 7 years ago

It seems that upon reharvest of a DCAT source, the resources of a dataset are deleted and recreated.

This can be seen in the following excerpt from "fetch consumer" process logs:

2017-05-03 07:52:43,487 INFO  [ckanext.harvest.queue] Received harvest object id: 1e42effb-7dbb-492a-b8c8-e0681be0c287
2017-05-03 07:52:43,510 DEBUG [ckanext.dcat.harvesters.rdf] In DCATRDFHarvester import_stage
2017-05-03 07:52:43,759 DEBUG [ckanext.datapusher.plugin] Submitting resource e906f816-14af-40c3-99c8-42f10e8ec2f1 to DataPusher
2017-05-03 07:52:43,879 DEBUG [ckanext.datapusher.plugin] Submitting resource 8ba84cb8-dcfa-4404-9ea5-529e39e1eebc to DataPusher
2017-05-03 07:52:43,995 DEBUG [ckanext.datapusher.plugin] Submitting resource c028af71-86f8-4735-86f6-164b96ffe28b to DataPusher
2017-05-03 07:52:44,214 DEBUG [ckanext.archiver.plugin] Notified of package event: liste-des-marches-publics-conclus-par-le-departement-de-la-gironde changed
2017-05-03 07:52:44,234 DEBUG [ckanext.archiver.plugin] Comparing with revision: 2017-05-03 07:27:29.780708 94a67ddc-b973-41e5-ac88-8bc9fce4c116
2017-05-03 07:52:44,306 DEBUG [ckanext.archiver.plugin] Deleted resources - will archive. res_ids=set([u'2bd92e64-a138-46c3-a17f-962d7468ad5f', u'c3711e80-1514-4409-944f-27999512d1a6', u'bdb4934c-ade4-48c2-a97c-501a8d7f7fdf'])
2017-05-03 07:52:44,306 DEBUG [ckanext.archiver.plugin] Creating archiver task: liste-des-marches-publics-conclus-par-le-departement-de-la-gironde
2017-05-03 07:52:44,310 DEBUG [ckanext.archiver.lib] Archival of package put into celery queue priority: liste-des-marches-publics-conclus-par-le-departement-de-la-gironde
2017-05-03 07:52:44,333 INFO  [ckanext.dcat.harvesters.rdf] Updated dataset liste-des-marches-publics-conclus-par-le-departement-de-la-gironde
2017-05-03 07:52:44,356 INFO  [ckanext.harvest.queue] Received harvest object id: 09cb47c9-f93f-46aa-88a1-0ca2b62760ac
2017-05-03 07:52:44,380 DEBUG [ckanext.dcat.harvesters.rdf] In DCATRDFHarvester import_stage

As can be seen, my setup includes datapusher/datastore plugins.

I noticed that, after every reharvest, the prior resources are no longer available even though nothing changed in the source in the meantime; in particular, their UUIDs change, so one gets a 404 on the old resource URLs.

It's also worrying that resources do not seem to be cleared from the datastore, since the datapusher actually accumulates them. Upon every reharvest, datapusher's process log shows a message like:

Successfully pushed 62220 entries to "937a73d6-ba5b-46db-934e-944ca65eb2b2".
dlax commented 7 years ago

I tried to use a package_patch action instead of package_update (see https://github.com/dlax/ckanext-dcat/commit/ecaf0de5a3b4d34788274c31bd0c3696a19110b6), but this is not enough: only dataset attributes are considered by the patch action, while the resources data is passed straight through to package_update, which leads to ckan.lib.dictization.model_save.package_resource_list_save(), where the resources get replaced.
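
For reference, the attempted call looked roughly like this (a minimal sketch, not the actual commit; `context` and `dataset_dict` stand in for what the harvester's import stage already has):

```python
import ckan.plugins.toolkit as toolkit

# package_patch merges the given top-level keys into the stored dataset dict...
toolkit.get_action('package_patch')(context, {
    'id': dataset_dict['id'],
    'resources': dataset_dict['resources'],
})
# ...but the merged dict is then handed to package_update, which still goes
# through package_resource_list_save() and replaces the resource rows, so
# the resource ids change anyway.
```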

metaodi commented 7 years ago

This is actually not a simple problem, because if you want to update a resource, you need to have a way to identify that a resource in CKAN is the same as the one you are harvesting. Afaik in DCAT-AP EU a distribution does not have a unique identifier, so there is no certain way to know which distribution in DCAT equals which resource in CKAN. You could use the accessURL or the title, but these values can change.

It is my understanding that because of this uncertainty, the current implementation of ckanext-dcat simply re-creates all resources.

In DCAT-AP Switzerland, we have an optional identifier attribute on a distribution. So we decided to update those resources that we can map using this attribute, and re-create all the others (see the relevant code of our updated DCAT harvester).

TkTech commented 7 years ago

> afaik in DCAT-AP EU a distribution does not have a unique identifier

Woah, is this true? This is a huge flaw in any kind of exchange format. There are very few cases when working with any kind of data where having no primary key equivalent is acceptable.

metaodi commented 7 years ago

@TkTech at least it's not in the mapping table: https://github.com/ckan/ckanext-dcat/blob/master/README.md#rdf-dcat-to-ckan-dataset-mapping

dlax commented 7 years ago

> This is actually not a simple problem, because if you want to update a resource, you need to have a way to identify that a resource in CKAN is the same as the one you are harvesting. Afaik in DCAT-AP EU a distribution does not have a unique identifier, so there is no certain way to know which distribution in DCAT equals which resource in CKAN. You could use the accessURL or the title, but these values can change.

In RDF, every resource may be identified by its URI, so this is the unique identifier that should be used for harvesting. There's nothing specific to the DCAT vocabulary here.

So I think that, if the harvested RDF document exposes dcat:Distribution nodes with a URI (i.e. like in https://github.com/ckan/ckanext-dcat/blob/v0.0.6/examples/dataset.rdf#L74, and not as blank nodes like in https://github.com/ckan/ckanext-dcat/blob/v0.0.6/examples/catalog_datasets_list.rdf#L30), the harvester should keep track of a mapping from distribution URIs to CKAN resources and update the latter upon re-harvest. If this information is not available, then replacing resources is probably the only way to go.
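
To illustrate, here is a minimal rdflib sketch (not ckanext-dcat code; `dataset.rdf` is a placeholder input file) of how a harvester could tell the two cases apart:

```python
from rdflib import Graph, URIRef, BNode, Namespace

DCAT = Namespace('http://www.w3.org/ns/dcat#')

g = Graph()
g.parse('dataset.rdf', format='xml')  # placeholder input document

for _dataset, _pred, dist in g.triples((None, DCAT.distribution, None)):
    if isinstance(dist, URIRef):
        # Stable identifier: str(dist) could be stored as resource['uri']
        # and used to match the CKAN resource on the next harvest.
        print('trackable distribution:', dist)
    elif isinstance(dist, BNode):
        # No stable identifier: replacing the resource is the only fallback.
        print('blank-node distribution, cannot be tracked across harvests')
```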

> It is my understanding that because of this uncertainty, the current implementation of ckanext-dcat simply re-creates all resources.

It seems to me that the actual problem is that ckanext-dcat just assumes that dcat:Distribution nodes are always blank nodes.

metaodi commented 7 years ago

@dlax good point, I didn't really think about it that way. Yes, this seems feasible by simply adding the URI as a field to the resource (much like we did with the explicit "identifier" field in DCAT-AP Switzerland). That way the resource could be identified and updated.

camfindlay commented 7 years ago

Just an idea: could the harvester create and store a simple md5 or sha1 hash of each distribution resource item for a harvested dataset, and do a quick comparison the next time it harvests the same dataset? No change in the hash would mean no change in the resource properties (title, URI etc.), so it could skip reharvesting that resource. A change in the hash could trigger the reharvest, or a further comparison to see if something as simple as the title has changed (but the URI is the same); in that case, just update the title, don't kill and reharvest the resource. Otherwise, if the URI has changed or it's a new unknown resource, harvest it in.

The quick comparison of a hash would have low overhead and be a good way to indicate whether something has changed since the last harvest (then you could dedicate processing power to the changes, rather than the blanket "kill all the things and reharvest" that seems to happen in the DCAT JSON harvester at present).
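
A rough sketch of what such a fingerprint could look like (the field list is illustrative; none of this exists in ckanext-dcat):

```python
import hashlib
import json

def resource_fingerprint(resource_dict):
    # Hash only the harvested fields, ignoring CKAN-generated ones like 'id'.
    tracked = {key: resource_dict.get(key)
               for key in ('name', 'description', 'url', 'format')}
    payload = json.dumps(tracked, sort_keys=True).encode('utf-8')
    return hashlib.sha1(payload).hexdigest()

# On reharvest: same fingerprint -> skip the resource; different ->
# update it in place rather than deleting and recreating it.
```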

amercader commented 7 years ago

@dlax @metaodi URIs are already stored in the resources if present in the original distributions (and they are explicitly marked as missing if they are not present, i.e. resource['uri'] = None):

https://github.com/ckan/ckanext-dcat/blob/master/ckanext/dcat/profiles.py#L807

So the approach you suggest, checking at harvest time whether an existing dataset has a resource with the same URI, should be really easy to implement. We already have the existing_dataset here:

https://github.com/ckan/ckanext-dcat/blob/master/ckanext/dcat/harvesters/rdf.py#L299

So iterating over the new resources and checking whether any of them has the same URI as one of the existing ones should be easy. If a match is found, it would probably be enough to assign the existing resource's id to the new resource to have it updated, but this needs to be double-checked.
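
Something along these lines (a sketch under the assumptions above, not the actual implementation; existing_dataset and dataset_dict mirror the names used in import_stage):

```python
def assign_existing_resource_ids(existing_dataset, dataset_dict):
    # Map distribution URIs to the ids of the resources already in CKAN,
    # skipping resources that were harvested from blank nodes (uri is None).
    existing_by_uri = {
        res['uri']: res['id']
        for res in existing_dataset.get('resources', [])
        if res.get('uri')
    }
    for resource in dataset_dict.get('resources', []):
        existing_id = existing_by_uri.get(resource.get('uri'))
        if existing_id:
            # Reusing the id should make package_update modify the existing
            # resource instead of deleting and recreating it (to be verified).
            resource['id'] = existing_id
```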

amercader commented 7 years ago

@camfindlay Do you mean hashing all the fields in the resource, or the resource contents? In any case, I think it's more complicated than it seems as an approach to tracking changes in the whole resource. If we rely on using URIs properly, the approach in my previous comment seems more straightforward.

camfindlay commented 7 years ago

I was thinking of the fields, but come to think of it, if a dataset hasn't changed at all then perhaps a harvester shouldn't waste its cycles on it, right? Perhaps that hash idea could apply at the dataset + resource level.

However, I agree with your simpler approach as a first step, @amercader. How do we progress this? We are finding it a real pain point currently (resources, and hence Data API URIs, getting destroyed on reharvests).

amercader commented 7 years ago

@camfindlay Unfortunately I don't have time to work on this for the next few weeks but I'll be happy to provide more guidance or review any patches that address this.