Closed dlax closed 6 years ago
I tried to use a `package_patch` action instead of `package_update` (see https://github.com/dlax/ckanext-dcat/commit/ecaf0de5a3b4d34788274c31bd0c3696a19110b6), but this is not enough: only dataset attributes are considered in the patch action, and the resources data is just passed on to `package_update`, which leads to `ckan.lib.dictization.model_save.package_resource_list_save()`, where the resources' replacement occurs.
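To illustrate why the patch route doesn't help, here is a minimal standalone sketch (plain dicts, no CKAN imports) of the shallow merge that `package_patch` performs before handing everything to `package_update`. The dataset and resource values are made up for the example:

```python
# package_patch merges the supplied dict over the package_show output and
# then calls package_update, so the 'resources' list is replaced wholesale.
existing = {
    "name": "my-dataset",
    "notes": "old notes",
    "resources": [{"id": "res-1", "url": "http://example.com/a.csv"}],
}
patch = {
    "notes": "new notes",
    # Harvested resource dicts carry no CKAN id...
    "resources": [{"url": "http://example.com/a.csv"}],
}

merged = dict(existing)
merged.update(patch)  # shallow merge, as package_patch does

# ...so package_resource_list_save() sees id-less resources and re-creates them.
print(merged["resources"])  # [{'url': 'http://example.com/a.csv'}]
```

The shallow merge preserves top-level dataset attributes not mentioned in the patch, but anything inside the `resources` list is replaced as a unit.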
This is actually not a simple problem, because if you want to update a resource, you need a way to identify that a resource in CKAN is the same as the one you are harvesting. AFAIK, in DCAT-AP EU a distribution does not have a unique identifier, so there is no certain way to know which distribution in DCAT equals which resource in CKAN. You could use the `accessURL` or the title, but these values can change.
It is my understanding that because of this uncertainty, the current implementation of ckanext-dcat simply re-creates all resources.
In DCAT-AP Switzerland, we have an optional `identifier` attribute on a distribution. So we decided to update those resources that we can map using this attribute, and re-create all the others (see the relevant code of our updated DCAT harvester).
> afaik in DCAT-AP EU a distribution does not have a unique identifier
Woah, is this true? This is a huge flaw in any kind of exchange format. There are very few cases when working with any kind of data where having no primary key equivalent is acceptable.
@TkTech at least it's not on the mapping table: https://github.com/ckan/ckanext-dcat/blob/master/README.md#rdf-dcat-to-ckan-dataset-mapping
> This is actually not a simple problem, because if you want to update a resource, you need to have a way to identify that a resource in CKAN is the same as the one you are harvesting. Afaik in DCAT-AP EU a distribution does not have a unique identifier, so there is no certain way to know which distribution in DCAT equals which resource in CKAN. You could use the accessURL or the title, but these values can change.
In RDF, every resource may be identified by its URI, so this is the unique identifier that should be used for harvesting. There's nothing specific to the DCAT vocabulary here.
So I think that, if the harvested RDF document exposes `dcat:Distribution` nodes with a URI (i.e. like in https://github.com/ckan/ckanext-dcat/blob/v0.0.6/examples/dataset.rdf#L74, and not as blank nodes such as in https://github.com/ckan/ckanext-dcat/blob/v0.0.6/examples/catalog_datasets_list.rdf#L30), the harvester should keep track of a mapping from distribution URIs to CKAN resources and update the latter upon re-harvest. If this information is not available, then replacing resources is probably the only way to go.
> It is my understanding that because of this uncertainty, the current implementation of ckanext-dcat simply re-creates all resources.
It seems to me that the actual problem is that ckanext-dcat just assumes that `dcat:Distribution` nodes are always blank nodes.
@dlax good point, didn't really think about it that way. Yes this seems to be feasible by simply adding the URI as a field to the resource (much like we did with the explicit "identifier" field in DCAT-AP Switzerland). That way the resource could be identified and updated.
Just an idea: could the harvester create and store a simple MD5 or SHA-1 hash of each distribution resource item for a harvested dataset and do a quick comparison the next time it harvests the same dataset? No change in the hash would mean no change in the resource properties (title, URI, etc.), so it could skip re-harvesting that resource. A change in the hash could trigger the re-harvest, or a further comparison to see whether something as simple as the title has changed (while, say, the URI stayed the same); in that case, update the title rather than killing and re-harvesting the resource. Otherwise, if the URI has changed or it's a new, unknown resource, harvest it in.

The quick comparison of a hash would have low overhead and would be a good way to indicate whether something has changed since the last harvest (then you could dedicate processing power to the changes rather than the blanket "kill all the things and re-harvest" that seems to happen in the DCAT JSON harvester at present).
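A hedged sketch of the fingerprinting idea, hashing the harvested resource fields (not the file contents); the function name and dicts are illustrative, not ckanext-dcat code:

```python
import hashlib
import json

def resource_fingerprint(resource):
    """Digest the fields of a resource dict.

    json.dumps with sort_keys=True gives a stable serialization, so the
    same fields always produce the same digest.
    """
    serialized = json.dumps(resource, sort_keys=True)
    return hashlib.sha1(serialized.encode("utf-8")).hexdigest()

old = {"title": "Report", "uri": "http://example.org/dist/1"}
new = {"title": "Report (2017)", "uri": "http://example.org/dist/1"}

print(resource_fingerprint(old) == resource_fingerprint(old))  # True: unchanged
print(resource_fingerprint(old) == resource_fingerprint(new))  # False: title changed
```

An unchanged fingerprint would let the harvester skip the resource entirely; a changed one would trigger the field-by-field comparison described above.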
@dlax @metaodi URIs are already stored in the resources if present in the original distributions (and they are explicitly marked as missing if they are not, i.e. `resource['uri'] = None`):

https://github.com/ckan/ckanext-dcat/blob/master/ckanext/dcat/profiles.py#L807
So the approach that you suggest, checking at harvest time whether an existing dataset has a resource with the same URI, should be really easy to implement. We already have the `existing_dataset` here:

https://github.com/ckan/ckanext-dcat/blob/master/ckanext/dcat/harvesters/rdf.py#L299
So iterating over the new resources and checking whether some of them have the same URI as one of the existing ones should be easy. If one is found, it would probably be enough to assign the current `id` to the new resource to have it updated, but this needs to be double-checked.
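A minimal sketch of that matching step, with made-up ids and URIs (the `existing_dataset` shape mirrors what the harvester has available; whether reusing the `id` is sufficient is, as said, still to be verified):

```python
# Before calling package_update, carry over the id of any existing resource
# whose 'uri' matches an incoming one, so CKAN updates it in place instead
# of re-creating it.
existing_dataset = {
    "resources": [
        {"id": "aaa-111", "uri": "http://example.org/dist/1", "url": "a.csv"},
        {"id": "bbb-222", "uri": None, "url": "b.csv"},  # was a blank node
    ]
}
new_resources = [
    {"uri": "http://example.org/dist/1", "url": "a-updated.csv"},
    {"uri": None, "url": "c.csv"},
]

# Only resources with a real URI can be matched; uri=None entries are skipped.
existing_by_uri = {
    r["uri"]: r["id"] for r in existing_dataset["resources"] if r.get("uri")
}
for resource in new_resources:
    existing_id = existing_by_uri.get(resource.get("uri"))
    if existing_id:
        resource["id"] = existing_id  # reuse id -> update instead of recreate

print(new_resources[0]["id"])  # aaa-111
```

Resources without a URI still fall back to the current delete-and-recreate behaviour.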
@camfindlay Do you mean hashing all the fields in the resource, or the resource contents? In any case, as an approach to track changes in the whole resource I think it's more complicated than it seems. If we rely on using URIs properly, the approach in my previous comment seems more straightforward.
I was thinking of the fields, but come to think of it, if a dataset hasn't changed at all then perhaps a harvester shouldn't waste its cycles on it, right? Perhaps that hash idea could apply at the dataset + resource level.
However, I agree with your simpler approach as a first step, @amercader. How do we progress this? We are finding it a pain point currently (resources, and hence Data API URIs, getting destroyed on re-harvests).
@camfindlay Unfortunately I don't have time to work on this for the next few weeks but I'll be happy to provide more guidance or review any patches that address this.
It seems that upon re-harvest of a DCAT source, the resources of a dataset are deleted and re-created. This can be seen in the following excerpt from the "fetch consumer" process logs:
As can be seen, my setup includes the datapusher/datastore plugins. I noticed that, on every re-harvest, the prior resources are no longer available even though nothing changed in the source in the meantime; in particular, their UUIDs changed, so one gets a 404 on the old resource URLs.

It's also worrying that the resources do not seem to be cleared from the datastore, as the datapusher actually accumulates them. Upon every re-harvest, in the datapusher's process log, I get a message like: