Resource ids now kept the same on reharvest (data.json harvester)

davidread commented 6 years ago

Before this PR, every harvest with the data.json harvester (DCATJSONHarvester) recreates the resources, with new IDs. This is v. problematic because datapusher/xloader will load them all in fresh tables to DataStore. Old tables are not cleaned out, so it's just not scalable if you harvest regularly.

This was fixed for the RDF XML/TTL harvester (DCATRDFHarvester) in #94 by matching resource IDs by the distribution URI. There is usually no identifier for data.json, so instead this PR finds likely matches based on URL, title and format.

_get_existing_dataset is moved from DCATRDFHarvester to the base class so that it can be used in the JSON harvester too (which has most of its code in the base class too, but moving that is a separate PR).

Why hasn't data.json got a URI or id?

Whilst you could argue it is still RDF, in practice data.json doesn't have a unique identifier. A distribution uri/id is not mentioned in the data.json spec nor is it seen in the popular ArcGIS implementation e.g. CDFW.

davidread commented 6 years ago

@amercader this is ready to merge now

rossjones commented 6 years ago

Is anyone able to review/merge this?

rossjones commented 6 years ago

I've had a look (and it looks fine) and I've also run the branch without problems on my local setup. LGTM but I'm afraid I can't merge....

camfindlay commented 5 years ago

+1 this is still a real pain with resources being dumped and recreated on every reharvest! Would be great to get a fix in.

davidread commented 5 years ago

In light of the multiple requests for it to be merged and @rossjones 's review, I'll just merge it.

ckan / ckanext-dcat

Resource ids now kept the same on reharvest (data.json harvester) #115

Why hasn't data.json got a URI or id?