Closed davidread closed 5 years ago
@amercader this is ready to merge now
Is anyone able to review/merge this?
I've had a look (and it looks fine) and I've also run the branch without problems on my local setup. LGTM but I'm afraid I can't merge....
+1 this is still a real pain with resources being dumped and recreated on every reharvest! Would be great to get a fix in.
In light of the multiple requests for it to be merged and @rossjones 's review, I'll just merge it.
Before this PR, every harvest with the data.json harvester (DCATJSONHarvester) recreates the resources, with new IDs. This is v. problematic because datapusher/xloader will load them all in fresh tables to DataStore. Old tables are not cleaned out, so it's just not scalable if you harvest regularly.
This was fixed for the RDF XML/TTL harvester (DCATRDFHarvester) in #94 by matching resource IDs by the distribution URI. There is usually no identifier for data.json, so instead this PR finds likely matches based on URL, title and format.
_get_existing_dataset
is moved from DCATRDFHarvester to the base class so that it can be used in the JSON harvester too (which has most of its code in the base class too, but moving that is a separate PR).Why hasn't data.json got a URI or id?
Whilst you could argue it is still RDF, in practice data.json doesn't have a unique identifier. A distribution uri/id is not mentioned in the data.json spec nor is it seen in the popular ArcGIS implementation e.g. CDFW.