ckan / ckanext-spatial

Geospatial extension for CKAN
http://docs.ckan.org/projects/ckanext-spatial

Duplicate datasets after some time #245

Closed by jeanpommier 3 years ago

jeanpommier commented 4 years ago

Hi, I'm harvesting GeoNetwork CSW sources with daily updates. Harvesting mostly works, but from time to time a dataset is added instead of updated, so I get duplicates. Over time, I can end up with several duplicates of the same original dataset. See for instance:
https://www.geo2france.fr/ckan/dataset/inventaire-du-patrimoine-culturel-du-canton-de-villers-bocage-80
https://www.geo2france.fr/ckan/dataset/inventaire-du-patrimoine-culturel-du-canton-de-villers-bocage-801

It seems that at some point the harvester lost track of the original dataset: if I look up the corresponding rows in the harvest_object table in the DB, I only get references to the most recent dataset:

WITH pkg AS (select id from package where name LIKE 'inventaire-du-patrimoine-culturel-du-canton-de-villers-bocage-8%')
SELECT * FROM harvest_object WHERE package_id IN (SELECT id FROM pkg);

This returns 2 rows, one current and one older, both matching inventaire-du-patrimoine-culturel-du-canton-de-villers-bocage-801 and neither matching inventaire-du-patrimoine-culturel-du-canton-de-villers-bocage-80. Any idea what's happening here?
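
The same check can be generalized to list every dataset the harvester no longer tracks, that is, packages with no row at all in harvest_object. The following is only a minimal sketch: it assumes nothing beyond the columns used in the query above (package.id, package.name, harvest_object.package_id), and it runs against an in-memory SQLite stand-in filled with made-up ids, whereas the real CKAN database is PostgreSQL with many more columns. The function names are hypothetical.

```python
import sqlite3

# Packages matching a name pattern that have no harvest_object row at all.
# The LEFT JOIN keeps packages without a match; COUNT() over the joined
# column is then 0 for them.
ORPHAN_QUERY = """
SELECT p.name
FROM package AS p
LEFT JOIN harvest_object AS ho ON ho.package_id = p.id
WHERE p.name LIKE ?
GROUP BY p.id, p.name
HAVING COUNT(ho.package_id) = 0
ORDER BY p.name
"""


def find_untracked(conn, name_pattern):
    """Return names of packages that no harvest object points to."""
    return [row[0] for row in conn.execute(ORPHAN_QUERY, (name_pattern,))]


def build_demo_db():
    """Build a tiny stand-in database; ids are invented for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE package (id TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE harvest_object (id TEXT PRIMARY KEY, package_id TEXT);

        INSERT INTO package VALUES
          ('pkg-old', 'inventaire-du-patrimoine-culturel-du-canton-de-villers-bocage-80'),
          ('pkg-new', 'inventaire-du-patrimoine-culturel-du-canton-de-villers-bocage-801');

        -- Both harvest objects (one current, one older) point at the duplicate.
        INSERT INTO harvest_object VALUES
          ('ho-1', 'pkg-new'),
          ('ho-2', 'pkg-new');
        """
    )
    return conn
```

Calling `find_untracked(build_demo_db(), 'inventaire-du-patrimoine-culturel-du-canton-de-villers-bocage-8%')` on this stand-in data reports only the original `-80` dataset, which mirrors the situation described above. Note that manually created (non-harvested) datasets would also show up, so the name pattern or an additional filter matters on a real instance.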

FYI, I'm using the ckanext-scheming extension with a custom schema, but I don't see why it would affect the new/updated/deleted detection in https://github.com/ckan/ckanext-spatial/blob/master/ckanext/spatial/harvesters/csw.py#L113

jeanpommier commented 3 years ago

I needed to update the ckanext-harvest extension. Version 1.3.0 or later is fine.
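
For anyone wanting to check their own install, here is a rough sketch of a version check. It assumes Python 3.8+ (for `importlib.metadata`) and that the distribution is installed under the name `ckanext-harvest`; both are assumptions, and the helper names are hypothetical. The parsing is deliberately crude and ignores pre-release suffixes.

```python
from importlib import metadata

# Minimum ckanext-harvest version reported as fine in this thread.
MINIMUM = (1, 3, 0)


def parse_version(text):
    """Turn '1.3.0' into (1, 3, 0), keeping only leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_recent_enough(version_text, minimum=MINIMUM):
    """Compare a version string against the minimum, padding short versions with zeros."""
    parsed = parse_version(version_text)
    padded = parsed + (0,) * (len(minimum) - len(parsed))
    return padded >= minimum


def installed_harvest_version(dist_name="ckanext-harvest"):
    """Return the installed version string, or None if the package is not found."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_harvest_version()
    if found is None:
        print("ckanext-harvest does not appear to be installed in this environment")
    else:
        print(found, "OK" if is_recent_enough(found) else "older than 1.3.0, consider upgrading")
```

Run it inside the same virtualenv as CKAN, otherwise it will inspect the wrong set of installed packages.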