ckan / ckanext-dcat

CKAN ♥ DCAT
https://docs.ckan.org/projects/ckanext-dcat

Duplicates the datasets #15

Closed montxo5 closed 10 years ago

montxo5 commented 10 years ago

When I try to harvest this RDF/XML: http://datos.madrid.es/egob/catalogo.rdf the process inserts the datasets twice. Instead of 101 datasets, 202 appear.

I've also tried with this one: http://datos.gijon.es/set.rdf and in this case it works OK.

I think the problem is some kind of redirect in Madrid's case. Would it be possible to handle these cases?

Thanks in advance!

amercader commented 10 years ago

@montxo5 That was caused by the harvesters not being careful when checking whether two requests had the same contents (a check used to detect whether the remote server supports pagination).

In Madrid's case, there are some real-time datasets whose timestamp is updated on each request:

<dct:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-06-11T03:05:31</dct:modified>
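To illustrate why a changing timestamp defeats a naive "same contents" check, here is a minimal Python sketch. It is not the actual ckanext-dcat code or fix; the names `stable_digest` and `same_page` are hypothetical, and stripping `dct:modified` with a regular expression is only an assumption made for the example.

```python
import hashlib
import re

# Hypothetical pattern: matches a whole <dct:modified ...>...</dct:modified> element.
MODIFIED_RE = re.compile(rb"<dct:modified\b[^>]*>.*?</dct:modified>", re.DOTALL)


def stable_digest(content: bytes) -> str:
    """Hash an RDF/XML payload after dropping the volatile dct:modified elements."""
    stripped = MODIFIED_RE.sub(b"", content)
    return hashlib.sha256(stripped).hexdigest()


def same_page(previous: bytes, current: bytes) -> bool:
    """Return True when two responses differ only in their dct:modified values."""
    return stable_digest(previous) == stable_digest(current)


# Two responses that differ only in the timestamp, as in Madrid's catalogue.
first = (
    b'<rdf:RDF><dct:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">'
    b"2014-06-11T03:05:31</dct:modified></rdf:RDF>"
)
second = first.replace(b"03:05:31", b"03:05:32")

# A byte-for-byte comparison treats them as different pages...
naive_equal = first == second
# ...while the timestamp-insensitive comparison treats them as the same page.
robust_equal = same_page(first, second)
```

With a byte-for-byte comparison the second request looks like a new page, so its datasets are harvested again; ignoring the volatile field makes the two responses compare as equal.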

Can you update your sources and check if you only get 101 records?

montxo5 commented 10 years ago

Thanks for the reply. Sorry, but I didn't understand what you mean by updating my sources. Do you mean reharvesting?

amercader commented 10 years ago

I meant doing git pull to update the ckanext-dcat source and reharvesting. Let me know how it goes.

montxo5 commented 10 years ago

I've updated ckanext-dcat with git pull and it's still duplicating datasets. I've also tried uninstalling and reinstalling dcat and restarting, but it still fails.

amercader commented 10 years ago

Did you restart the two harvester consumers? Ctrl+C (and start them again) if you are running them directly in the terminal, or sudo supervisorctl restart all if you are using Supervisor in production.

montxo5 commented 10 years ago

You were right, I forgot to restart the consumers... Thanks!! Now it works perfectly!

amercader commented 10 years ago

Glad you got it working! :)