@montxo5 That was caused by the harvesters not being careful when checking whether two requests returned the same contents (the check used to tell if the remote server supports pagination).
In Madrid's case, there are some real-time datasets whose timestamp gets updated on each request:
<dct:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-06-11T03:05:31</dct:modified>
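For reference, here's a minimal sketch of the kind of normalization that avoids this, assuming the pagination check compares raw page contents; the regex and helper name are illustrative assumptions, not the actual ckanext-dcat code:

```python
import hashlib
import re

# Illustrative only: strip volatile dct:modified values before hashing,
# so two fetches of the same page compare equal even when the remote
# server rewrites the timestamp on every request.
MODIFIED_RE = re.compile(r'<dct:modified[^>]*>[^<]*</dct:modified>')

def page_fingerprint(content):
    """Hash the page with per-request timestamps blanked out."""
    normalized = MODIFIED_RE.sub('<dct:modified/>', content)
    return hashlib.sha256(normalized.encode('utf-8')).hexdigest()

# Two requests that differ only in the timestamp now match, so the
# harvester can correctly detect that the server ignores pagination:
page_1 = '<dct:modified rdf:datatype="...">2014-06-11T03:05:31</dct:modified>'
page_2 = '<dct:modified rdf:datatype="...">2014-06-11T03:07:02</dct:modified>'
assert page_fingerprint(page_1) == page_fingerprint(page_2)
```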
Can you update your sources and check if you only get 101 records?
Thanks for the reply. Sorry, but I didn't understand what you mean by updating my sources. Do you mean re-harvesting?
I meant doing `git pull` to update the ckanext-dcat source, and then re-harvesting.
Let me know how it goes.
I've updated ckanext-dcat with `git pull` and it's still duplicating datasets. I've also tried uninstalling and reinstalling the extension and restarting, but it still fails.
Did you restart the two harvester consumers? Ctrl+C and relaunch them if you are running them directly in the terminal, or `sudo supervisorctl restart all` if you are using Supervisor in production.
You were right, I forgot to restart the consumers... Thanks!! Now it works perfectly!
Glad you got it working! :)
When I try to harvest this RDF/XML file: http://datos.madrid.es/egob/catalogo.rdf, the process inserts the datasets twice. Instead of 101, 202 datasets appear.
I've also tried with this one: http://datos.gijon.es/set.rdf, and in that case it works fine.
I think the problem is some kind of redirect in Madrid's case. Would it be possible to handle these cases?
Thanks in advance!