ckan / ckanext-dcat

CKAN ♥ DCAT
https://docs.ckan.org/projects/ckanext-dcat

RDF job never ends if some dataset raises exception in gather stage #147

Open pduchesne opened 5 years ago

pduchesne commented 5 years ago

What happens: a DCAT RDF feed is harvested, and the gather stage fails with:

[ckanext.harvest.model] Error when processsing dataset: KeyError('title',) / Traceback (most recent call last):
  File "/home/ckan/ckan/sources/ckanext-dcat/ckanext/dcat/harvesters/rdf.py", line 211, in gather_stage
     dataset['name'] = self._gen_new_name(dataset['title'])
  KeyError: 'title'
[ckanext.harvest.queue] No harvest objects to fetch

This is clearly because one of the datasets is missing a title, which the code does not expect. The real problem, though, is that the job is never marked as finished and stays pending.
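To make the failure concrete, here is a minimal sketch; it is not the actual ckanext-dcat code. `gen_name` is a hypothetical stand-in for the harvester's `_gen_new_name`, and `gather_names` is an invented helper. Indexing `dataset['title']` raises `KeyError` for the untitled dataset; catching the error per dataset, as assumed here, would let the remaining datasets through instead of aborting the whole gather.

```python
import re


def gen_name(title):
    """Hypothetical stand-in for the harvester's _gen_new_name helper."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def gather_names(datasets):
    """Return (names, errors), recording a per-dataset error instead of aborting."""
    names, errors = [], []
    for index, dataset in enumerate(datasets):
        try:
            # Same access pattern as in the traceback: fails if 'title' is absent.
            names.append(gen_name(dataset["title"]))
        except KeyError as exc:
            errors.append("dataset %d: missing key %s" % (index, exc))
    return names, errors


names, errors = gather_names(
    [{"title": "Air Quality 2019"}, {"description": "no title here"}]
)
print(names)
print(errors)
```

Whether a dataset without a title should be skipped, reported, or given a fallback name is a design decision for the maintainers; the sketch only shows that one bad dataset need not take the others down with it.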

Possible explanation: looking at https://github.com/ckan/ckanext-dcat/blob/db7ab41e77ccd1724025fed4f30c9485ad007a4f/ckanext/dcat/harvesters/rdf.py#L233-L241, we see that any dataset error results in an empty array being returned. However, other HarvestObjects may already have been created and saved before the error happens.

Is it possible that these HarvestObjects are never marked as being in error, are left in limbo, and cause the 'harvest job run' to consider the failed job still running? That's my impression when looking at this:

https://github.com/ckan/ckanext-harvest/blob/5aad13c2f9aba738a82eeca8bb7a859e584f483b/ckanext/harvest/logic/action/update.py#L522-L534
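The suspected mechanism can be reproduced with a toy model. Everything below is an assumption made for illustration: `ToyObject`, `ToyJob`, `gather` and `run_check` are invented names, the state strings are placeholders, and the "finished" rule (no object left outside a terminal state) is my reading of the linked code, not a copy of it.

```python
from dataclasses import dataclass, field


@dataclass
class ToyObject:
    state: str = "WAITING"  # assumed initial state of a saved harvest object


@dataclass
class ToyJob:
    objects: list = field(default_factory=list)
    gather_finished: bool = False
    status: str = "Running"


def gather(job, datasets):
    """Save one object per dataset; on any error return [] without cleaning up."""
    ids = []
    try:
        for dataset in datasets:
            dataset["title"]  # raises KeyError for an untitled dataset
            job.objects.append(ToyObject())
            ids.append(len(job.objects) - 1)
    except KeyError:
        return []  # objects saved so far are left in WAITING
    return ids


def run_check(job):
    """Mark the job finished only when no object is still in progress."""
    in_progress = [o for o in job.objects if o.state not in ("COMPLETE", "ERROR")]
    if job.gather_finished and not in_progress:
        job.status = "Finished"
    return job.status


job = ToyJob()
ids = gather(job, [{"title": "ok"}, {}])  # second dataset has no title
job.gather_finished = True
status_before = run_check(job)

# Putting the orphaned objects into a terminal state lets the job finish.
for obj in job.objects:
    obj.state = "ERROR"
status_after = run_check(job)

print(ids, status_before, status_after)
```

In this model the gather stage reports nothing to fetch, yet one object saved before the exception is still waiting, so the check never sees zero in-progress objects. If the real code behaves the same way, marking the already-saved objects as errored (or deleting them) before returning the empty list would be one way to let the job finish.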