ckan / ckanext-harvest

Remote harvesting extension for CKAN
130 stars 203 forks source link

Harvester crashes occasionally on record update #391

Open bonnland opened 4 years ago

bonnland commented 4 years ago

We recently updated ckanext-harvest and ckanext-spatial, and we are now seeing the harvester crashing occasionally. If you look at the following snippet from fetch_err.log, you will see that a record with GUID ds351.0 is updated, but then right afterward, the log reports that the record with GUID ds351.0 is unchanged, skipping, and then the fetch queue crashes:

2020-02-13 05:30:15,918 INFO  [ckanext.spatial.harvesters.base.import] Updated package c03d784a-d7aa-42da-81d2-be66a3aea55a with guid edu.ucar.rda::ds351.0
2020-02-13 05:30:15,933 INFO  [ckanext.harvest.queue] Received harvest object id: b68ed9e5-83dc-403a-9867-cf767fb2247a
2020-02-13 05:30:15,940 ERROR [ckanext.harvest.queue] Harvest object does not exist: b68ed9e5-83dc-403a-9867-cf767fb2247a
2020-02-13 05:30:15,945 INFO  [ckanext.harvest.queue] Received harvest object id: 1041c7a1-335e-4d4b-910f-aede40b2899b
2020-02-13 05:30:15,972 DEBUG [ckanext.spatial.harvesters.base.import] Import stage for harvest object: 1041c7a1-335e-4d4b-910f-aede40b2899b
2020-02-13 05:30:15,972 DEBUG [ckanext.spatial.harvesters.base] Using config: {u'read_only': True, u'user': u'harvest'}
2020-02-13 05:30:15,977 DEBUG [ckanext.spatial.validation.validation] Starting validation against profile(s) iso19139,dset-minimum-fields-production,geographic-extent-validator
2020-02-13 05:30:16,006 DEBUG [ckanext.spatial.validation.validation] Validated against "ISO19139 XSD Schema"
2020-02-13 05:30:16,007 DEBUG [ckanext.spatial.validation.validation] Validated against "DSET Minimum Fields Validation - Production"
2020-02-13 05:30:16,008 DEBUG [ckanext.spatial.validation.validation] Validated against "DSET Geographic Extent Validator"
2020-02-13 05:30:16,008 INFO  [ckanext.spatial.validation.validation] Validation passed
2020-02-13 05:30:16,018 DEBUG [ckanext.dsetsearch.plugin] is_parent = False
2020-02-13 05:30:16,018 DEBUG [ckanext.dsetsearch.plugin] is_child = False
2020-02-13 05:30:16,018 DEBUG [ckanext.dsetsearch.plugin] parent_id =
2020-02-13 05:30:16,193 INFO  [ckanext.spatial.harvesters.base.import] Document with GUID edu.ucar.rda::ds351.0 unchanged, skipping...
Traceback (most recent call last):
  File "/usr/lib/ckan/default/bin/paster", line 9, in <module>
    load_entry_point('PasteScript==2.0.2', 'console_scripts', 'paster')()
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/paste/script/command.py", line 102, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/paste/script/command.py", line 141, in invoke
    exit_code = runner.run(args)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/paste/script/command.py", line 236, in run
    result = self.command()
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 192, in command
    fetch_callback(consumer, method, header, body)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/queue.py", line 451, in fetch_callback
    fetch_and_import_stages(harvester, obj)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/queue.py", line 479, in fetch_and_import_stages
    obj.save()
  File "/usr/lib/ckan/default/src/ckan/ckan/model/domain_object.py", line 64, in save
    self.add()
  File "/usr/lib/ckan/default/src/ckan/ckan/model/domain_object.py", line 68, in add
    self.Session.add(self)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/sqlalchemy/orm/scoping.py", line 157, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1711, in add
    self._save_or_update_state(state)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1723, in _save_or_update_state
    self._save_or_update_impl(state)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2029, in _save_or_update_impl
    self._update_impl(state)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2003, in _update_impl
    state_str(state)
sqlalchemy.exc.InvalidRequestError: Instance '<HarvestObject at 0x7f21944e7790>' has been deleted.  Use the make_transient() function to send this object back to the transient state.

A side effect of the crash is that a duplicate package is created for this record, and we sometimes see the message "That URL exists already" when CKAN tries to create a duplicate URL. Having duplicate packages for the same record causes confusion, so we obviously want to avoid this problem if possible.

We rebuilt the database from scratch, but this did not make the problem go away. For now, I believe we can revert the harvester code to an earlier version to make the problem go away.

We are using CKAN 2.8 (in December 2019 we upgraded the source from CKAN 2.5 to CKAN 2.8). We are harvesting around 10,000 records, and the error happens randomly and occasionally on updates. Let me know if I can provide any more information.

bonnland commented 4 years ago

The harvester stops crashing if I rewind to this commit:

https://github.com/ckan/ckanext-harvest/commit/819706ae5dce5c62b2283fd8405fd8b6fa804355

I have not explored exactly which commit between this one and the latest commit on the master branch introduced the crashes. Reproducing the crash takes time; it happens rarely enough that it requires harvesting around 8,000 records and then updating some of them.

If anyone is interested in reproducing this, feel free to clone this repository of ISO records and harvest them. Wait a day or so, pull updates from the repo (introducing some updated ISO records), and reharvest. About every third reharvest causes the error for me.

Repo of ISO records: https://github.com/NCAR/dset-web-accessible-folder-dev

xellos866 commented 4 years ago

I assume we are running into the same problem with one of our custom harvesters. We are importing data via a XML-File. After about a week with no problems our fetch task crashes with the following error:

/var/log/ckan/fetch_consumer.log

(...)
2020-03-02 08:58:00,862 INFO  [<custom.harvester.name>] Document with GUID ffd2d0239583764f38ba4a8c8b280538 unchanged, skipping...

Traceback (most recent call last):
  File "/usr/lib/ckan/default/bin/paster", line 10, in <module>
    sys.exit(run())
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 102, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 141, in invoke
    exit_code = runner.run(args)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 236, in run
    result = self.command()
  File "/app/releases/2.3.0/extensions/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 192, in command
    fetch_callback(consumer, method, header, body)
  File "/app/releases/2.3.0/extensions/ckanext-harvest/ckanext/harvest/queue.py", line 447, in fetch_callback
    fetch_and_import_stages(harvester, obj)
  File "/app/releases/2.3.0/extensions/ckanext-harvest/ckanext/harvest/queue.py", line 475, in fetch_and_import_stages
    obj.save()
  File "/usr/lib/ckan/default/src/ckan/ckan/model/domain_object.py", line 64, in save
    self.add()
  File "/usr/lib/ckan/default/src/ckan/ckan/model/domain_object.py", line 68, in add
    self.Session.add(self)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/orm/scoping.py", line 157, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1711, in add
    self._save_or_update_state(state)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1723, in _save_or_update_state
    self._save_or_update_impl(state)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2029, in _save_or_update_impl
    self._update_impl(state)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2003, in _update_impl
    state_str(state)
sqlalchemy.exc.InvalidRequestError: Instance '<HarvestObject at 0x7f39b8bab250>' has been deleted.  Use the make_transient() function to send this object back to the transient state.
(...)

System:

At this harvester we just have about 70 datasets which will be imported. In total we have seven custom harvesters. Every other one has with bigger datasets (up to 10k) and 4 harvesters get data from XML-Files, but this one crashes and we are clueless at the moment why. I will try to rewind it to the suggested commit and check if it makes a difference.

xellos866 commented 4 years ago

We have now tried test runs with the current master (96f9fbc) and the issue seems to be gone. I don't know what exactly was the hiccup but maybe this solves the problem for other users, too.

~greetz

bonnland commented 4 years ago

We have now tried test runs with the current master (96f9fbc) and the issue seems to be gone

I can confirm that the harvest problems go away with this current master. Note that the package requirements in pip-requirements.txt have been updated, so these commands are necessary during upgrade:

pip install --upgrade -r pip-requirements.txt
python setup.py develop

Of course these commands are always recommended when upgrading, but in this case it's absolutely required, since the package dependencies are quite different than in earlier versions.