datopian / ckanext-versioning

Deprecated. See https://github.com/datopian/ckanext-versions. ⏰ CKAN extension providing data versioning (metadata and files) based on git and github.
https://tech.datopian.com/versioning/
GNU Affero General Public License v3.0

Handling differences between metastore-lib backend and CKAN Metastore #54

Open pdelboca opened 3 years ago

pdelboca commented 3 years ago

Currently there is no handling for possible discrepancies between the two backends we use to store metadata. As a result, CKAN displays a dataset (because it exists in the CKAN Metastore) but fails when trying to edit it, because the dataset doesn't exist in the metastore-lib backend (for example, when data has not been migrated into GitHub repositories).

Example traceback:

File '/usr/lib/ckan/src/ckan/ckan/logic/action/update.py', line 334 in package_update
  item.after_update(context, data)
File '/usr/local/lib/python2.7/dist-packages/ckanext/versioning/plugin.py', line 149 in after_update
  pkg_dict['name'], datapackage, author=author)
File '/usr/local/lib/python2.7/dist-packages/metastore/backend/github/storage.py', line 109 in update
  repo = self._get_repo(package_id)
File '/usr/local/lib/python2.7/dist-packages/metastore/backend/github/storage.py', line 226 in _get_repo
  raise exc.NotFound('Could not find package {}'.format(package_id))
NotFound: Could not find package testing-versions

This also introduces a hard dependency: we cannot change or update the metastore-lib backend without a data migration. That is something that shouldn't happen, but it is worth keeping in mind while we are in the development workflow.

Is this going to be handled in a specific way?

Some scenarios I can think of:

shevron commented 3 years ago

:+1: I like the idea of fault tolerance and graceful degradation. I think migrations are much easier if systems allow for eventual consistency rather than expecting to be consistent 100% of the time.

My initial thought was to ensure that, if a dataset doesn't exist in the metastore, we rely on CKAN data and "migrate" on demand, either when the dataset is saved for the first time or when it is read for the first time (though I think the latter is slightly less preferred).
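
To make the on-demand idea concrete, here is a rough sketch of what the fallback could look like. Everything in it is an assumption for illustration: `InMemoryBackend` and the local `NotFound` class are hypothetical stand-ins (not the real metastore-lib classes), and `save_with_migration` / `read_with_fallback` are made-up helper names, not existing functions in this extension. The save path creates the package when the backend raises `NotFound`; the read path simply falls back to the CKAN metadata without migrating.

```python
class NotFound(Exception):
    """Hypothetical stand-in for the NotFound error shown in the traceback."""


class InMemoryBackend(object):
    """Hypothetical stand-in for a metastore-lib storage backend."""

    def __init__(self):
        self._packages = {}  # package_id -> list of stored revisions

    def create(self, package_id, metadata, author=None):
        self._packages[package_id] = [dict(metadata)]

    def update(self, package_id, metadata, author=None):
        if package_id not in self._packages:
            raise NotFound('Could not find package {}'.format(package_id))
        self._packages[package_id].append(dict(metadata))

    def fetch(self, package_id):
        if package_id not in self._packages:
            raise NotFound('Could not find package {}'.format(package_id))
        return self._packages[package_id][-1]


def save_with_migration(backend, package_id, datapackage, author=None):
    """Update the package; if the backend has never seen it, create it."""
    try:
        backend.update(package_id, datapackage, author=author)
        return 'updated'
    except NotFound:
        # The dataset exists in CKAN only: "migrate" it on first save.
        backend.create(package_id, datapackage, author=author)
        return 'migrated'


def read_with_fallback(backend, package_id, ckan_metadata):
    """Return the backend copy, or the CKAN metadata if none exists yet."""
    try:
        return backend.fetch(package_id)
    except NotFound:
        return ckan_metadata
```

With something like this, the `after_update` call from the traceback would go through the save helper instead of calling `update` directly, so a dataset such as `testing-versions` that was never migrated would be created in the backend on its first edit rather than raising `NotFound`.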

I am not sure how to prioritize this, but I will look into the complexity of this.