geopython / pycsw

pycsw is an OGC CSW server implementation written in Python. pycsw fully implements the OpenGIS Catalogue Service Implementation Specification [Catalogue Service for the Web]. Initial development started in 2010 (more formally announced in 2011). The project is certified OGC Compliant, and is an OGC Reference Implementation. pycsw allows for the publishing and discovery of geospatial metadata via numerous APIs (CSW 2/CSW 3, OpenSearch, OAI-PMH, SRU). Existing repositories of geospatial metadata can also be exposed, providing a standards-based metadata and catalogue component of spatial data infrastructures. pycsw is Open Source, released under an MIT license, and runs on all major platforms (Windows, Linux, Mac OS X). Please read the docs at https://pycsw.org/docs for more information.
https://pycsw.org
MIT License

Metadata version control with git #170

Open isedwards opened 11 years ago

isedwards commented 11 years ago

Try using a Git repository as a pycsw backend, with CSW providing the search interface and Git serving as an alternative to CSW-T.

Raised here: http://osgeo-org.1560.x6.nabble.com/General-CSW-questions-tp5067534p5067661.html

isedwards commented 11 years ago

I prefer Git for most things, but perhaps fossil-scm is also worthy of attention on this ticket (with its single-file repository based on an SQLite database, and its immutable history)?

The GeoNetwork experience (using svn) is described here: "Not all records in GeoNetwork are tracked as the compute and systems admin cost of this tracking for every record, particularly in large catalogs, is too high." http://geonetwork-opensource.org/manuals/trunk/eng/users/managing_metadata/versioning/index.html

tomkralidis commented 11 years ago

Git seems like a good first step to implement an scm backend design pattern, which we can then apply to fossil-scm, svn, etc.

Some options/thinking out loud:

Auth: I haven't given much thought yet to access control against specific elements; however, it would be best to leverage an existing auth mechanism rather than creating one inline.

Migrations: the way the pycsw repository works, it is kind of agnostic to the structure of metadata records per se, but we should look into DB migrations regardless, for times where the underlying model itself changes.

rclark commented 11 years ago

The first bullet seems very tractable, and would make for a great demonstration of the idea.

The second point would be required in the end, although honestly the major benefit of a git backend would be that you could manage the metadata content without CSW-T.

Third point is less intriguing to me -- again, less interested in CSW-based access to versioning. CSW's primary focus should be on search and discovery, and we can let real-life version control systems do the version control.

It would also be worth exploring Git as a more efficient mechanism for harvesting than CSW's protocol.
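As a rough illustration of Git-based harvesting, a harvester could ask Git which record files changed between two commits and touch only those. The sketch below parses the output of `git diff --name-status <old> <new>`; the function name, the `upsert`/`delete` grouping, and the `records/` paths are assumptions made for this example, not anything pycsw provides.

```python
from typing import Dict, List


def parse_name_status(diff_output: str) -> Dict[str, List[str]]:
    """Group paths from `git diff --name-status` output by harvest action."""
    changes: Dict[str, List[str]] = {"upsert": [], "delete": []}
    for line in diff_output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        status = fields[0]
        if status.startswith("D"):
            # Deleted file: the record should leave the catalogue.
            changes["delete"].append(fields[1])
        elif status.startswith(("R", "C")):
            # Rename/copy lines look like "R<score>\t<old>\t<new>".
            if status.startswith("R"):
                changes["delete"].append(fields[1])
            changes["upsert"].append(fields[2])
        else:
            # A (added), M (modified), and anything else: re-read the file.
            changes["upsert"].append(fields[1])
    return changes


if __name__ == "__main__":
    sample = (
        "A\trecords/a.xml\n"
        "M\trecords/b.xml\n"
        "D\trecords/c.xml\n"
        "R100\trecords/d.xml\trecords/e.xml\n"
    )
    print(parse_name_status(sample))
```

Only the files listed under `upsert` would need to be re-read and re-indexed, which is where the efficiency gain over paging through a full CSW harvest might come from.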

What would be stellar would be a git repo as a replacement for, not in addition to, the spatial database, but then you would certainly need some other mechanism for indexing... Maybe something like CouchDB is another backend to consider?

kalxas commented 11 years ago

Mercurial would also be a good choice as a back-end, since it is written in Python and is very similar to Git.

Regarding CouchDB, there is an open issue #120 :)

tomkralidis commented 11 years ago

@rclark good points here. I think a Git repo as the backend is a good next step.

Backends in pycsw are extensible. So something like pycsw/plugins/repository/git/git.py would be required, with the same setup/signatures as https://github.com/geopython/pycsw/blob/master/pycsw/plugins/repository/geonode/geonode_.py or https://github.com/geopython/pycsw/blob/master/pycsw/plugins/repository/odc/odc.py, adding insert, update, delete functions which would be the CSW-T functions to interact with Git.

I think this would be very easy to do for Git transactions, with a few config switches to detect that it's a Git backend, as well as username/password credentials.
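To make the transaction idea concrete, here is a minimal sketch of insert/update/delete mapped onto Git commits. It is not the pycsw repository plugin interface: the class name `GitRecordStore`, the one-XML-file-per-record layout, and the injectable `run` callable are all assumptions for illustration. Credentials are left to Git's own configuration, and the working tree is assumed to be an already initialised repository.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence


def _run_git(args: Sequence[str], cwd: Path) -> None:
    """Run a git command in the working tree, raising if it fails."""
    subprocess.run(["git", *args], cwd=str(cwd), check=True)


class GitRecordStore:
    """Hypothetical store: one XML file per record, one commit per transaction."""

    def __init__(self, workdir: str,
                 run: Callable[[Sequence[str], Path], None] = _run_git) -> None:
        self.workdir = Path(workdir)
        self.run = run

    def _path(self, identifier: str) -> Path:
        # Assumption: identifiers are safe file names; real ones need escaping.
        return self.workdir / f"{identifier}.xml"

    def _commit(self, stage_args: Sequence[str], message: str) -> None:
        # Stage the change, then record it as a single commit.
        self.run(stage_args, self.workdir)
        self.run(["commit", "--quiet", "-m", message], self.workdir)

    def insert(self, identifier: str, xml: str) -> None:
        path = self._path(identifier)
        if path.exists():
            raise ValueError(f"record {identifier} already exists")
        path.write_text(xml, encoding="utf-8")
        self._commit(["add", path.name], f"Insert {identifier}")

    def update(self, identifier: str, xml: str) -> None:
        path = self._path(identifier)
        if not path.exists():
            raise KeyError(identifier)
        path.write_text(xml, encoding="utf-8")
        self._commit(["add", path.name], f"Update {identifier}")

    def delete(self, identifier: str) -> None:
        path = self._path(identifier)
        if not path.exists():
            raise KeyError(identifier)
        # `git rm` removes the file from the working tree and the index.
        self._commit(["rm", "--quiet", path.name], f"Delete {identifier}")
```

Passing a different `run` callable lets the store be exercised without a real repository. With the default, each transaction leaves one commit behind, so `git log` would double as the record's change history.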

The question then becomes how do we index and make the repository searchable.

Some options / further thinking out loud:

rclark commented 11 years ago

GeoNetwork and ESRI Geoportal both use Lucene for indexing, if I'm not mistaken. I think CouchDB has validity as its own backend for pycsw, but maybe not so much for this purpose.

Even more thinking out loud

tomkralidis commented 11 years ago

@rclark thanks for the info. Agreed, lightweight is a rule of pycsw.

Has anyone tried Whoosh (http://whoosh.ca)? From what I can see, it is a pure Python index/search library, and I think it would be a great fit. The only thing is that it doesn't do spatial. What would be really cool is for Whoosh to support Shapely (as optional spatial support, even if that part is not pure Python).
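To show the shape of "text index plus optional spatial support", here is a small standard-library sketch. It deliberately does not use Whoosh or Shapely: `TinyIndex` is a hypothetical stand-in for the text index, and a plain bounding-box intersection test stands in for real geometry operations.

```python
import re
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple

BBox = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def bbox_intersects(a: BBox, b: BBox) -> bool:
    """Return True if two axis-aligned bounding boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class TinyIndex:
    """Hypothetical inverted index with an optional bounding-box filter."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)
        self._bboxes: Dict[str, BBox] = {}

    def add(self, identifier: str, text: str, bbox: Optional[BBox] = None) -> None:
        # Index every lowercased word of the record text.
        for token in re.findall(r"\w+", text.lower()):
            self._postings[token].add(identifier)
        if bbox is not None:
            self._bboxes[identifier] = bbox

    def search(self, query: str, bbox: Optional[BBox] = None) -> Set[str]:
        tokens = re.findall(r"\w+", query.lower())
        if not tokens:
            return set()
        # Text step: records containing all query words.
        hits = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        if bbox is not None:
            # Spatial step: keep only hits whose bbox intersects the query bbox.
            hits = {i for i in hits
                    if i in self._bboxes and bbox_intersects(self._bboxes[i], bbox)}
        return hits


if __name__ == "__main__":
    index = TinyIndex()
    index.add("a", "Sea surface temperature", (-10, 40, 5, 60))
    index.add("b", "Land surface temperature", (100, -40, 150, -10))
    print(index.search("surface temperature"))
    print(index.search("surface", bbox=(0, 45, 10, 50)))
```

Keeping the spatial test as a post-filter on the text hits means the spatial part stays optional: records without a bounding box remain searchable by text alone.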

kalxas commented 2 years ago

Update: External git workflow is being used in ESA's Open Science Catalogue https://opensciencedata.esa.int/ with pycsw as the Catalogue backend.

Records are stored and manipulated on GitHub, and there is a hook that triggers pycsw harvesting from GitHub Pages to synchronize the records in the database.