OBOFoundry / OBOFoundry.github.io

Metadata and website for the Open Bio Ontologies Foundry Ontology Registry
http://obofoundry.org

Consider OSF or internet archive for storage or archive of ontologies #494

Open cmungall opened 7 years ago

cmungall commented 7 years ago

It would be good to archive all versioned PURLs, and also to offer a release solution that doesn't require large derived OWL files to be checked into GitHub.

OSF is one possibility: https://osf.io/. Storage is guaranteed for 50 years, and we are very much in line with their mission.

There is a nice Python API and CLI, osfclient, developed by Titus Brown's lab: http://osfclient.readthedocs.io/ (see also this blog post).
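For reference, a minimal sketch of the osfclient Python API as I understand the docs; the project id `abc12` and the file paths are placeholders, so check the method names against the osfclient documentation:

```python
from osfclient import OSF

# Connect anonymously (enough for reading public projects);
# the project id "abc12" is a placeholder.
osf = OSF()
project = osf.project("abc12")
storage = project.storage("osfstorage")

# List everything currently stored in the project.
for f in storage.files:
    print(f.path)

# Upload a new file (writing requires OSF(username=..., password=...)).
with open("imports/foo_imports.owl", "rb") as fp:
    storage.create_file("imports/foo_imports.owl", fp)
```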

Another option is https://git-annex.branchable.com/tips/Internet_Archive_via_S3/, @jhpoelen is a big fan of this approach

jhpoelen commented 7 years ago

What I like about https://git-annex.branchable.com is that you can manage multiple copies across all sorts of different storage: thumb drives, the Internet Archive, S3. I like the idea of not putting all my eggs in one basket and being able to (automatically) migrate data from one place to another.

cmungall commented 7 years ago

@jhpoelen - thanks for the tip - do you have any pointers to docs where you've used this?

cmungall commented 7 years ago

I've played a little with both. One requirement from an OBO point of view is that we don't want to have to map folders full of products to SHA-type URLs on a per-file basis.

This seems to be a current limitation of OSF. If I make a project that gets assigned an id like xyz12, and then add a file imports/foo_imports.owl, that file will get some unrelated id like qwrt5. Ideally we'd have a URL like xyz12/imports/foo_imports.owl; otherwise the PURL configuration gets complicated. I've written to ask them whether this is something they support.

I have played a bit with git-annex but am still learning; it seems as if each file gets assigned a SHA-type URL.

jhpoelen commented 7 years ago

As far as I understand, the hashes are used to store stuff on the blob systems (aka remotes), and git is used for names. So you can have a folder structure with symlinks in git and use git-annex to materialize (or inflate) symlinks locally from whatever source is available.
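To make that concrete, a rough sketch driving the git-annex CLI from Python; the remote options roughly follow the Internet Archive tip linked earlier, and the bucket name is made up:

```python
import subprocess

def annex(*args):
    """Thin illustrative wrapper around the git-annex CLI."""
    subprocess.run(["git", "annex", *args], check=True)

# One-time setup inside an existing git repository.
annex("init")

# A special remote on the Internet Archive via its S3 API, roughly per
# https://git-annex.branchable.com/tips/Internet_Archive_via_S3/ ;
# the bucket name here is illustrative.
annex("initremote", "archive-org", "type=S3",
      "host=s3.us.archive.org", "bucket=obo-archive-example",
      "encryption=none")

# `add` replaces the file with a symlink whose target embeds the content
# hash (the "key"); git tracks the symlink, remotes hold the blob.
annex("add", "imports/foo_imports.owl")
annex("copy", "imports/foo_imports.owl", "--to", "archive-org")

# In another clone, `get` materializes the symlink from whichever
# remote has a copy.
annex("get", "imports/foo_imports.owl")
```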

jhpoelen commented 7 years ago

The example at https://github.com/globalbioticinteractions/archive is a bit confusing, because the names of the symlinks also happen to be SHA-256 hashes. E.g., https://github.com/globalbioticinteractions/archive/tree/master/datasets/globalbioticinteractions/natural-history-museum-london-interactions-bank . The reason for doing this is that GloBI has to deal with URIs that return different results depending on when they are dereferenced.

alanruttenberg commented 7 years ago

@cmungall re: xyz12/imports/foo_imports.owl

I don't see that we need PURLs like that. All we need is for all imports to get dated PURLs during release. The source files are uploaded, and then all the PURLs are redirected to their SHA-based URLs. We don't need to rewrite the PURLs that are to be imported after the build. All we need is a mapping from the uploaded files to the OSF URLs. Or am I misunderstanding you?

The code I have for IAO release does this, including making a local dated copy of any undated imports.
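A toy sketch of the mapping idea above; the exact/replacement shape below only imitates the style of OBO PURL redirect configs and is not a verified config format, and both sides of the mapping are made-up examples:

```python
# Hypothetical mapping from released files to the OSF URLs
# they were uploaded to.
uploads = {
    "imports/foo_imports.owl": "https://osf.io/qwrt5/download",
}

# Emit one redirect rule per uploaded file for a dated release.
for path, osf_url in uploads.items():
    print(f"- exact: /myont/releases/2017-11-01/{path}")
    print(f"  replacement: {osf_url}")
```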

cmungall commented 7 years ago

The PURLs would be of the form $OBO/{ontid}/imports/foo_imports.owl. Each individual file is given a new SHA-type URL. The desire is that the native URLs would instead be of the form osf.io/{projectId}/imports/foo_imports.owl, to make the mapping more straightforward. But this is less important once I complete this PR: https://github.com/dib-lab/osf-cli/pull/119

david4096 commented 7 years ago

Please check out https://github.com/ipfs/ipfs.

You can use their naming system (IPNS) to maintain a single stable name that always points to the latest version of your ontology. A simple registry that lets people look up the OWL files at the given hashes would be a useful added service. Making this registry itself a file on IPFS makes a lot of sense to me, since most of the infrastructure for sharing OWL files could then be decentralized.
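A minimal sketch of that idea using the ipfshttpclient package, assuming a local IPFS daemon is running; treat the exact calls as assumptions to verify against the library docs:

```python
import ipfshttpclient

# Connect to a locally running IPFS daemon (default API address).
client = ipfshttpclient.connect()

# Adding the file yields a content hash (CID) that changes per release...
res = client.add("ontology.owl")
cid = res["Hash"]

# ...while publishing it under the node's IPNS name gives a single
# stable name that can be repointed at each new release.
client.name.publish(f"/ipfs/{cid}")
```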

I strongly encourage moving the ontology infrastructure into a decentralized or distributed architecture.

Two other technologies worth examining:

- http://datalad.org/ - uses git-annex to share scientific datasets
- https://github.com/datproject/dat - syndicates just the diffs among many sites

P.S. Last I checked, you can attach multiple binary files, each up to 2GB in size, when you make a GitHub release. Here's a repo that has a ~100MB file attached to a release: https://github.com/santacruzml/fall-17-scml-competition/releases (see also https://help.github.com/articles/editing-and-deleting-releases/).

nlharris commented 4 years ago

What is the status of this?

nlharris commented 4 years ago

See also #753

cmungall commented 3 years ago

We need to update this ticket with recent things we have learned:

Some ontologies are now using GitHub releases for larger files. This is easy to do in Python; see the sketch below.
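For example, a minimal sketch using requests against the GitHub REST API; the owner, repo, release id, and token are all placeholders:

```python
import requests

def upload_release_asset(owner, repo, release_id, path, token):
    """Attach a local file to an existing GitHub release as an asset."""
    url = f"https://uploads.github.com/repos/{owner}/{repo}/releases/{release_id}/assets"
    with open(path, "rb") as fh:
        resp = requests.post(
            url,
            params={"name": path.rsplit("/", 1)[-1]},
            headers={
                "Authorization": f"token {token}",
                "Content-Type": "application/octet-stream",
            },
            data=fh,
        )
    resp.raise_for_status()
    # The stable URL people can download the asset from.
    return resp.json()["browser_download_url"]
```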

@cthoyt has a handy Python script for using Zenodo: https://github.com/pyobo/pyobo/blob/master/src/pyobo/zenodo_client.py

But Zenodo may be best for archiving rather than for serving up the latest ontology live.

cthoyt commented 3 years ago

> We need to update this ticket with recent things we have learned:
>
> Some ontologies are now using GitHub releases for larger files. This is easy to do in Python.
>
> @cthoyt has a handy Python script for using Zenodo: https://github.com/pyobo/pyobo/blob/master/src/pyobo/zenodo_client.py
>
> But Zenodo may be best for archiving rather than for serving up the latest ontology live.

In fact, I just made it into its own package today. See: https://github.com/cthoyt/zenodo-client/

I will have a few examples of it being used in the wild soon. ~~It also needs a feature that lets you deal with the situation where a deposition ID is not yet available, since it's a bit clunky there (follow at https://github.com/cthoyt/zenodo-client/issues/1)~~ There is now a `zenodo_client.ensure` function documented in the README that makes the configuration quite straightforward.
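For reference, usage roughly following the zenodo-client README; the metadata values and paths are placeholders, and the exact signature should be checked against the README:

```python
from zenodo_client import Creator, Metadata, ensure_zenodo

# Describe the deposition; all values here are placeholders.
data = Metadata(
    title="Example Ontology Release",
    upload_type="dataset",
    description="Automated upload of ontology release files.",
    creators=[Creator(name="Doe, Jane")],
)

# On the first run this creates the deposition and remembers its ID
# locally; later runs publish new versions instead of new records.
response = ensure_zenodo(
    key="example-ontology",           # local key identifying the deposition
    data=data,
    paths=["releases/ontology.owl"],
    sandbox=True,                     # use sandbox.zenodo.org while testing
)
print(response.json()["links"]["html"])
```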

matentzn commented 2 years ago

We used to have an issue with Zenodo: there was no way to refer to a specific file in the latest release. Is that still the case?

cthoyt commented 2 years ago

Correct, I still don't know how to link to a download for the latest record. To demonstrate with PyStow:

Version 0.4.0

- DOI: https://doi.org/10.5281/zenodo.6056700
- Page: https://zenodo.org/record/6056700
- Download: https://zenodo.org/record/6056700/files/cthoyt/pystow-v0.4.0.zip?download=1

Latest

- DOI: https://doi.org/10.5281/zenodo.4304449
- Page: https://zenodo.org/record/4304449
- Download (does not work, just redirects to the page): https://zenodo.org/record/4304449/files/cthoyt/pystow-v0.4.0.zip?download=1
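One possible workaround, assuming (unverified here) that Zenodo's REST API resolves a concept record ID to the newest version's record:

```python
import requests

# Concept record ID for PyStow, taken from the "Latest" DOI above.
# Assumption: the records API resolves a concept ID to the newest
# version's record rather than just redirecting to an HTML page.
CONCEPT_RECID = "4304449"

record = requests.get(f"https://zenodo.org/api/records/{CONCEPT_RECID}").json()
for f in record.get("files", []):
    # Each entry carries the filename and a direct download link.
    print(f["key"], f["links"]["self"])
```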

matentzn commented 2 years ago

Yeah, and @kltm kept reminding us that an archive is an archive and not a file server :P So using Zenodo as a file server is sort of abuse anyway.

cmungall commented 2 years ago

Relevant: https://dvc.org/doc/start/data-and-model-versioning

h/t @realmarcin

kltm commented 2 years ago

Self-tagging @kltm

jhpoelen commented 2 years ago

Have you ever considered using a more flexible content-based addressing approach like https://github.com/bio-guoda/preston? Name/location-based, single-parent structures for organizing content seem at odds with the rich semantics supported by OBO Foundry's member ontologies.
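To illustrate the underlying idea, a toy sketch of content-based addressing in general (not Preston's actual CLI or API, though the `hash://sha256/` identifier style follows Preston's convention):

```python
import hashlib
import shutil
from pathlib import Path

def store_by_content(src: str, store: str = "data") -> str:
    """Copy a file into a store keyed by the SHA-256 of its bytes.

    The returned identifier depends only on the content, so the same
    bytes always get the same name, regardless of where they came from
    or when they were retrieved.
    """
    digest = hashlib.sha256(Path(src).read_bytes()).hexdigest()
    dest = Path(store) / digest[:2] / digest
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dest)
    return f"hash://sha256/{digest}"

# e.g. store_by_content("ontology.owl") -> "hash://sha256/ab12..."
```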