cmungall opened this issue 7 years ago
What I like about https://git-annex.branchable.com is that you can manage multiple copies across all sorts of different storage - thumbdrives, internet archive, s3. I like the idea of not putting all my eggs into one basket and being able to (automatically) migrate data from one place to another.
@jhpoelen - thanks for the tip - do you have any pointers to docs where you've used this?
I've played a little with both. One requirement from an OBO POV is that we don't want to have to map folders full of products to SHA-type URLs on a per-file basis.
This seems to be a current limitation of OSF. If I make a project that gets assigned an id xyz12, and then add a file imports/foo_imports.owl, that file will get some id like qwrt5. Ideally we'd have a URL like xyz12/imports/foo_imports.owl; otherwise it gets complicated with PURL configuration. I've written to ask them if this is something they support.
I have played a bit with git-annex but am still learning; it seems as if each file gets assigned a SHA-type URL.
As far as I understand, the hashes are used to store content on the blob systems (aka remotes) and git is used for names. So you can have a folder structure with symlinks in git and use git-annex to materialize (or inflate) the symlinks locally from whatever source the content is available on (sketched below).
The example at https://github.com/globalbioticinteractions/archive is a bit confusing, because the names of the symlinks happen to be SHA-256 hashes as well. E.g., https://github.com/globalbioticinteractions/archive/tree/master/datasets/globalbioticinteractions/natural-history-museum-london-interactions-bank . The reason for doing this is that GloBI has to deal with URIs that return different results depending on when they are dereferenced.
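For readers unfamiliar with the mechanics, here is a rough sketch of that workflow, shelling out to the git-annex CLI from Python. The repository description, file names, and S3 remote settings are placeholders rather than a tested configuration (the S3 special remote also expects AWS credentials in the environment):

```python
import subprocess

def run(*cmd):
    """Thin wrapper: run a command and raise if it fails."""
    subprocess.run(cmd, check=True)

# Inside an existing git repository, turn it into an annexed repository.
run("git", "annex", "init", "ontology archive")

# Adding a large file moves its content into .git/annex/objects under a
# hash-based key and checks a symlink (the human-readable name) into git
# in place of the blob itself.
run("git", "annex", "add", "imports/foo_imports.owl")
run("git", "commit", "-m", "add import module")

# Register an extra storage location (the S3 settings here are placeholders).
run("git", "annex", "initremote", "archive-s3",
    "type=S3", "encryption=none", "bucket=my-ontology-bucket")

# Copy the hash-addressed content to that remote; the symlink tracked by git
# stays exactly the same.
run("git", "annex", "copy", "imports/foo_imports.owl", "--to=archive-s3")

# In another clone, `git annex get imports/foo_imports.owl` would then
# materialize the symlink from whichever remote currently has the content.
```

Git only ever versions the symlink under its human-readable path; the blob behind it can live on, and move between, any number of remotes.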
@cmungall re: xyz12/imports/foo_imports.owl
I don't see that we need PURLs like that. All we need is that during a release all imports get dated PURLs. The source files are uploaded, and then all the PURLs are redirected to their SHA-based URLs. We don't need to rewrite the PURLs that are to be imported after the build. All we need is a mapping from the files uploaded to the OSF URLs. Or am I misunderstanding you?
The code I have for IAO release does this, including making a local dated copy of any undated imports.
The PURLs would be of the form $OBO/{ontid}/imports/foo_imports.owl. Each individual file is given a new SHA-type URL. The desire is that their native URLs would be of the form osf.io/{projectId}/imports/foo_imports.owl to make mapping more straightforward. But this is less important once I complete this PR: https://github.com/dib-lab/osf-cli/pull/119
Please check out https://github.com/ipfs/ipfs.
You can use their naming system to maintain a single hash that will always point to the latest version of your ontology (a sketch follows below). A simple registry that lets people list the OWL files available at the given hashes would be an added service to make this useful. Making this registry itself a file on IPFS makes a lot of sense to me, since you could then have most of the infrastructure for sharing OWL files decentralized.
I strongly encourage moving the ontology infrastructure into a decentralized or distributed architecture.
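As a rough illustration of that naming idea, here is a minimal sketch that shells out to the ipfs CLI (the file name is a placeholder and a running IPFS daemon is assumed):

```python
import subprocess

def ipfs(*args):
    """Run an ipfs CLI command and return its stdout."""
    result = subprocess.run(["ipfs", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()

# Adding a file returns its content identifier (CID); identical bytes always
# map to the same CID, so every release has a distinct, verifiable address.
cid = ipfs("add", "-Q", "ont-full.owl")  # -Q prints only the final CID
print("immutable address:", f"/ipfs/{cid}")

# IPNS is the mutable layer: publishing updates the name owned by this node's
# key so that one stable name always resolves to the newest CID.
print(ipfs("name", "publish", f"/ipfs/{cid}"))
```

The CID gives an immutable, content-verifiable address for each release, while the published IPNS name is the single stable pointer that can be updated to the newest CID.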
Two other technologies worth examining:
http://datalad.org/ - uses git annex to share scientific datasets
https://github.com/datproject/dat: syndicate just the diffs amongst many sites
P.S. Last I checked, you can add multiple binary files, up to 2GB in size, when you make a GitHub release. Here's a repo that has a ~100MB file attached to a release: https://github.com/santacruzml/fall-17-scml-competition/releases (see also https://help.github.com/articles/editing-and-deleting-releases/).
What is the status of this?
See also #753
We need to update this ticket with recent things we have learned:
Some ontologies are now using GitHub releases for larger files. This is easy to do in Python (see the sketch below).
@cthoyt has a handy Python script for using Zenodo: https://github.com/pyobo/pyobo/blob/master/src/pyobo/zenodo_client.py
But Zenodo may be best for archiving rather than live serving of the latest ontology.
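As a rough sketch of the "easy to do in Python" point, here is one way to upload a build artifact as a release asset via the GitHub REST API with requests. The repository, tag, and file names are placeholders:

```python
import os
import requests

# Placeholders: repository coordinates and the tag of an existing release
OWNER, REPO, TAG = "obofoundry", "example-ontology", "v2023-01-01"
HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

# Look up the release by its tag to get the numeric release id
release = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/releases/tags/{TAG}",
    headers=HEADERS,
)
release.raise_for_status()
release_id = release.json()["id"]

# Stream the (potentially large) OWL file up as a release asset
with open("ont-full.owl", "rb") as f:
    upload = requests.post(
        f"https://uploads.github.com/repos/{OWNER}/{REPO}/releases/{release_id}/assets",
        params={"name": "ont-full.owl"},
        headers={**HEADERS, "Content-Type": "application/rdf+xml"},
        data=f,
    )
upload.raise_for_status()
print(upload.json()["browser_download_url"])
```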
In fact I just made it into its own package today. See: https://github.com/cthoyt/zenodo-client/
I will have a few examples of it being used in the wild soon. ~~It also needs a feature that lets you deal with the situation where a deposition ID is not yet available, since it's a bit clunky there (follow at https://github.com/cthoyt/zenodo-client/issues/1)~~ There is now a zenodo_client.ensure function documented in the README that makes the configuration quite straightforward.
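To give a feel for it, here is a minimal sketch of how that ensure call might look; the parameter names and metadata shape below are assumptions from memory, so the zenodo-client README is the authoritative reference:

```python
import zenodo_client

# Assumed call shape -- check the zenodo-client README for the real signature.
# The key is a local cache identifier for the deposition, `data` carries the
# Zenodo metadata, and `paths` lists the files to upload as a new version.
res = zenodo_client.ensure(
    key="example-ontology",  # hypothetical cache key, reused across releases
    data={
        "title": "Example ontology release",
        "upload_type": "dataset",
        "description": "Release artifacts for an example ontology",
        "creators": [{"name": "Doe, Jane"}],
    },
    paths=["ont-full.owl"],  # hypothetical file to deposit
)
print(res.json()["links"]["html"])  # link to the resulting Zenodo record
```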
We used to have this issue with Zenodo that there is no way to refer to a specific file in the latest release. Is that still the case?
Correct, I still don't know how to link to a download for the latest record. To demonstrate with PyStow:

DOI: https://doi.org/10.5281/zenodo.6056700
Page: https://zenodo.org/record/6056700
Download: https://zenodo.org/record/6056700/files/cthoyt/pystow-v0.4.0.zip?download=1

DOI: https://doi.org/10.5281/zenodo.4304449
Page: https://zenodo.org/record/4304449
Download (does not work, just redirects to the page): https://zenodo.org/record/4304449/files/cthoyt/pystow-v0.4.0.zip?download=1
Yeah, and @kltm kept reminding us that an archive is an archive and not a fileserver :P So using Zenodo as a fileserver is sort of an abuse anyway.
relevant https://dvc.org/doc/start/data-and-model-versioning
h/t @realmarcin
Self-tagging @kltm
Have you ever considered using a more flexible content-based addressing approach like https://github.com/bio-guoda/preston? Name/location-based single-parent structures for organizing content seem at odds with the rich semantics supported by OBO Foundry's member ontologies.
It would be good to archive all versioned PURLs, and also to offer a solution for people to release their files that doesn't require checking large derived OWL files into GitHub.
OSF is one possibility: https://osf.io/ Storage is guaranteed for 50 years, and we are very much in line with their mission.
There is a nice python API and CLI developed by Titus Brown's lab: http://osfclient.readthedocs.io/ - see also this blog post
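For orientation, here is a minimal sketch of what an upload with the osfclient Python API might look like; the project id follows the xyz12 example above, and the token handling and method names are assumptions, so the osfclient docs are the reference:

```python
import os

from osfclient import OSF  # pip install osfclient

# Authentication details here are assumptions; osfclient also supports
# username/password and an ~/.osfcli.config file.
osf = OSF(token=os.environ["OSF_TOKEN"])  # hypothetical env var for a personal token

project = osf.project("xyz12")            # project id as in the example above
storage = project.storage("osfstorage")   # default storage provider

# Upload a build artifact under a path-like name inside the project
with open("imports/foo_imports.owl", "rb") as fp:
    storage.create_file("imports/foo_imports.owl", fp)

# List what the project currently holds
for f in storage.files:
    print(f.path)
```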
Another option is https://git-annex.branchable.com/tips/Internet_Archive_via_S3/; @jhpoelen is a big fan of this approach.