At least three possible strategies here:
Note that option 3 assumes a `tsv` registry is available in the first place, while the example above explicitly sets `dataone` as the only registry.
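For illustration, a minimal sketch of that setup, assuming `resolve()` accepts a `registries` argument; the identifier and registry URL here are placeholders, not values from this thread:

```r
library(contentid)

# Resolving against a single configured registry, so no local .tsv registry
# is consulted at all. Identifier and registry URL are illustrative only.
resolve("hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37",
        registries = "https://cn.dataone.org")
```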
symlinks done in #65
Clever hack! The symlink approach would work, but doesn't seem as straightforward as non-filesystem-based linking.
Why not create an explicit link table that keeps a sorted list of content hashes grouped by content on each line? This way, you can easily search external repos across the various hashes when requested.
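A minimal sketch of what such a link table could look like in R. The table format and the sha256 value are hypothetical; the md5 value is the one visible in the minio URL below:

```r
# Hypothetical link table: one row per content object, with all known hashes
# for that object grouped together, so any hash works as the lookup key.
link_table <- data.frame(
  md5    = "2ac33190eab5a5c739bad29754532d76",
  sha256 = "9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"
)

# Cross-hash lookup is then a simple table query:
subset(link_table, md5 == "2ac33190eab5a5c739bad29754532d76")$sha256
```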
Yeah, great question, that was my first preference too, and may still be the best approach. But this gets back to how we think of the extension model -- creating an explicit file means using some kind of local registry for where to store that information. We could just assume that is a `.tsv` file or bagit manifest in the `content_dir()` directory, or we could use the local `.tsv` or `lmdb` registry. The trouble is just that -- our extendable model means we have a multiplicity of places where we might look, and because they are configurable, we don't know which ones may or may not be available on any given call.
Maybe it would still be better to assume the store follows the bagit format and maintain `manifest-sha256.txt`, `manifest-md5.txt`, etc. (I think the bagit standard permits multiple manifests with different hashes?). Actually, this is what we were already doing with the store back when we used only sha-256. But in an R environment it adds possibly significant overhead for large stores, since you then need to parse the full manifest into R to do a query for a given id. Under the symlink approach, you can formulate the hash into the correct filepath immediately, so it's much faster and independent of the store size. So to me, the symlink approach felt simpler and more efficient once I got into implementation...
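To make the efficiency contrast concrete, a rough sketch; the helper names are hypothetical, and the path layout mirrors the store path visible in the minio URL below:

```r
# Symlink/convention approach: the path is a pure function of the hash
# (<algo>/<first 2 chars>/<next 2 chars>/<full hash>), so lookup cost is
# independent of store size.
hash_to_path <- function(hash, algo = "sha256", dir = contentid::content_dir()) {
  file.path(dir, algo, substr(hash, 1, 2), substr(hash, 3, 4), hash)
}

# Manifest approach: every query first reads the full manifest into R.
# Bagit manifest lines take the form "<checksum>  <path>".
manifest_lookup <- function(hash, manifest = "manifest-sha256.txt") {
  lines <- readLines(manifest)
  hit <- grep(paste0("^", hash, "\\s"), lines, value = TRUE)
  sub("^\\S+\\s+", "", hit)  # drop the checksum column, return the path
}
```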
I enjoy simple and efficient approaches also. And I was just wondering about mobility -- moving the store around without losing the various ways to access the data. The current solution will work until it doesn't, and even then it's not a big deal, 'cause the hashes can be recalculated (at the expense of a warming planet ;)). Thanks for replying.
Does minio / amazon s3 support symlinks?
yeah, that's a good point, looks like no, at least on minio: https://minio.thelio.carlboettiger.info/minio/content-store/md5/2a/c3/2ac33190eab5a5c739bad29754532d76
Of course you could register the minio sha256 URL instead, and then you'd still be able to query by either hash.
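For instance, a hypothetical sketch; the URL and hash are placeholders that follow the store layout above:

```r
library(contentid)

# Registering the sha256-addressed URL of the same object means a later
# resolve() by the sha256 identifier succeeds even though minio won't
# follow the md5 symlink. URL and hash are illustrative placeholders.
register("https://minio.thelio.carlboettiger.info/content-store/sha256/94/12/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37")
```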
yeah, store mobility is a good question. Exposing the content-dir as an S3 bucket on minio is obviously one way to do that, but it's just convention (i.e. it exploits the same assumption as the local store: that we can construct the access path from the hash without parsing a manifest). Creating a bagit is probably the best generic advice for store mobility, with the caveat that an access tool really ought to parse the manifest(s) to get the paths.
Meanwhile, if I just tar up my `CONTENTID_HOME` dir, send it to you, and you untar it at your `CONTENTID_HOME` location, I think everything should work as before from the `contentid` package perspective, right? i.e. tar/untar preserves symlinks fine, I think?
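Something like the following, assuming tar stores symlinks as symlinks rather than following them (GNU tar does by default, and I believe R's `utils::tar()` does on Unix, but that's worth verifying on the receiving platform):

```r
# Sender: archive the whole store, symlinks included.
home <- Sys.getenv("CONTENTID_HOME")
utils::tar("content-store.tar.gz", files = home, compression = "gzip")

# Receiver: unpack at their own CONTENTID_HOME (path is illustrative).
utils::untar("content-store.tar.gz", exdir = "/path/to/their/CONTENTID_HOME")
```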
Not to justify the current implementation, but note that git-annex uses symlinks too... in combination with git.
I'm sure we'll come up with another solution if the current approach doesn't pan out. Especially because the hashes can be recalculated. Content-based identifiers to the rescue!
yup, precisely. I do love the fact that as long as I still have the data files, I can never truly lose/misplace/botch the record of which identifier goes to which data since we can just recompute hashes.
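In `contentid` terms, the record is always recoverable from the bytes themselves (the file name here is a hypothetical example):

```r
library(contentid)

# Recompute the content-based identifier directly from the file's bytes;
# the result has the form "hash://sha256/<sha256 of the contents>".
id <- content_id("mydata.csv")
```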
reported in #65