cboettig / contentid

:package: R package for working with Content Identifiers
http://cboettig.github.io/contentid

`store` assumes sha256 hashes, causes resolve w/ store to fail #66

Closed: cboettig closed this issue 3 years ago

cboettig commented 3 years ago
```r
# Fails when store = TRUE: the local store assumes sha256 names,
# so resolving an md5 identifier and storing the result errors out.
tmp <- contentid::resolve("hash://md5/e27c99a7f701dab97b7d09c467acf468",
                          registries = "https://cn.dataone.org",
                          store = TRUE)
```

Reported in #65.

cboettig commented 3 years ago

At least three possible strategies here:

  1. Store the data in the content store with the file named by whatever hash was used in the `resolve()` call. This would still be pretty hokey, obviously.
  2. Store the data under its sha256 name, but also create symlinks for the other hashes (sketched after this list; may have issues on Windows?).
  3. Store the data under its sha256 name, register the location in a local tsv registry, and stop using the store as an implicit registry. This is probably best, but possibly fragile?

Note that option 3 assumes a tsv registry is available in the first place, while the example above explicitly sets dataone as the only registry.
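A minimal sketch of option 2, assuming the store keeps blobs at `<dir>/<algo>/<xx>/<yy>/<hash>` (the two-level subdirectory layout); the helper names here are hypothetical, not contentid's API:

```r
# Hypothetical helper: map a hash to its path in the local store,
# assuming the two-level layout <dir>/<algo>/<xx>/<yy>/<hash>.
store_path <- function(hash, algo = "sha256",
                       dir = Sys.getenv("CONTENTID_HOME")) {
  file.path(dir, algo, substr(hash, 1, 2), substr(hash, 3, 4), hash)
}

# Option 2: the blob is stored once under its sha256 name; an md5
# lookup is satisfied by a symlink pointing at the sha256 blob.
link_md5 <- function(md5, sha256, dir = Sys.getenv("CONTENTID_HOME")) {
  target <- store_path(sha256, "sha256", dir)
  link   <- store_path(md5, "md5", dir)
  dir.create(dirname(link), recursive = TRUE, showWarnings = FALSE)
  file.symlink(target, link)  # may return FALSE on Windows
}
```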

cboettig commented 3 years ago

Symlinks done in #65.

jhpoelen commented 3 years ago

Clever hack! The symlink approach would work, but doesn't seem as straightforward as linking that isn't tied to the file system.

Why not create an explicit link table that keeps a sorted list of content hashes, one line per content item? That way you can easily search external repos across the various hashes when requested.
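Roughly something like this (an illustrative sketch; the file name and helpers are made up, not part of contentid):

```r
# One tab-separated line of sorted hashes per content item.
write_link <- function(hashes, table = "hash-links.tsv") {
  cat(paste(sort(hashes), collapse = "\t"), "\n",
      file = table, append = TRUE, sep = "")
}

# Find all known aliases for a hash by scanning the table.
lookup_aliases <- function(hash, table = "hash-links.tsv") {
  rows <- strsplit(readLines(table), "\t")
  hits <- Filter(function(r) hash %in% r, rows)
  unique(unlist(hits))
}
```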

cboettig commented 3 years ago

Yeah, great question; that was my first preference too, and it may still be the best approach. But this gets back to how we think of the extension model: creating an explicit file means using some kind of local registry to decide where to store that information. We could just assume it is a .tsv file or bagit manifest in the content_dir() directory, or we could use the local .tsv or LMDB registry. The trouble is just that: our extendable model means we have a multiplicity of places where we might look, and because they are configurable, we don't know which ones may or may not be available on any given call.

Maybe it would still be better to assume the store follows the bagit format and maintain manifest-sha256.txt, manifest-md5.txt, etc. (I think the bagit standard permits multiple manifests with different hashes?) That is actually what we were already doing with the store when we used only sha-256. But in an R environment it adds possibly significant overhead for large stores, since you then need to parse the full manifest into R to query for a given id. Under the symlink approach, you can formulate the hash into the correct filepath immediately, so lookup is much faster and independent of the store size. So to me, the symlink approach felt simpler and more efficient once I got into the implementation...
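To illustrate the difference (a sketch under the same two-level layout assumption as above; the helpers are made up):

```r
# Symlink layout: the filepath is a pure function of the hash, so a
# lookup is a single file-system stat, independent of store size.
path_from_hash <- function(hash, algo = "sha256",
                           dir = Sys.getenv("CONTENTID_HOME")) {
  file.path(dir, algo, substr(hash, 1, 2), substr(hash, 3, 4), hash)
}

# Bagit manifest: every query parses the full manifest
# ("<hash> <path>" per line), so lookup cost grows with the store.
path_from_manifest <- function(hash, manifest = "manifest-sha256.txt") {
  lines <- readLines(manifest)
  hit <- grep(paste0("^", hash, "\\s"), lines, value = TRUE)
  if (length(hit) == 0) return(NULL)
  sub("^\\S+\\s+", "", hit[[1]])
}
```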

jhpoelen commented 3 years ago

I enjoy simple and efficient approaches too. And I was just wondering about mobility: moving the store around without losing the various ways to access the data. The current solution will work until it doesn't, and even then it's not a big deal, 'cause the hashes can be recalculated (at the expense of a warming planet ;)). Thanks for replying.

jhpoelen commented 3 years ago

Does minio / amazon s3 support symlinks?

cboettig commented 3 years ago

> Does minio / amazon s3 support symlinks?

Yeah, that's a good point. Looks like no, at least on minio: https://minio.thelio.carlboettiger.info/minio/content-store/md5/2a/c3/2ac33190eab5a5c739bad29754532d76

Of course, you could register the minio sha256 URL instead, and then you'd still be able to query by either hash. Something like the sketch below.
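For instance (a hedged sketch; the URL is a placeholder for the object's actual sha256 path, and this simply records the URL in a local registry, from which the content's hashes can then be resolved):

```r
# Placeholder URL: substitute the object's real sha256 path
# in the minio content store.
url <- "https://minio.thelio.carlboettiger.info/content-store/sha256/..."
contentid::register(url, registries = contentid::default_registries())
```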

Yeah, store mobility is a good question. Exposing the content dir as an S3 bucket on minio is obviously one way to do that, but it's just convention (i.e. it exploits the same assumption as the local store: that we can construct the access path from the hash without parsing a manifest). Creating a bagit is probably the best generic advice for store mobility, with the caveat that the access tool really ought to parse the manifest(s) to get the paths.

Meanwhile, if I just tar up my CONTENTID_HOME dir and send it to you, and you untar it at your CONTENTID_HOME location, I think everything should work as before from the contentid package's perspective, right? i.e. tar/untar preserves symlinks fine, I think?
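Something like this, I'd think (a sketch with base R's tar wrappers; forcing the system tar via `tar = "tar"` since symlink handling depends on the tar implementation):

```r
# Pack the store relative to its root so it unpacks cleanly elsewhere;
# the system tar preserves symlinks by default.
store   <- contentid::content_dir()
tarball <- file.path(tempdir(), "content-store.tar")
old <- setwd(store)
utils::tar(tarball, files = ".", tar = "tar")
setwd(old)

# Receiving end: unpack into the local CONTENTID_HOME.
utils::untar(tarball, exdir = Sys.getenv("CONTENTID_HOME"))
```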

jhpoelen commented 3 years ago

Not to justify the current implementation, but note that git-annex uses symlinks too... in combination with git.

I'm sure we'll come up with another solution if the current approach doesn't pan out, especially since the hashes can be recalculated. Content-based identifiers to the rescue!

cboettig commented 3 years ago

Yup, precisely. I do love the fact that, as long as I still have the data files, I can never truly lose/misplace/botch the record of which identifier goes with which data, since we can just recompute the hashes.