cboettig / contentid

:package: R package for working with Content Identifiers
http://cboettig.github.io/contentid

Consider support for aliases #79

Closed cboettig closed 1 year ago

cboettig commented 3 years ago

A major limitation in the current model is that many users are reluctant to deploy long hashes in code:

vostok_id <- "hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"
vostok <- resolve(vostok_id, store = TRUE)

looks rather cumbersome. Assigning aliases could work around this. This is not dissimilar to the use of aliases in pins or storr, but in our case the alias does not become the primary key for the data; the alias is merely a shorthand for the hash.
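For example, a hypothetical interface (alias() here is an assumed helper, not part of the current API) might let users write:

alias("vostok", "hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37")
vostok <- resolve("vostok", store = TRUE)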

Issues:

  1. This creates a greater risk that code would not be portable if the alias file becomes disassociated from the code script. Maybe it would help to have the alias file location default to the current working directory rather than the package data directory (e.g. ~/.local/share/R/contentid), though users may want to use aliases across projects.
  2. Format for aliases:
    • a simple tsv of alias, hash might be convenient (see the sketch after this list)
    • but aliases could potentially be built into a more metadata-rich index (e.g. a json-ld file like we generate with prov, where the alias is any schema:name). This may have much added utility in figuring out what's what, and open the door to using other file metadata (filename, format, description, author, etc.) as a mechanism to resolve hashes. The downside is that parsing overhead may degrade performance, and greater implementation complexity leaves more room for errors.
  3. Sharing alias files. Ideally, a script would resolve the alias list using a content id for the alias file itself. This ensures that the aliases have not been altered to point at different content (though of course there are use cases where providing an updated set of aliases would be desired -- moving a package to access new releases of the data, etc.). This adds a challenge, though: the alias file would need to be distributed in a way that gives access not only to the 'latest' alias file but to any particular version of it. ('Chaining' the alias files rather than appending directly could resolve this...)
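As a concrete sketch of the tsv option: alias_add() and resolve_alias() below are hypothetical helpers, not part of the contentid API, and the format is just a two-column, tab-separated aliases.tsv in the working directory.

# append one alias -> hash mapping to the table
alias_add <- function(alias, id, file = "aliases.tsv") {
  write(paste(alias, id, sep = "\t"), file = file, append = TRUE)
}

# look up the hash for an alias, then resolve it as usual
resolve_alias <- function(alias, file = "aliases.tsv", ...) {
  tbl <- read.table(file, sep = "\t", col.names = c("alias", "id"))
  contentid::resolve(tbl$id[match(alias, tbl$alias)], ...)
}

alias_add("vostok", "hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37")
vostok <- resolve_alias("vostok", store = TRUE)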

Inspired by preston https://github.com/bio-guoda/preston/issues/135 cc @jhpoelen

jhpoelen commented 3 years ago

:+1: Nice!

I was wondering whether the term "nick" or "nickname" might be a little friendlier than the more technical "alias".

from http://xmlns.com/foaf/spec/#term_nick

Property: foaf:nick nickname - A short informal nickname characterising an agent (includes login identifiers, IRC and other chat nicknames).

jhpoelen commented 3 years ago

fyi @mielliott

noamross commented 3 years ago

I'm pretty concerned about (1). Storing the alias-hash mapping outside of the script or project breaks reproducibility. Saving in the local working directory is OK. I would try to think of a file organization convention that is transparent and could work with hand-editing and a manually coded workflow, so it's obvious to most users what is going on, and then just think of a convenience function on top of that.

I'm actually curious about the premise. Have users expressed this to you? Yes, you want to have readable names for the files, but how is that different from just having a few bird_migration_data <- "hash://...." calls at the top of a script? The hash needs to go somewhere.

cboettig commented 3 years ago

@noamross yeah, me too. some thoughts:

First, note we could completely alleviate that risk if the alias file itself is referenced by its content identifier & not a local path. The obvious downside there is that the user is now responsible for making their alias file discoverable somehow. Tossing it in your GitHub repo and triggering a Software Heritage snapshot is probably the easiest option, but probably still unattractive to users. Which is to say: yeah, for lightweight use at least, I think I agree that some calls like bird_migration_data <- "hash://...." are probably best.

Just to motivate the discussion though: the alias sheet is basically the same concept as the 'data manifest' we discussed earlier. Consider, e.g., the case of an R package like taxalight, which accesses external data sources via contentid::resolve():

https://github.com/cboettig/taxalight/blob/e80a186693984c8d1d9c6e05f8d4d31f2d9399ac/R/tl_import.R#L71-L75

The literal hashes are not embedded in the code file that actually calls contentid::resolve(), but are instead derived from a manifest: https://github.com/boettiger-lab/taxadb-cache/blob/master/prov.json. We still benefit from some of the hash-id features -- no re-downloading if we already have a copy, potential discovery of an alternative content host if a registered URL rots, etc. -- but we also have the ability to design against an interface without hardcoded hashes. An alias sheet is potentially just a lighter-weight version of said manifest (which may or may not be a good thing).
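Roughly, the lookup amounts to something like the following sketch. This is a simplification: it assumes schema.org-flavored JSON-LD with name and id fields on each entry, and a hypothetical entry name; the actual structure of prov.json may differ.

library(jsonlite)

# read the manifest and look up a content id by a human-friendly name
manifest <- read_json("prov.json", simplifyVector = TRUE)
graph <- manifest[["@graph"]]

id <- graph$id[graph$name == "dwc_taxon.tsv"][1]   # hypothetical entry name
path <- contentid::resolve(id, store = TRUE)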

@jhpoelen I'm ok with foaf:nick, though I think using the property might imply things about the rdf:type of the object that aren't true (i.e. foaf:nick is a property of a foaf:Person). I think https://schema.org/alternateName might be more appropriate (or schema:name, though really that doesn't imply a nice id-type string). For now I'm comfortable with alias() as the CLI function / verb for creating these things, like you have in preston. I feel like alias can be read as a verb, while a command like preston nick hash://.... sounds more confusing.

jhpoelen commented 3 years ago

I am continuing to have fun experimenting with aliases -- see this example of creating a (versioned) alias pointing to a part of a file, in this case a bee name: https://github.com/bio-guoda/preston/issues/135#issuecomment-931617840 .

Also,

if the alias file itself is referenced by its content identifier & not a local path.

Yes! In Preston, the alias is automatically added to a new version of the publication. Because preston alias beename.txt only resolves aliases within its content universe (i.e. linked provenance logs / manifest), the alias is well-defined for future use.

https://schema.org/alternateName

I like your suggestion and agree that preston nick beename.txt looks a bit funny ; )

This makes me wonder - aren't URLs and filenames just aliases for specific content?

Why treat a new name (e.g., file:beename.txt) any differently than https://example.org/beename.txt or file:///my/path/beename.txt ?

In the end, the difference is the process that describes the relation between some name (or alias) and some specific content. For downloads, this is a whole chain of actors (e.g., web server, DNS, firewall, proxy), whereas an explicit alias might only depend on some offline process or individual actor. In both cases, the end result is a statement saying: this name is associated with that content (modeled as a process activity in the provenance log).

cboettig commented 3 years ago

@jhpoelen but the problem is that both names and URLs are sometimes aliases for different "versions" of the "same" content (aka different content!), but other times they are aliases for static, unchanging content. Our use of filename and URL semantics is invariably imprecise on this point.

Because preston alias beename.txt only resolves aliases within it's content universe (i.e. linked provenance logs / manifest), ...

I think this makes sense, but sometimes I'm a bit unclear on how that universe is defined. E.g., if I run the command on a different computer, or put preston cat birds in a shell script I post on GitHub, can someone else expect that to work for them too (without first re-generating the alias)? I.e., we still need a mechanism to publish/distribute the linked provenance log itself, right?

jhpoelen commented 3 years ago

but the problem is that both names and URLs are sometimes aliases for different "versions" of the "same" content (aka different content!), but other times they are aliases for static, unchanging content. Our use of filename and URL semantics is invariably imprecise on this point.

Yes, aliases are names, and the meaning of a name lies in its relation to the context in which it exists. In the preston implementation, this context is the provenance log, allowing non-unique aliases to exist explicitly in well-defined statements in the content universe (e.g., at position (a time proxy) X in prov log Y, alias Z points to hash A, where A and Z are content ids).

E.g., if I run the command on a different computer, or put preston cat birds in a shell script I post on GitHub, can someone else expect that to work for them too (without first re-generating the alias)? I.e., we still need a mechanism to publish/distribute the linked provenance log itself, right?

You can publish / distribute a Preston dataset by copying the data/ folder, similar to how you can copy a git repository by copying the .git folder. Cloning and appending of Preston data packages are taken care of by Preston, but can also be implemented elsewhere.

A (remote) preston "push" is not yet implemented, because this can be done with existing copy tools (e.g., cp, scp, rsync, git, and the ubiquitous upload button). A local preston push can be implemented using the preston copyTo command: this copies content from one preston repo (which can be remote) to another, local, repository.

A preston "pull" has been implemented in the preston clone [some endpoint] and preston pull [some endpoint] commands. Note that a preston clone is nothing more than running preston history --remote repo1,repo2,..., then preston ls --remote repo1,repo2,..., and finally preston verify --remote repo1,repo2,.... The first retrieves the linked list of provenance logs, the second gets the associated provenance logs, and the third retrieves the content referenced in the provenance logs.

For these examples and more, see https://github.com/bio-guoda/preston/#archiving .

cboettig commented 3 years ago

You can publish / distribute a Preston dataset by copying the data/ folder

yes, totally, I get this. But as you note, my preston data/ folder might be many gigabytes or terabytes. I think it's useful to be able to distribute the preston provenance log or alias map etc. distinct from copying my whole data/ folder, right? (I gather the prov logs are in the data folder too, so I would just need the linked list of hashes corresponding to the prov logs.) Maybe preston history tells me this? But the mechanics of distributing the prov logs in a platform-agnostic way aren't quite clear. (Sorry, I guess we're getting off thread since this is really a preston user question.)

jhpoelen commented 3 years ago

Yep, we are a little off-course 😕, but I think this conversation is crucial to making aliases well-defined, useful and shareable.

Right now, the data/ folder contains three kinds of things:

  1. the linked list of provenance log versions in the hexastore
  2. the provenance logs
  3. tracked content

These can all be stored separately. So, for instance, you can have a super-lightweight preston repo with a 78-byte pointer in it, and then a cascading sequence of remotes that store the provenance logs and content separately.

preston clone --remote https://mysuperlightweight.info/,https://provenance.store.bio/,https://archive.org

where:

https://mysuperlightweight.info/ stores a 78-byte hash pointer to the next/first version of the provenance log

https://provenance.store.bio/ provides access to (heavier) provenance logs

https://archive.org is a content repository that stores all the content in the universe.

When implementing my first pass at remote archives in Zenodo, I used these three levels to speed up performance: the link files were stored as-is, just like the provenance logs, and the content was packaged up in tarballs segmented by hash prefix. This way, preston history is fast, preston ls (printing the provenance logs) is ok, and preston clone downloads the packaged tarball with the matching hash prefix.

Examples:

  1. https://zenodo.org/record/3852671

    • archived copy of history and provenance logs re: tracked content registered in the iDigBio and GBIF networks
  2. https://archive.org/download/biodiversity-dataset-archives/data.zip/data/

    • keeps history, provenance logs and 500GB of archived content
  3. https://deeplinker.bio

    • keeps the most recent history, provenance logs, and content. Currently overlaps with 1) and 2), but might not in the future: 3) can purge data that is redundantly stored elsewhere to free up local disk space.

jhpoelen commented 3 years ago

Perhaps a good way to try this is to set up a mirror in your lab . . . I'd welcome the duplication and donated (shared) server space ; )

jhpoelen commented 3 years ago

The great thing about it is that you don't have to give me access to anything: you can simply run preston clone --remote https://deeplinker.bio periodically and share the data/ folder at some static http endpoint.

jhpoelen commented 3 years ago

the mechanics of distributing the prov logs in a platform-agnostic way aren't quite clear.

Both the history files (or provenance links) and the provenance logs are UTF-8 encoded text files. The third layer, the "content", is just a bunch of bits and bytes (content-agnostic) as far as Preston, or anyone else, is concerned.

While the provenance files are rdf/nquads, I usually just use grep and friends to discover the provenance logs. For more rigorous analysis, I load the logs into a triple store.

So, I think that the setup is pretty platform agnostic.

@cboettig curious to hear whether I have addressed your concerns. . . or perhaps raised new ones ; )

mielliott commented 3 years ago

my preston data/ folder might be many gigabytes or terabytes. I think it's useful to be able to distribute the preston provenance log or alias map etc distinct from copying my whole data/ folder, right?

@cboettig The easiest way to grab just the provenance files out of data/ is to run, from an empty folder, preston ls --remote file:///path/to/your/data/ -- preston will automatically download/copy only what it needs to run that preston ls command, which is exactly 1) the linked list of provenance hashes and 2) the provenance logs themselves, and deposit them in a new data/ folder. Using the https://github.com/bio-guoda/preston-amazon dataset as an example:

$ preston ls --remote https://raw.githubusercontent.com/bio-guoda/preston-amazon/master/data/ > /dev/null
$ ls data/*/*/*
data/1a/a3/1aa34112ade084ccc8707388fbc329dcb8fae5f895cb266e3ad943f7495740b3
data/2a/5d/2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a
data/59/15/5915dffe1569ccd29cc7f6b8aea1341754318d206fef8daf225d8c40154ef8be
data/62/95/6295d7136ff2652a7849262c84af85244688fc13689791c058ae41c44dd0af4a
data/d7/b7/d7b73e3472d5a1989598f2a46116a4fc11dfb9ceacdf0a2b2f7f69737883c951
data/d8/f7/d8f72bd865686e99eac413b36d198fd15f305966d2864091320f1868279451ff

where all of those data/aa/bb/abcd... files are only the provenance and hexastore files (the linked list) that Jorrit mentioned, but none of the "actual" datasets. And, as you've said, preston history will tell you which of the files contain preston provenance data.

To tie back to the original topic :wink: after getting just the provenance/hexastore files, preston alias would list the local aliases, and preston get [alias] --remote [your-remote] could be used to retrieve the associated files.

I hope this helps!

cboettig commented 3 years ago

I still really like this thread, but I'm digesting it slowly. Is it accurate to say that an alias is a metadata assertion about a particular preston -- what's the right noun? -- collection?

So preston get [alias] --remote [remote-collection] allows the get operation to first resolve the human-friendly alias to the corresponding content identifier in the collection?

Is there an inverse operation, i.e. how do I ask preston for the known aliases of some content hash? (Such a reverse operation could also be used/abused to associate additional metadata with a given hash.) Or is that what you would do with grep or a sparql query?
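For the grep route, I imagine a rough sketch like this (assuming data/ holds only the text-based provenance logs; I don't know the exact predicate preston uses for alias statements):

# scan the nquad provenance logs for statements mentioning a given hash
hash <- "hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"
logs <- list.files("data", recursive = TRUE, full.names = TRUE)
hits <- unlist(lapply(logs, function(f)
  grep(hash, readLines(f, warn = FALSE), value = TRUE, fixed = TRUE)))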

To tie into #69, one could imagine similar operations in which get is given some other piece of metadata and resolves content that has that metadata, e.g. get [citation]? Or the reverse...

cboettig commented 1 year ago

I'm still on the fence about aliases as part of the contentid API. It is simple enough for a user to maintain their own look-up table mapping aliases to identifiers suited to their needs. Alternatively, a metadata record will often contain both object names (i.e. user-friendly aliases) and ids: for instance, a http://schema.org/DataDownload has an id (@id in JSON-LD), which can be a hash id, and a http://schema.org/name, which can be an alias. It seems to me the natural way to use aliases would be to refer to such a record.
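For example, resolving through such a record might look like the following sketch (the record values are illustrative, and name_resolve() is a hypothetical helper, not part of the contentid API):

# a metadata record already pairs a friendly name with a hash id
record <- list(
  "@type" = "DataDownload",
  "name"  = "vostok",
  "id"    = "hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"
)

# hypothetical helper: look up the id by name, then resolve as usual
name_resolve <- function(name, records, ...) {
  ids <- vapply(records, function(r) r$id, character(1))
  names(ids) <- vapply(records, function(r) r$name, character(1))
  contentid::resolve(ids[[name]], ...)
}

vostok <- name_resolve("vostok", list(record), store = TRUE)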