:+1: Nice!
I was wondering whether the term "nick" or "nickname" is a little friendlier than the more technical "alias".
from http://xmlns.com/foaf/spec/#term_nick
Property: foaf:nick nickname - A short informal nickname characterising an agent (includes login identifiers, IRC and other chat nicknames).
fyi @mielliott
I'm pretty concerned about (1). Storing the alias-hash mapping outside of the script or project breaks reproducibility. Saving it in the local working directory is OK. I would try to think of a file organization convention that is transparent and could work with hand-editing and a manually coded workflow, so it's obvious to most users what is going on, and then just think of a convenience function on top of that.
I'm actually curious about the premise. Have users expressed this to you? Yes, you want readable names for the files, but how is that different from just having a few bird_migration_data <- "hash://...." calls at the top of a script? The hash needs to go somewhere.
@noamross yeah, me too. some thoughts:
First, note we could completely alleviate that risk if the alias file itself is referenced by its content identifier and not a local path. The obvious downside there is that the user is now responsible for making their alias file discoverable somehow. Tossing it in your GitHub repo and triggering a SoftwareHeritage snapshot is probably the easiest option, but probably still unattractive to users. Which is to say, yeah, for lightweight use at least, I think I agree that some calls like bird_migration_data <- "hash://...." are probably best.
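For concreteness, a minimal sketch of that pattern (the identifier below is a placeholder, not a real hash):

```r
library(contentid)

# readable names are just variables bound to content identifiers at the top
# of the script; the hash below is a placeholder, substitute the real identifier
bird_migration_data <- "hash://sha256/...."

# resolve() returns a path to a verified local copy, fetching the content from
# a registered source if it is not already cached
path <- contentid::resolve(bird_migration_data)
```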
Just for motivating the discussion though, the alias sheet is basically the same concept as the 'data manifest' we discussed earlier. e.g. consider the case of an R package like taxalight, which accesses external data sources via contentid::resolve(). The literal hashes are not embedded in the code file that actually calls contentid::resolve(), but are instead derived from a manifest, https://github.com/boettiger-lab/taxadb-cache/blob/master/prov.json. We still benefit from some of the hash-id features -- no re-downloading if we already have a copy, potential discovery of an alternative content host if a registered URL rots, etc. -- but we also gain the ability to design against an interface without hardcoded hashes. An alias sheet is potentially just a lighter-weight version of said manifest (which may or may not be a good thing).
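And a rough sketch of the manifest-driven alternative -- note the manifest here is a simplified, hypothetical name-to-id map, not the actual structure of prov.json:

```r
library(jsonlite)
library(contentid)

# hypothetical manifest: a simple name -> content-identifier map kept alongside the code
manifest <- jsonlite::read_json("manifest.json", simplifyVector = TRUE)

# the calling code looks up the identifier by a human-friendly name and never
# embeds the literal hash
id <- manifest[["bird_migration_data"]]
path <- contentid::resolve(id)
```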
@jhpoelen I'm ok with foaf:nick, though I think using the property might imply things about the rdf:type of the object that aren't true (i.e. foaf:nick is a property of a foaf:Person). I think https://schema.org/alternateName might be more appropriate (or schema:name, though really that doesn't imply a nice id-type string). For now I'm comfortable with alias() as the CLI function / verb for creating these things, like you have in preston. I feel like alias can be read as a verb, but a command like preston nick hash://.... sounds more confusing.
I am continuing to have fun experimenting with aliases - see the example of creating a (versioned) alias pointing to a part of a file, in this case a bee name: https://github.com/bio-guoda/preston/issues/135#issuecomment-931617840.
Also,
if the alias file itself is referenced by its content identifier and not a local path.
Yes! In Preston, the alias is automatically added to a new version of the publication. Because preston alias beename.txt only resolves aliases within its content universe (i.e. linked provenance logs / manifest), the alias is well-defined for future use.
I like your suggestion and agree that preston nick beename.txt looks a bit funny ; )
This makes me wonder - aren't urls and filenames just aliases for specific content? Why treat a new name (e.g., file:beename.txt) any differently than https://example.org/beename.txt or file:///my/path/beename.txt?
In the end, the difference is the process which helps describe the relation between some name (or alias) and some specific content. For downloads this is a whole chain of actors (e.g., web server, DNS, firewall, proxy), whereas the explicit alias might only depend on some offline process or individual actor. For both cases, the end result is a statement saying: this name is associated with that content (modeled as a process activity in the provenance log).
@jhpoelen but the problem is that both names and urls are sometimes aliases for different "versions" of the "same" content (aka different content!), but other times they are aliases for static, unchanging content. Our use of filename and URL semantics is invariably imprecise on this point.
Because preston alias beename.txt only resolves aliases within it's content universe (i.e. linked provenance logs / manifest), ...
I think this makes sense, but sometimes I'm a bit unclear on how that universe is defined. e.g. if I run the command on a different computer, or put preston cat birds in a shell script I post on GitHub, can someone else expect that to work for them too (without first re-generating the alias)? i.e. we still need a mechanism to publish/distribute the linked provenance log itself, right?
but the problem is that both names and urls are sometimes aliases for different "versions" of the "same" content (aka different content!), but other times they are aliases for static, unchanging content. Our use of filename and URL semantics is invariably imprecise on this point.
Yes, aliases are names, and the meaning of a name lies in its relation to the context in which it exists. In the preston implementation, this context is the provenance log, allowing non-unique aliases to exist explicitly as well-defined statements in the content universe (e.g., at position (a time proxy) X in prov log Y, alias Z points to hash A, where A and Z are content ids).
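One rough way to picture those statements (purely illustrative -- in Preston they are RDF statements in the provenance log, not a table; the hashes below are placeholders):

```r
# each row reads as "at position X in prov log Y, alias Z points to hash A";
# the same alias may recur, and each occurrence stays well-defined by its context
alias_statements <- data.frame(
  prov_log = c("hash://sha256/aaa...", "hash://sha256/bbb..."),
  position = c(1, 1),
  alias    = c("file:beename.txt", "file:beename.txt"),
  content  = c("hash://sha256/111...", "hash://sha256/222...")
)
```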
e.g. if I run the command on a different computer, or put preston cat birds in a shell script I post on GitHub, can someone else expect that to work for them too (without first re-generating the alias?) i.e. we still need a mechanism to publish/distribute the linked provenance log itself, right?
You can publish / distribute a Preston dataset by copying the data/ folder, similar to how you can copy a git repository by copying the .git folder. Cloning and appending of Preston data packages are taken care of by Preston, but can also be implemented elsewhere.
A (remote) preston "push" is not yet implemented because this can be done with existing copy tools (e.g., cp, scp, rsync, git, and the ubiquitous upload button). A local preston push can be implemented using the preston copyTo command: this copies content from one preston repository (which can be remote) to another local repository.
A preston "pull" has been implemented in the preston clone [some endpoint]
or preston pull [some endpoint]
command. Note that a preston clone
is nothing more than running preston history --remote repo1,repo2,...
, preston ls --remote repo1,repo2,...
and preston verify --remote repo1,repo2,...
. The first retrieves the linked list of provenance logs, the second gets the associated provenance logs, and the third retrieves content referenced in the provenance logs.
For examples see https://github.com/bio-guoda/preston/#archiving and more.
You can publish / distribute a Preston dataset by copying the data/ folder
Yes, totally, I get this. But as you note, my preston data/ folder might be many gigabytes or terabytes. I think it's useful to be able to distribute the preston provenance log or alias map etc. separately from copying my whole data/ folder, right? (I gather the prov logs are in the data folder too, so I would just need the linked list of hashes corresponding to the prov logs.) Maybe preston history tells me this? But the mechanics of distributing the prov logs in a platform-agnostic way aren't quite clear. (Sorry, I guess we're getting off thread since this is really a preston user question.)
Yep, we are a little off-course 😕, but I think this conversation is crucial to making aliases well-defined, useful and shareable.
Right now, the data/ folder contains three kinds of things:
1. the history files (or provenance links): 78-byte hash pointers to the next/first version of a provenance log,
2. the provenance logs themselves, and
3. the content referenced in those logs.
These can all be stored separately. So, for instance, you can have a super-lightweight preston repo with a 78-byte pointer in it, and then a cascading sequence of remotes that store the provenance logs and the content separately.
preston clone --remote https://mysuperlightweight.info/,https://provenance.store.bio/,https://archive.org
where:
https://mysuperlightweight.info/ stores 78-byte hash pointers to the next/first version of the provenance log,
https://provenance.store.bio/ provides access to the (heavier) provenance logs, and
https://archive.org is a content repository that stores all the content in the universe.
When implementing my first pass at remote archives in Zenodo, I used these three levels to speed up performance: the link files were stored as-is, just like the provenance logs, and the content was packaged up in tarballs segmented by hash prefix. This way, preston history is fast, preston ls (print provenance logs) is OK, and preston clone downloads the packaged tarball for the corresponding hash prefix.
An example is:
https://zenodo.org/record/3852671
https://archive.org/download/biodiversity-dataset-archives/data.zip/data/
Perhaps a good way to try this is to set up a mirror in your lab... I'd welcome the duplication and donated (shared) server space ; )
The great thing about it is that you don't have to give me access to anything: you can simply run preston clone --remote https://deeplinker.bio periodically and share the data/ folder at some static http endpoint.
the mechanics of distributing the prov logs in a platform-agnostic way aren't quite clear.
Both the history files (or provenance links) and provenance logs are utf-8 encoded text files. The third layer, the "content", is just a bunch of bits and bytes (content agnostic) as far as Preston, or anyone else, is concerned.
While the provenance files are rdf/nquads, I usually just use grep and friends to discover the provenance logs. For more rigorous analysis, I load the logs into a triple store.
So, I think that the setup is pretty platform agnostic.
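For example, a rough R equivalent of the grep approach, assuming a local nquads provenance file (the path and identifier below are placeholders):

```r
# a provenance log is a utf-8 text file with one nquad statement per line;
# filter for every statement that mentions a given content identifier
prov <- readLines("path/to/provenance.nq")   # placeholder path
id   <- "hash://sha256/..."                  # placeholder content id
prov[grepl(id, prov, fixed = TRUE)]
```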
@cboettig curious to hear whether I have addressed your concerns. . . or perhaps raised new ones ; )
my preston data/ folder might be many gigabytes or terabytes. I think it's useful to be able to distribute the preston provenance log or alias map etc. separately from copying my whole data/ folder, right?
@cboettig The easiest way to grab just the provenance files out of data/ is to run, from an empty folder, preston ls --remote file:///path/to/your/data/ -- preston will automatically download/copy only what it needs to run that preston ls command, which is exactly (1) the linked list of provenance hashes and (2) the provenance logs themselves, and deposit them in a new data/ folder. Using the https://github.com/bio-guoda/preston-amazon dataset as an example,
$ preston ls --remote https://raw.githubusercontent.com/bio-guoda/preston-amazon/master/data/ > /dev/null
$ ls data/*/*/*
data/1a/a3/1aa34112ade084ccc8707388fbc329dcb8fae5f895cb266e3ad943f7495740b3
data/2a/5d/2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a
data/59/15/5915dffe1569ccd29cc7f6b8aea1341754318d206fef8daf225d8c40154ef8be
data/62/95/6295d7136ff2652a7849262c84af85244688fc13689791c058ae41c44dd0af4a
data/d7/b7/d7b73e3472d5a1989598f2a46116a4fc11dfb9ceacdf0a2b2f7f69737883c951
data/d8/f7/d8f72bd865686e99eac413b36d198fd15f305966d2864091320f1868279451ff
where all of those data/aa/bb/abcd... files are only the provenance and hexastore files (the linked list) that Jorrit mentioned, but none of the "actual" datasets. And, as you've said, preston history will tell you which of the files contain preston provenance data.
To tie back to the original topic :wink: after getting just the provenance/hexastore files, preston alias would list the local aliases, and preston get [alias] --remote [your-remote] could be used to retrieve the associated files.
I hope this helps!
I still really like this thread, but I'm digesting it slowly. Is it accurate to say that an alias is a metadata assertion about a particular preston -- what's the right noun? -- collection? So preston get [alias] --remote [remote-collection] allows the get operation to first resolve the human-friendly alias to the corresponding content identifier in the collection?
Is there an inverse operation, i.e. how do I ask preston for known aliases of some content hash? (Such a reverse operation could also be used/abused to associate additional metadata with a given hash.) Or is that what you would do with grep or a sparql query?
To tie into #69, one could imagine similar operations in which get is given some other piece of metadata and resolves content that has that metadata, e.g. get [citation]? Or the reverse...
I'm still on the fence about aliases as part of the contentid API. It is simple enough for a user to maintain their own look-up table mapping aliases to identifiers suited to their needs. Alternatively, a metadata record will often contain both object names (i.e. user-friendly aliases) and ids (for instance, a http://schema.org/dataDownload has an http://schema.org/id, which can be a hash id, and a http://schema.org/name, which can be an alias). It seems to me the natural way to use aliases would be to draw them from such metadata records.
A major limitation in the current model is that many users are reluctant to deploy long hashes in code: a call spelling out the full content hash looks rather cumbersome. Assigning aliases could work around this. This is not dissimilar to the use of aliases in pins or storrr, but in our case the alias does not become the primary key for the data. The alias is merely a shorthand for the hash.

- alias(id, name) would create an entry in a local file (tsv maybe?) associating the alias with the id.
- resolve would detect if a string was an alias (do we namespace aliases, or merely attempt to resolve anything that doesn't start with hash:// as a potential alias reference?), and if so, attempt to translate it into the corresponding hash and resolve that as usual.
- resolve would gain an optional argument of aliases to locate the alias file, with a simple default location.

Issues:

1. Where to store the alias file: the project directory, or a central location (e.g. ~/.share/local/R/contentid), though users may want to utilize aliases across projects.
2. The alias file could instead be a richer metadata record in which the alias is the schema:name. This may have much added utility in figuring out what's what, and open the door to using other file metadata (filename, format, description, author, etc.) as a mechanism to resolve hashes. Downside is that parsing overhead may degrade performance, and greater complexity of implementation = more room for errors etc.

Inspired by preston: https://github.com/bio-guoda/preston/issues/135. cc @jhpoelen
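To make the proposal above concrete, here is a minimal sketch of what alias() and an alias-aware resolve wrapper could look like, assuming a plain two-column tsv in the working directory. None of this exists in contentid yet; the function names, the aliases.tsv default, and the resolve_alias() wrapper are placeholders for discussion.

```r
library(contentid)

# record an alias -> identifier pair in a local tsv (default: ./aliases.tsv)
alias <- function(id, name, aliases = "aliases.tsv") {
  entry <- data.frame(name = name, id = id, stringsAsFactors = FALSE)
  write.table(entry, aliases, sep = "\t", row.names = FALSE,
              append = file.exists(aliases), col.names = !file.exists(aliases))
  invisible(entry)
}

# resolve either a hash URI or an alias: anything not starting with "hash://"
# is first looked up in the alias file, then handed to contentid::resolve()
resolve_alias <- function(x, aliases = "aliases.tsv", ...) {
  if (!grepl("^hash://", x) && file.exists(aliases)) {
    tbl <- read.table(aliases, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
    hit <- tbl$id[tbl$name == x]
    if (length(hit) > 0) x <- hit[[1]]
  }
  contentid::resolve(x, ...)
}

# usage (the identifier is a placeholder):
# alias("hash://sha256/....", "bird_migration_data")
# path <- resolve_alias("bird_migration_data")
```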