cboettig / contentid

:package: R package for working with Content Identifiers
http://cboettig.github.io/contentid

Resolve against OCI / container registries as sha256-addressed content stores? #94

Open cboettig opened 1 year ago

cboettig commented 1 year ago

@jeroen pointed out to me that (docker or OCI) container registries like GitHub Package Registry are really just a bunch of sha256-addressed blobs. Moreover, existing open-source tools like oras already make it pretty easy to push arbitrary content there.

Notably, the docker/OCI registry is already a widely adopted standard, easy to self-host and readily found on any major cloud provider. GitHub package registry is 'free for public packages' -- Jeroen notes that large projects like brew use it as their back-end storage and distribution medium.

For instance, here's a command to grab my favorite example from the GitHub content registry; you can request it merely by its sha256 hash:

curl -fL --header "Authorization: Bearer QQ=="  "https://ghcr.io/v2/cboettig/content-store/blobs/sha256:9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37" -o test.txt

and verify this is indeed the Vostok ice core file. To push to the content store I've used the oras client tool:

oras login ghcr.io
oras push ghcr.io/cboettig/content-store ./vostok.co2
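The "verify" step above is just a digest comparison. Here's a sketch of it using a stand-in file so it runs without network access (in practice `expected` would come from the blob URL and the file from the curl download):

```shell
# Sketch: verify a fetched blob against the sha256 hex digest in its address.
# A stand-in file keeps the example self-contained; in practice, test.txt is
# the curl download and `expected` is the digest from the blob URL.
printf 'stand-in for vostok.co2\n' > test.txt
expected=$(sha256sum test.txt | cut -d' ' -f1)  # in practice: from the URL
actual=$(sha256sum test.txt | cut -d' ' -f1)
if [ "$actual" = "$expected" ]; then
  echo "content verified"
else
  echo "hash mismatch" >&2
fi
```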

Container registries actually have a nice manifest system for associating multiple files with a single manifest and adding metadata, including file names, tags (i.e. version tags), and (I think) MIME types, as well as generic extensible metadata in labels. All of this could facilitate much richer discovery than the approach above, where I just use opaque blobs. These features could be particularly interesting, but maybe beyond the scope/generality of contentid?
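For a rough idea of what that manifest system looks like, an OCI artifact manifest tying a named file to a blob digest might look something like this (the mediaType, sizes, and config digest below are illustrative placeholders; the layer digest is the Vostok example from above):

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.empty.v1+json",
    "digest": "sha256:<config-digest>",
    "size": 2
  },
  "layers": [
    {
      "mediaType": "text/plain",
      "digest": "sha256:9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37",
      "size": 42000,
      "annotations": {
        "org.opencontainers.image.title": "vostok.co2"
      }
    }
  ]
}
```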

Anyway, this does seem like an intriguing way to deploy (public) data to a system that acts both as a content store and a registry addressable by sha256 hash, using a mechanism that is easily deployed locally (self-hosted container registry) and already in use by many major providers. @jhpoelen @mbjones thoughts?

jhpoelen commented 1 year ago

@cboettig nice! Thanks for sharing. I haven't looked at this yet, but it definitely sounds interesting.

At a quick glance, the Open Container Initiative image spec https://github.com/opencontainers/image-spec/blob/main/spec.md sounds much like the architecture of Preston, with a provenance / content layer. The difference is that the OCI image spec is more specific (less flexible): a Preston-style package can use any text-based format (including rdf/nquads), whereas the OCI image spec is geared towards file systems and computer programs.

I'd be open to experimenting with supporting the OCI image spec in context of Preston code base.

jhpoelen commented 1 year ago

fyi @mielliott

jhpoelen commented 1 year ago

Note that open access does not appear to be allowed for the github content registry -

$ curl -L  "https://ghcr.io/v2/cboettig/content-store/blobs/sha256:9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"
{"errors":[{"code":"UNAUTHORIZED","message":"authentication required"}]}
jhpoelen commented 1 year ago

Also, note that github doesn't seem to be able to locate content store SHAs via their api - is this expected? In which package registry did you publish hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37 ?

https://github.com/search?q=9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37

see attached screenshot


cboettig commented 1 year ago

Hi @jhpoelen

Open access is absolutely supported; please be sure to set the header as specified in my example:

curl -fL --header "Authorization: Bearer QQ=="  "https://ghcr.io/v2/cboettig/content-store/blobs/sha256:9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37" -o test.txt

No authentication is required, only setting the header.
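As an aside, the `QQ==` token above is not a secret: it's just base64 for the single letter `A`. GHCR appears to accept any non-empty base64 string as a bearer token for anonymous pulls of public packages (my observation, not documented behavior):

```shell
# Decode the "anonymous" bearer token -- it's just base64 for "A".
echo "QQ==" | base64 -d
# prints: A
```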

I published this example to the package called cboettig/content-store, though note that packages are manifests; it is also possible to push blobs directly.

Note that this has nothing to do, per se, with the GitHub API. To locate content on the container registry you need to search the container registry -- this could be any container registry -- e.g. the oras docs show using a zot registry on localhost. I haven't really played around with a local registry yet. So far, though, I think you should be able to access the object by hash, using the header as in my original example, without any authentication:

curl -fL --header "Authorization: Bearer QQ=="  "https://ghcr.io/v2/cboettig/content-store/blobs/sha256:<SHA>"
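From the contentid side, mapping a `hash://sha256/...` URI onto a registry blob URL is a simple string rewrite. A minimal sketch, assuming the ghcr.io registry and my content-store namespace:

```shell
# Sketch: rewrite a contentid-style hash URI into an OCI blob endpoint.
# Registry and namespace are assumptions for illustration.
hash_uri="hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"
sha="${hash_uri#hash://sha256/}"
echo "https://ghcr.io/v2/cboettig/content-store/blobs/sha256:${sha}"
```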
cboettig commented 1 year ago

@jhpoelen

OCI image spec is geared towards file systems and computer programs.

Right, OCI was imagined for that purpose, but as you have so often said, bits are bits: it's just a key-value store indexed by sha256 hashes of those bits, and it really doesn't care whether they are rdf, plain txt, or something else. As the oras docs say,

The OCI Artifacts project is an attempt to define an opinionated way to leverage OCI Registries for arbitrary artifacts without masquerading them as container images.

oras has a nice command-line tool and client libraries, but as I understand it, it should be possible to interact with any Open Container Registry using standard tools (e.g. sha256 hashes and curl requests). This overview is a particularly nice summary which I think very much resonates with the preston design?

Independent of the oras implementation, you've probably seen the opencontainers descriptor spec for any compliant OCI registry. A nice list of compliant registries includes many open source self-hosting options as well as registries of most commercial cloud providers, which suggests to me this strategy has at least as much capacity to scale as IPFS, but who knows.

jhpoelen commented 1 year ago

Thanks for elaborating.

from

preston cat\
 --remote https://softwareheritage.org\
 'line:hash://sha256/69f9224697d534daf8079fad21cd45cbbe888720014455dc9b15b600fc8cc063!/L52,L53'

or

https://linker.bio/line:hash://sha256/69f9224697d534daf8079fad21cd45cbbe888720014455dc9b15b600fc8cc063!/L52,L53

or

https://github.com/oras-project/oras-www/blob/58694b970c619c6c5abc731195da22db0e264214/docs/client_libraries/overview.mdx#L52,L53

, I got pretty excited when I read -

Besides plain blobs, it is natural to store directed acyclic graphs (DAGs) in a CAS. Precisely, all blobs are leaf nodes and most manifests are non-leaf nodes.

Thank you!

Hoping to play around with this more sooner rather than later. It'd be fun to have the oras documentation be hosted via . . . a content registry . . . aside from all known biodiversity datasets that have been tracked by Preston since 2018 (and dataone, and zenodo, etc. etc.).

cboettig commented 1 year ago

@jhpoelen very cool, love the deep link referencing in preston btw!

I think ages ago we talked about hosting mirrors of preston, and I know we've also been on the lookout for more reliable / stable options for https://hash-archive.org, which seems to have gone dark forever now (even though my fork, https://hash-archive.carlboettiger.info/, is still up for now). My instinct is that these container registries are going to be around for a while (the open-source docker registry idea / spec / software has already been around for about a decade, and the concept is clearly growing steadily). So I'm keen on alternatives that can function in a similar capacity to the hash-archive.org registry (even better if they also act as a content store), with greater scalability and reliability. How do you feel about leveraging all these existing container registries, with their ability both to act as content-addressable storage and to host manifests as a DAG, as the new hash-archive.org or as preston mirrors?

Obviously the container registries need a bit of software wrapping around them to make it easy to use them in this way rather than in the way they were originally intended (i.e. using docker or singularity, etc). oras is a nice example of this and maybe a good starting point, though probably not the only way to go about this...

jhpoelen commented 1 year ago

Leveraging, and experimenting with, existing infrastructure / technologies like container registries (via Open Container Initiative and associated projects like https://github.com/oras-project ) sounds like a good idea, especially now that @nfranz , @GregPost-ASU and friends at ASU (@n8upham) seem pretty excited about helping out to increase mobility of our diversity data through content-addressed repositories and associated registries.

They've already helped keep a copy of GenBank's plant sequence flat file archives in a preston package via BioKiC / Globus. (see use case linking Jenn Yost's San Luis Obispo Herbarium OBI https://github.com/globalbioticinteractions/globalbioticinteractions/issues/904 to their associated GenBank sequences).

Is there a particular use case you have in mind?

jhpoelen commented 12 months ago

Just implemented rudimentary support for the github content registry:

$ preston cat --remote https://ghcr.io/cboettig/content-store hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37\
 | head -n5
*******************************************************************************
*** Historical CO2 Record from the Vostok Ice Core                          ***
***                                                                         ***
*** Source: J.M. Barnola                                                    ***
***         D. Raynaud                                                      ***
jhpoelen commented 12 months ago

Is there a registry of oci registries?

jhpoelen commented 12 months ago

Here's a list -

https://github.com/oras-project/oras-www/blob/58694b970c619c6c5abc731195da22db0e264214/docs/adopters.mdx#L15-L24

preston cat --remote https://softwareheritage.org 'line:hash://sha256/85044a71a67b1ad51e71e718eb773a5977f0f60c8bcf7771e61905ca9c160cfb!/L15-L24'

or

https://linker.bio/line:hash://sha256/85044a71a67b1ad51e71e718eb773a5977f0f60c8bcf7771e61905ca9c160cfb!/L15-L24

## Registries supporting OCI Artifacts

- [CNCF Distribution](#cncf-distribution) - local/offline verification
- [Amazon Elastic Container Registry](#amazon-elastic-container-registry-ecr)
- [Azure Container Registry](#azure-container-registry-acr)
- [Google Artifact Registry](#google-artifact-registry-gar)
- [GitHub Packages container registry](#github-packages-container-registry-ghcr)
- [Bundle Bar](#bundle-bar)
- [Docker Hub](#docker-hub)
- [Zot Registry](#zot-registry)
cboettig commented 12 months ago

Nice @jhpoelen , this is awesome.

Yup, I have a bunch of use-cases in mind!

I've always had my eye out for robust / low-friction ways for researchers to work with and distribute their own data using content-addressed storage -- i.e., from an R perspective, replacing read_csv("something.txt") with contentid::resolve("hash://sha256/xxx") |> read_csv() and having it work everywhere. DOI-granting archives like DataONE are excellent for long-term storage, but have drawbacks: (a) we can't predict the hashes DataONE uses; (b) Zenodo only supports md5sums; (c) submitting to these (especially DataONE repos) is high-friction due both to metadata and authentication processes, meaning that in practice it becomes a one-off at publication time, not a regular and automated part of researcher workflows at development time; (d) Zenodo and DataONE repos are optimized for archival storage, which in practice often means low-bandwidth downloads that make them cumbersome for daily primary access; and (e) while in principle they support authenticated access when required, e.g. for pre-publication data a researcher isn't ready to share, in practice this is too cumbersome a workflow (see c and d). (SoftwareHeritage is lower friction to submit and sha256-based, but is limited to small data files we can fit on github, and its rate-limited API again means it's only practical for its intended use as an archive, not as a go-to object store.)

In contrast, these OCI repos seem far better set up for scale and ease of use in day-to-day operations. (Note that I believe this complements rather than competes with the scientific data repositories, which were never intended for this task.) So my basic use case is that a user can push data to an OCI registry and retrieve it with a content-based address later, with minimal friction and maximum performance.

I'm curious as well about mirroring existing datasets in these repos or a preston archive into these OCI systems where we can access them with content-based addresses and benefit from the greater storage/bandwidth availability of these hosts. iiuc, with preston I can copy either just the registry metadata or the actual content over to my own machine -- maybe one could do that with, say, globalbioticinteractions?

mbjones commented 12 months ago

@cboettig this is all awesome. I'm just catching up with the thread, but seems so promising. Seems like making DataONE and other repository systems OCI compliant would be a great and easy approach to integrating traditional repositories with OCI. Given that Brew is using GHCR for their packages, I wonder if GHCR limits sizes of its public uploads? It seems it would be easy to upload copies of the DataONE corpus to GHCR as a caching/distribution layer.

I'm giving a brief orienting talk on content-based identifiers at ESIP tomorrow (https://sched.co/1NodS), and will plan to include this in my remarks on authority-based versus content-based identification. It's very cool. If either of you wanted to join remotely, they have hybrid access set up for registered conference folks.

cc @artntek @doulikecookiedough

jhpoelen commented 12 months ago

hey @mbjones - looks like a great conference - I probably won't join 'cause I am attending another workshop at the Field Museum Thu/Fri.

Can't help but plug our recent publication on the application of content-based identifiers, in the form of signed citations, for digital scientific data publication -

Elliott, M.J., Poelen, J.H. & Fortes, J.A.B. Signing data citations enables data verification and citation persistence. Sci Data 10, 419 (2023). https://doi.org/10.1038/s41597-023-02230-y https://linker.bio/hash://sha256/f849c870565f608899f183ca261365dce9c9f1c5441b1c779e0db49df9c2a19d

Please keep me posted on the responses you'll get, and insights gained following the ESIP sessions.

PS - I am still looking for a list of OCI enabled endpoints (I have ghcr.io, and hoping to add other ones, especially those that have open access support).

mbjones commented 12 months ago

Just a note: I see that OCI has consolidated on SHA-256 and SHA-512 as the only two officially listed hash algorithms for digests. See: https://github.com/opencontainers/image-spec/blob/main/descriptor.md#registered-algorithms

Oh, but I also see they say:

Implementations SHOULD allow digests with unrecognized algorithms to pass validation if they comply with the above grammar. While sha256 will only use hex encoded digests, separators in algorithm and alphanumerics in encoded are included to allow for extensions. As an example, we can parameterize the encoding and algorithm as multihash+base58:QmRZxt2b1FVZPNqd8hsiykDL3TdBDeTSPX9Kv46HmX4Gx8, which would be considered valid but unregistered by this specification.

mbjones commented 12 months ago

Hey, so do either of you know how to search an OCI registry like GHCR to resolve a hash without knowing the namespace that it's in? For example, while this works, it requires info that is not in the hash:

oras blob fetch --output taxa.csv ghcr.io/mbjones/dom-taxa@sha256:d6b10f57f5a1f2ecb5f8b01dd6121698165562c5914d77be6067f00ff6634c35

But, is there a way to eliminate the need for the mbjones/dom-taxa namespace and just resolve the hash? Seems like hashes should be unique across repos/namespaces, and as long as the item is public it should resolve. But I haven't found a way to do that through the OCI api or the github API.

cboettig commented 12 months ago

@mbjones good question, I wondered that too.

There's an optional endpoint in the spec called referrers, which looks like it might be related to this, but GHCR doesn't appear to support the referrers endpoint at present anyway? In the homebrew examples, it looks like the namespace is hard-coded.

From the contentid perspective, it seems like one might need to consider the namespace and domain name together as defining an endpoint? Definitely not ideal, though if we're really leaning into the distributed logic I guess that's no different than working across, say, GHCR and GitLab OCI, or one of the self-hosted OCI systems (zot, harbor, etc)...
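A crude sketch of what "namespace + domain as endpoint" could mean in practice: resolution would try each configured (registry, namespace) pair in turn. The endpoint list here is purely illustrative; a real resolver would curl each candidate URL with `-f` and stop at the first hit:

```shell
# Sketch: without a global hash index, iterate over a configured list of
# registry/namespace endpoints and construct a candidate blob URL for each.
sha="d6b10f57f5a1f2ecb5f8b01dd6121698165562c5914d77be6067f00ff6634c35"
for ns in "ghcr.io/mbjones/dom-taxa" "ghcr.io/cboettig/content-store"; do
  registry="${ns%%/*}"   # e.g. ghcr.io
  repo="${ns#*/}"        # e.g. mbjones/dom-taxa
  echo "candidate: https://${registry}/v2/${repo}/blobs/sha256:${sha}"
done
```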

I'm not clear on size or bandwidth limits on GHCR, the docs for public storage only say "free".

I'm going to try playing a bit with the self-hosted registries, they look very simple to deploy...

mbjones commented 12 months ago

Even referrers seems to include <name> in the API endpoint. The table at the end of the spec confirms that every endpoint includes <name>:

https://github.com/opencontainers/distribution-spec/blob/main/spec.md#endpoints