dockstore / dockstore

Our VM/Docker sharing infrastructure and management component
https://dockstore.org/
Apache License 2.0

Cross site indexing of content #1049

Closed vsoch closed 3 years ago

vsoch commented 6 years ago

Feature Request

In response to this discussion https://twitter.com/DockstoreOrg/status/937730314746712064

It would be great to have a community-developed cross index of Dockstore (and other container resource) content. I can add some detail about Singularity Hub / Registry, if that helps.

Minimally, given that a manifest is associated with some kind of container content, we would want to be able to search them. The general container naming convention that Singularity Hub uses:

<registry>/<namespace>/<container>:<digest>

Is also a nice search strategy in that it maps well to any kind of storage.
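As a rough illustration of why that convention is easy to search and store, here is a minimal Python sketch that splits a name into its four parts. The `ContainerName` type and `parse_container_name` function are hypothetical names invented for this example, and treating the registry and digest as optional is an assumption, not something Singularity Hub specifies here.

```python
import re
from typing import NamedTuple, Optional


class ContainerName(NamedTuple):
    """Parts of a <registry>/<namespace>/<container>:<digest> name."""
    registry: Optional[str]
    namespace: str
    container: str
    digest: Optional[str]


# Assumption: registry and digest may be omitted; no part contains "/".
_NAME_RE = re.compile(
    r"^(?:(?P<registry>[^/:]+)/)?"
    r"(?P<namespace>[^/:]+)/"
    r"(?P<container>[^/:]+)"
    r"(?::(?P<digest>.+))?$"
)


def parse_container_name(name: str) -> ContainerName:
    """Split a container name into its parts, or raise ValueError."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid container name: {name!r}")
    return ContainerName(**match.groupdict())


def storage_key(name: ContainerName) -> str:
    """Join the parts that are present into a path-like storage key."""
    return "/".join(part for part in name if part)
```

Because every part is a plain path segment, the same key could name a directory on a filesystem or an object in cloud storage.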

I don't have specific suggestions or ideas, but just want to open up the conversation. I am an advocate for a strategy that is easy, maybe even fun, and flexible enough to allow for many different resources (e.g., a formal registry, or a GitHub repository).

┆Issue is synchronized with this Jira Story ┆fixVersions: Dockstore 2.X ┆friendlyId: DOCK-299 ┆sprint: Backlog ┆taskType: Story

ArangoGutierrez commented 6 years ago

+1

denis-yuen commented 6 years ago

What is the difference between Singularity Registry and Singularity Hub? (https://singularity-hub.org/ seems to use Singularity Hub and Singularity Container registry)

Interesting though, https://singularity-hub.org/collections looks nice. We have a similar approach in that most of the data that we serve matches the models in our database, but we have a small subset of endpoints that serve a (hopefully more) generic representation of tools https://dockstore.org:8443/static/swagger-ui/index.html#/GA4GH .

vsoch commented 6 years ago

Singularity Hub is a cloud service that builds containers from GitHub repos. Singularity Registry is optimized for a local institution to deploy and then push images to. Both serve the same manifests for downloading, but the registry serves images from a filesystem, and Singularity Hub serves them from Google Cloud Storage.

Additional functions to parse the manifests that the registries serve (to then make them searchable from one box) would fit the model I was describing. If we extend that model to manifests of different kinds, then I think we have a good start!

But note the strategy I came up with is as simple as I could possibly make it - the entire page I linked is served statically (rendered automatically) via GitHub Pages, and other registries are added by adding a markdown file and doing a PR. Any method we use to collect manifests should be that simple - if we have a server, then it could be just an interactive web interface to do it.

denis-yuen commented 6 years ago

Ok, so if I understand correctly. Individual institutions can spin up a Singularity Registry to provide Singularity containers in addition to using those from Singularity Hub. Currently, you have a strategy of recording what sites have Singularity Manifests by listing them in GitHub and generating a page using GitHub Pages based on that information.

We have a similar idea in that we were thinking of creating a Badge system to validate and check that systems that implement the Tool Registry Schema are valid and then creating a list of them.

vsoch commented 6 years ago

> Ok, so if I understand correctly. Individual institutions can spin up a Singularity Registry to provide Singularity containers in addition to using those from Singularity Hub.

Yes, correct! The different registries are handled via the URI. The default (meaning none specified) uses singularity-hub.org, e.g.:

shub://vsoch/hello-world -->  shub://singularity-hub.org/vsoch/hello-world

and the container is accessible from the command line with the Singularity software itself:

singularity pull shub://vsoch/hello-world

and then a registry would just add in their address:

shub://dockerstore-registry.org/vsoch/hello-world
singularity pull shub://dockerstore-registry.org/vsoch/hello-world
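The default-registry rule above can be sketched in a few lines of Python. This is only an illustration: `resolve_shub_uri` is a hypothetical helper, and it assumes a URI has either two path parts (namespace/container) or three (registry/namespace/container). The actual Singularity client may resolve URIs differently.

```python
DEFAULT_REGISTRY = "singularity-hub.org"
SCHEME = "shub://"


def resolve_shub_uri(uri: str) -> str:
    """Return the URI with the default registry filled in when none is given."""
    if not uri.startswith(SCHEME):
        raise ValueError(f"expected a {SCHEME} URI, got {uri!r}")
    parts = uri[len(SCHEME):].split("/")
    if any(not part for part in parts):
        raise ValueError(f"empty path component in {uri!r}")
    if len(parts) == 2:
        # Assumption: two parts means namespace/container with no registry.
        parts.insert(0, DEFAULT_REGISTRY)
    elif len(parts) != 3:
        raise ValueError(f"expected 2 or 3 path components in {uri!r}")
    return SCHEME + "/".join(parts)
```

A URI that already names a registry passes through unchanged, so a cross-site index could normalize every entry to the three-part form before comparing them.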

> Currently, you have a strategy of recording what sites have Singularity Manifests by listing them in GitHub and generating a page using GitHub Pages based on that information.

Yes, it's the easiest and simplest method I could think of to have manifests at least browsable from a single place. Having a server would be much better, but in academia we generally don't have funding for anything, so I use GitHub hacks a lot :)

> We have a similar idea in that we were thinking of creating a Badge system to validate and check that systems that implement the Tool Registry Schema are valid and then creating a list of them.

This is an awesome idea! I would love to help. Could you tell me more about the Tool Registry Schema? I'm guessing that we could start with some repository with metadata about the registries, then have each endpoint checked against the schema, and produce a badge to reflect the final score?
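To make the guessed flow concrete, here is a minimal Python sketch that scores a batch of records from an endpoint and maps the score to a badge label. Everything in it is an assumption for illustration: the field names in `REQUIRED_FIELDS`, the thresholds, and the badge labels are invented, and the real Tool Registry Schema defines its own requirements.

```python
from typing import Any, Dict, Iterable, Sequence

# Hypothetical required fields, standing in for whatever the schema requires.
REQUIRED_FIELDS = ("id", "url", "organization", "versions")


def score_records(
    records: Iterable[Dict[str, Any]],
    required: Sequence[str] = REQUIRED_FIELDS,
) -> float:
    """Return the fraction of records that contain every required field."""
    records = list(records)
    if not records:
        return 0.0
    valid = sum(1 for record in records if all(f in record for f in required))
    return valid / len(records)


def badge_for(score: float) -> str:
    """Map a score in [0, 1] to a badge label (thresholds are arbitrary)."""
    if score >= 1.0:
        return "passing"
    if score >= 0.5:
        return "partial"
    return "failing"
```

A real check would validate field types and nested structure as well, for example with a JSON Schema validator, rather than only testing for key presence.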

ps-account commented 6 years ago

Great thread! Thank you for picking this up way faster than I could reach my nearest laptop @vsoch !!!! :) Really liking the badge / schema setup. Could ga4gh/dockstore also be a central place to register and provide, as a URI, the various schema ontology versions for all the different tools (Docker/Singularity/CWL/etc), to have some rigid references to test against?

denis-yuen commented 6 years ago

> Could you tell me more about the Tool Registry Schema? I'm guessing that we could start with some repository with metadata about the registries, then have each endpoint checked against the schema, and produce a badge to reflect the final score?

So when we created Dockstore, we knew that we wanted to specialize in workflows that used Docker containers. The architecture of Dockstore is an Angular front-end backed by a RESTful web service (described here https://dockstore.org:8443/static/swagger-ui/index.html ). Most of the endpoints are pretty specific to our implementation, but we wanted to open the door to cross-indexing with other registries that might register something very different (like Singularity containers, for example). So we created a set of endpoints intended to be more generic ( https://dockstore.org:8443/static/swagger-ui/index.html#/GA4GH ) and easier for other groups to implement. These endpoints are intended to be a simple way of retrieving some or all of the containers in a tool registry.

There's a bit more in the poster here https://docs.google.com/presentation/d/1b0yLxW0Mms0rBw1x3H4wV22qtwRHhZCN1XoqAdytUwk/edit?usp=sharing

If you're interested or want to influence the schema, we're working on a new iteration of that schema based on some feedback right now actually.

Afterwards, we could cross-index each other for searching, sharing, etc.

The validation idea came later, based on a command-line utility here https://github.com/ga4gh/tool-registry-validator . We're currently working on this, but afterwards you could imagine a list of badges each pointing at Dockstore, Singularity Hub, etc. (kind of like Implementations under http://www.commonwl.org/ except for container repositories).

> Could ga4gh/dockstore also be a central place to register and provide, as a URI, the various schema ontology versions for all the different tools (Docker/Singularity/CWL/etc), to have some rigid references to test against?

Could you elaborate a bit more on this?

ps-account commented 6 years ago

> > Could ga4gh/dockstore also be a central place to register and provide, as a URI, the various schema ontology versions for all the different tools (Docker/Singularity/CWL/etc), to have some rigid references to test against?

> Could you elaborate a bit more on this?

Apologies for my poor wording, likely due to lack of a complete grasp of the topic from my side :)

I am wondering, in the context of findability, would it be a solution to not just register the containers themselves, but also provide a programmatically queryable container schema resource to register the container metadata, e.g. in the form of a centralized RDF service platform, similar to EBI's for life science data ( https://www.ebi.ac.uk/rdf/datasets/ ), but as an RDF platform for life science workflows?

Or, similarly, a way to query the ontologies for life science data analysis workflow components as you can do with life science experimental workflow components: https://www.ebi.ac.uk/ols/search?q=sample&groupField=iri&start=0

Hope I at least managed to word myself better. I guess we are either talking about the same thing, or about something completely different, or I am just out of scope (or all of these points) :)

denis-yuen commented 6 years ago

Oh ok, no worries, I think we're all coming at this from different perspectives and experiences. I agree that whatever solution we decide on should eventually (or immediately!) allow for searchable metadata about the containers. Dockstore generally relies on metadata in the CWL, so in the tool registry schema we propose that you can retrieve all tool descriptors written in descriptor languages such as CWL (or page through them for large repositories). If Singularity Hub has descriptors of some form, maybe that idea will work. If it doesn't, we are currently working on updates to the tool registry schema anyway, so please suggest or PR ideas here.

Then whatever implements cross-site indexing would be able to grab all descriptors and use whatever searching or query technology they wish.
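As one possible reading of that idea, the Python sketch below pages through descriptors from several registries and builds a small in-memory keyword index. It is an assumption-laden illustration: `fetch_page` stands in for whatever call a registry actually exposes, the offset-based paging and the `(tool_id, descriptor_text)` shape are invented here, and a real indexer could use any search technology instead.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterator, List, Set, Tuple

Descriptor = Tuple[str, str]              # (tool_id, descriptor_text)
FetchPage = Callable[[int], List[Descriptor]]  # offset -> one page of results
Index = Dict[str, Set[Tuple[str, str]]]   # token -> {(registry, tool_id)}


def iter_descriptors(fetch_page: FetchPage) -> Iterator[Descriptor]:
    """Page through a registry until it returns an empty page."""
    offset = 0
    while True:
        page = fetch_page(offset)
        if not page:
            return
        yield from page
        offset += len(page)


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(registries: Dict[str, FetchPage]) -> Index:
    """Index every descriptor from every registry by the tokens it contains."""
    index: Index = defaultdict(set)
    for registry, fetch_page in registries.items():
        for tool_id, text in iter_descriptors(fetch_page):
            for token in set(tokenize(text)):
                index[token].add((registry, tool_id))
    return index


def search(index: Index, query: str) -> Set[Tuple[str, str]]:
    """Return the (registry, tool_id) pairs that match every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


def paged(items: List[Descriptor], size: int = 2) -> FetchPage:
    """Build an in-memory stand-in for a paged registry endpoint."""
    return lambda offset: items[offset:offset + size]
```

For example, `build_index({"dockstore": paged([...]), "shub": paged([...])})` yields one index that `search` can query across both sites from a single box.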

I'm less familiar with RDF or these ontologies specifically, but I think @tetron has done some experiments with TRS and RDF. There's some prototype code in the validator, for example https://github.com/ga4gh/tool-registry-validator#convert-a-tool-registry-response-to-rdf

unito-bot commented 4 years ago

➤ Vanessasaurus commented:

Sorry, what exactly is this?