Closed denis-yuen closed 5 years ago
let me know how I can help @denis-yuen !
@vsoch thanks! I've created an update to the schema on a branch that tries to generalize away from Docker to containers in general https://github.com/ga4gh/tool-registry-schemas/blob/generalize/src/main/resources/swagger/ga4gh-tool-discovery.yaml Please take a look and see if anything is fundamentally incompatible with the use of Singularity images or Singularity Hub.
Not sure if you've worked with Swagger, but I like to use the swagger editor http://editor.swagger.io/#/ (File -> Import URL)
Please ask if you have any questions or if the descriptions are unclear!
Awesome, @denis-yuen ! ( I used swagger for the first time, using the raw link worked! https://raw.githubusercontent.com/ga4gh/tool-registry-schemas/generalize/src/main/resources/swagger/ga4gh-tool-discovery.yaml )
One suggestion for consistency: change the "dockerfile" boolean in ToolVersion to "containerfile". Currently:
dockerfile:
type: boolean
description: Reports if this tool has a dockerfile available.
I know (and use) Swagger for sure! Here is my complete feedback on the specification, and a few general points:
https://singularityhub.github.io/sregistry-cli
swagger: '2.0'
info:
title: GA4GH Tool Discovery API
description: >-
Proposed API for GA4GH tool repositories. A tool consists of a set of container images that are paired with a set of documents (examples include CWL or WDL) that describes how to use those images and a set of specifications for those images (examples are Dockerfiles or Singularity recipes) that describe how to re-produce those images in the future. We use the following terminology, an "container image" describes a container as stored at rest on a filesystem, a "tool" describes one of the triples as described above. In practice, examples of "tools" include CWL CommandLineTools, CWL Workflows, WDL workflows, and Nextflow workflows that reference containers in formats such as Docker or Singularity.
Acronyms tend to be known only by a small number of people. What about a name that goes more for the Google culture (name it what it is!)? If this is a container discovery API, why not call it that?
version: 2.0.0
produces:
- application/json
- text/plain
basePath: /api/ga4gh/v2
was there a version 1 already?
tags:
- name: GA4GH
description: A set of resources proposed as a common standard for tool repositories
paths:
Just wondering, what are tags for? Just metadata? If it's some kind of swagger search feature, it might make sense to extend this list a bit so people interested in containers can find it.
'/tools/{id}':
get:
summary: 'List one specific tool, acts as an anchor for self references'
description: >-
This endpoint returns one specific tool (which has ToolVersions nested inside it)
tags:
- GA4GH
parameters:
- name: id
in: path
required: true
type: string
description: >-
A unique identifier of the tool, scoped to this registry, for
example `123456`
I might just be using this wrongly, but I think the id here would correspond to the tool uri (e.g., for Singularity we would say docker:// or shub://). But on second thought, id might be more appropriate because it's very general.
responses:
'200':
description: A tool.
schema:
$ref: '#/definitions/Tool'
So you don't need to list the other potential responses (e.g., 404 Not Found, 400/401 for Unauthorized and Authentication Needed)? I would imagine that some tools are not publicly available. I'll make this comment once; just take note to think about it for all the other API calls.
'/tools/{id}/versions':
get:
summary: List versions of a tool
description: Returns all versions of the specified tool
tags:
- GA4GH
parameters:
- name: id
in: path
required: true
type: string
description: >-
A unique identifier of the tool, scoped to this registry, for
example `123456`
responses:
'200':
description: An array of tool versions
schema:
type: array
items:
$ref: '#/definitions/ToolVersion'
'/tools/{id}/versions/{version_id}':
get:
summary: 'List one specific tool version, acts as an anchor for self references'
description: This endpoint returns one specific tool version
tags:
- GA4GH
parameters:
- name: id
in: path
required: true
type: string
description: >-
A unique identifier of the tool, scoped to this registry, for
example `123456`
- name: version_id
in: path
required: true
type: string
description: >-
An identifier of the tool version, scoped to this registry, for
example `v1`
responses:
'200':
description: A tool version.
schema:
$ref: '#/definitions/ToolVersion'
Maybe each version of the tool would have an associated url or uri? For example, the sregistry client (if following this schema) would have a direct link to the GitHub repo tag, or to the release on PyPI.
/tools:
get:
summary: List all tools
description: >
This endpoint returns all tools available or a filtered subset using
metadata query parameters.
tags:
- GA4GH
parameters:
- name: id
type: string
in: query
description: >-
A unique identifier of the tool, scoped to this registry, for
example `123456`
- name: registry
in: query
type: string
description: The image registry that contains the image.
oh interesting, so this means that one registry could have more than one tool, but one tool would need a separate record for each registry?
- name: organization
in: query
type: string
description: The organization in the registry that published the image.
Do you think organization is a useful field to have?
- name: name
in: query
type: string
description: The name of the image.
If you have something like Docker with layers, does this actually reference the manifest/digest for layers to make up an image?
- name: toolname
in: query
type: string
description: The name of the tool.
- name: description
in: query
type: string
description: The description of the tool.
- name: author
in: query
type: string
description: >-
The author of the tool (TODO a thought occurs, are we assuming that
the author of the CWL and the image are the same?).
- $ref: '#/parameters/offset'
- $ref: '#/parameters/limit'
What about an external link / file with a list of authors? Or a list of authors here?
responses:
'200':
description: An array of Tools that match the filter.
schema:
type: array
items:
$ref: '#/definitions/Tool'
headers:
next_page:
description: >-
A URL that can be used to reach the next page based on the
current offset and page record limit
type: string
last_page:
description: >-
A URL that can be used to reach the last page based on the
current page record limit
type: string
current_offset:
description: The current start index of the paging used for this result
type: string
current_limit:
description: The current page record limit used for this result
type: integer
Don't forget the almighty selfLink!
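To make the paging headers above concrete, here is a minimal sketch of how a client might walk through all pages by following `next_page`. The `iter_tools` helper, the injected `fetch` callable, and the two fake pages are hypothetical and only for illustration. The sketch also assumes that the last page simply omits the `next_page` header, which the schema does not state.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# A page is (tools, response headers). `fetch` is injected so the loop can be
# exercised without a live registry.
Page = Tuple[List[dict], Dict[str, str]]


def iter_tools(first_url: str, fetch: Callable[[str], Page]) -> Iterator[dict]:
    """Yield every tool by following the `next_page` header from page to page."""
    url = first_url
    while url:
        tools, headers = fetch(url)
        yield from tools
        # Assumption: the final page carries no `next_page` header.
        url = headers.get("next_page")


# Hypothetical two-page registry, used only for illustration.
FAKE_PAGES: Dict[str, Page] = {
    "/tools?offset=0&limit=2": (
        [{"id": "1"}, {"id": "2"}],
        {"next_page": "/tools?offset=2&limit=2"},
    ),
    "/tools?offset=2&limit=2": ([{"id": "3"}], {}),
}

tool_ids = [
    tool["id"]
    for tool in iter_tools("/tools?offset=0&limit=2", FAKE_PAGES.__getitem__)
]
print(tool_ids)
```

A self link would fit the same pattern: one more header (or field) that echoes the URL of the page currently being read.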
'/tools/{id}/versions/{version_id}/{type}/descriptor':
get:
summary: Get the tool descriptor for the specified tool.
description: Returns the descriptor for the specified tool (examples include WDL, CWL, or Nextflow documents).
tags:
- GA4GH
parameters:
- name: type
required: true
in: path
description: >-
The output type of the descriptor. If not specified it is up to the
underlying implementation to determine which output type to return.
Plain types return the bare descriptor while the "non-plain" types
return a descriptor wrapped with metadata. Allowable values include
"CWL", "WDL", "NFL", "PLAIN_CWL", "PLAIN_WDL", "PLAIN_NFL".
type: string
Why did the theme get very workflow heavy / oriented all of a sudden? I do a lot with Singularity / Docker, and even scientific workflows, but I can't say I've used any of those much :)
- name: id
in: path
description: >-
A unique identifier of the tool, scoped to this registry, for
example `123456`
required: true
type: string
- name: version_id
in: path
required: true
type: string
description: >-
An identifier of the tool version for this particular tool registry,
for example `v1`
responses:
'200':
description: The tool descriptor.
schema:
$ref: '#/definitions/ToolDescriptor'
'404':
description: The tool can not be output in the specified type.
schema:
$ref: '#/definitions/Error'
'/tools/{id}/versions/{version_id}/{type}/descriptor/{relative_path}':
get:
summary: Get additional tool descriptor files relative to the main file
description: >-
Descriptors can often include imports that refer to additional descriptors. This returns additional descriptors for the specified tool in the same or other directories that can be reached as a relative path. This endpoint can be useful for workflow engine implementations like cwltool to programmatically download all the descriptors for a tool and run it
tags:
- GA4GH
parameters:
- name: type
in: path
required: true
description: >-
The output type of the descriptor. If not specified it is up to the
underlying implementation to determine which output type to return.
Plain types return the bare descriptor while the "non-plain" types
return a descriptor wrapped with metadata. Allowable values are
"CWL", "WDL", "NFL", "PLAIN_CWL", "PLAIN_WDL", "PLAIN_NFL".
type: string
- name: id
in: path
description: >-
A unique identifier of the tool, scoped to this registry, for example `123456`
required: true
type: string
- name: version_id
in: path
required: true
type: string
description: >-
An identifier of the tool version for this particular tool registry, for example `v1`
- name: relative_path
in: path
required: true
type: string
description: >-
A relative path to the additional file (same directory or
subdirectories), for example 'foo.cwl' would return a 'foo.cwl' from
the same directory as the main descriptor. 'nestedDirectory/foo.cwl' would return the file
from a nested subdirectory
responses:
'200':
description: The tool descriptor.
schema:
$ref: '#/definitions/ToolDescriptor'
'404':
description: The tool can not be output in the specified type.
schema:
$ref: '#/definitions/Error'
Interesting, I think it just swung in a direction of being an API for workflow tools? I think the API or spec should probably be agnostic to the actual kind/type of tool, meaning that you could allow for more of a registry, or more of a workflow thing. I might not have a good understanding of what you are describing - could you give me a few sentence summary (in easy to understand, baby dinosaur terms, haha).
'/tools/{id}/versions/{version_id}/{type}/tests':
get:
summary: Get an array of test JSONs suitable for use with this descriptor type.
tags:
- GA4GH
parameters:
- name: type
required: true
in: path
description: >-
The output type of the descriptor. If not specified it is up to the
underlying implementation to determine which output type to return.
Plain types return the bare descriptor while the "non-plain" types
return a descriptor wrapped with metadata. Allowable values are
"CWL", "WDL", "NFL", "PLAIN_CWL", "PLAIN_WDL", and "PLAIN_NFL"
type: string
- name: id
in: path
description: >-
A unique identifier of the tool, scoped to this registry, for
example `123456`
required: true
type: string
- name: version_id
in: path
required: true
type: string
description: >-
An identifier of the tool version for this particular tool registry,
for example `v1`
responses:
'200':
description: The tool test JSON response.
schema:
type: array
items:
$ref: '#/definitions/ToolTests'
'404':
description: The tool can not be output in the specified type.
schema:
$ref: '#/definitions/Error'
'/tools/{id}/versions/{version_id}/containerfile':
get:
summary: Get the container specification(s) for the specified image.
description: Returns the container specifications(s) for the specified image. For example, a CWL CommandlineTool can be associated with one specification for a container, a CWL Workflow can be associated with multiple specifications for containers
tags:
- GA4GH
parameters:
- name: id
in: path
description: >-
A unique identifier of the tool, scoped to this registry, for
example `123456`
required: true
type: string
- name: version_id
in: path
required: true
type: string
description: >-
An identifier of the tool version for this particular tool registry,
for example `v1`
responses:
'200':
description: The tool payload.
schema:
type: array
items:
$ref: '#/definitions/ToolContainerfile'
'404':
description: There are no container specifications for this tool
schema:
$ref: '#/definitions/Error'
I wouldn't necessarily call this tests - a test is an assertion that something is, or isn't. A specification is a build recipe for a thing.
/metadata:
get:
summary: Return some metadata that is useful for describing this registry
description: Return some metadata that is useful for describing this registry
tags:
- GA4GH
responses:
'200':
description: A Metadata object describing this service.
schema:
$ref: '#/definitions/Metadata'
I like this one a lot :) This is where you would put things like labels, environment. It maps nicely to the Singularity inspect command. Docker has that too. Maybe call it inspect instead of metadata?
/toolClasses:
get:
summary: List all tool types
description: |
This endpoint returns all tool-classes available
tags:
- GA4GH
responses:
'200':
description: A list of potential tool classes.
schema:
type: array
items:
$ref: '#/definitions/ToolClass'
Are these classes variables, or are they pre-defined by the specification? What are they? Going to skip some here...
...
signed:
type: boolean
description: Reports whether this tool has been signed.
versions:
description: A list of versions for this tool
type: array
items:
$ref: '#/definitions/ToolVersion'
The one thing maybe missing here is the distinction between a version, and then something like a url/uri to always be able to get that version.
dockerfile:
type: boolean
description: Reports if this tool has a dockerfile available.
meta_version:
type: string
description: >-
The version of this tool version in the registry. Iterates when fields
like the description, author, etc. are updated.
verified:
type: boolean
description: >-
Reports whether this tool has been verified by a specific organization
or individual
verified_source:
type: string
description: >-
Source of metadata that can support a verified tool, such as an email
or URL
Do you want to stick to the standard of a Dockerfile? The Singularity equivalent is called Singularity. I think if a tool is defined, it should be up to the tool to define what its build recipe is called. But if you believe Docker is standard enough that most will at least be able to ask the question "Do we have a Dockerfile?" then this is probably ok.
DescriptorType:
type: string
enum:
- CWL
- WDL
- NFL
Again, I would question why the focus is on workflow stuff! This is coming out really cool!
Overall, for compatibility with Singularity we would want to have places for:
A Singularity image can be started as an instance, and that means having a start script. So for now I would distinguish those two different kinds of Singularity images - just container executables, and container executables + instances.
Docker also has different manifests that are determined based on the user's wanted operating system and architecture (see the schemaVersion 2.0) so that might be something to take into account.
Hi,
Reading from top to bottom: @pimpim made the change to toolversion, missed that one!
@vsoch
In that this is a discovery (or other interface) for finding containers, it sounds a lot like the Singularity Registry Global client I've been working on ...
Cool, I wasn't aware of this project, it looks pretty interesting. The tool registry schema is an interface for finding workflows which happen to use containers, so there definitely seems like there could be some overlap in that a Singularity Registry Global client could look at this API to find containers to use.
was there a version 1 already?
Yup, but we needed to iterate to 2 since we had some input that version 1 did not match some recommendations to make it more protobuf/Javascript friendly that were unfortunately not backwards compatible.
Just wondering, what are tags for?
AFAIK, they're just used to organize endpoints in the swagger editor/ui. In other words, if your server implements a lot of endpoints for other purposes, you can group these ones together for easy viewing.
(Edit to add: re-reading, they also group generated client API methods and server endpoints when using swagger-codegen)
So you don't need to list the other potential responses (e.g., 404 not found, 400/401 for Unauthorized and Authentication needed)
To be thorough, we should probably enumerate more responses.
so this means that one registry could have more than one tool, but one tool would need a separate record for each registry
As an example, this API might have workflows that use Docker images from either quay.io or Docker Hub. Setting this to "quay.io" would return the workflows that use images from quay.io. Workflows could use images from both registries, and I would assume this would return them too (i.e., OR rather than AND).
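A minimal sketch of that "or" behaviour, under the assumption that each workflow record carries the list of registries its images come from. The `workflows` records and the `filter_by_registry` helper are hypothetical and not part of the schema.

```python
from typing import Dict, List


def filter_by_registry(workflows: List[Dict], registry: str) -> List[Dict]:
    """Keep workflows that use at least one image from `registry`."""
    return [w for w in workflows if registry in w["registries"]]


# Hypothetical records: each workflow lists the registries its images come from.
workflows = [
    {"id": "wf-quay", "registries": ["quay.io"]},
    {"id": "wf-hub", "registries": ["registry.hub.docker.com"]},
    {"id": "wf-both", "registries": ["quay.io", "registry.hub.docker.com"]},
]

matches = [w["id"] for w in filter_by_registry(workflows, "quay.io")]
print(matches)
```

The workflow that mixes both registries is returned alongside the quay.io-only one, which is the behaviour assumed in the reply above.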
Do you think organization is a useful field to have?
I think so, for example filtering tools from https://quay.io/organization/epigenomicscrew or https://quay.io/organization/ga4gh-dream
If you have something like Docker with layers, does this actually reference the manifest/digest for layers to make up an image?
I'm guessing no. Do you have a use-case for sharing this?
What about an external link / file with a list of authors? Or a list of authors here?
I'm fine with either if you prefer one.
Why did the theme get very workflow heavy / oriented all of a sudden?
I tried to describe this in the intro, perhaps not as well as I could have. In short, for Dockstore we found a lot of value in describing containers in some way. The easy example is CWL (Common Workflow Language) which can describe a single container as a CommandLineTool or strung together in a Workflow. This provides a handy place to describe the parameters that a container takes, what input files it might use, output files, metadata, etc. So we sought out similarly minded folks and found the GA4GH (Global Alliance for Genomics and Health), specifically the Containers and Workflows task team (now called the Cloud Work Stream). Hopefully that explains both the history and the practical reason we approach it this way.
There's a bit of description under info -> description, which I probably need to beef up.
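To illustrate the `{type}` values quoted from the schema ("CWL", "WDL", "NFL" and their "PLAIN_" variants), here is a small sketch that splits a value into the descriptor language and a flag saying whether the bare descriptor was requested. `split_descriptor_type` is a hypothetical helper, not something the schema defines.

```python
from typing import Tuple

# Values listed in the schema's description of the `type` path parameter.
ALLOWED_TYPES = {"CWL", "WDL", "NFL", "PLAIN_CWL", "PLAIN_WDL", "PLAIN_NFL"}
PLAIN_PREFIX = "PLAIN_"


def split_descriptor_type(value: str) -> Tuple[str, bool]:
    """Return (language, is_plain) for one of the allowed type values."""
    if value not in ALLOWED_TYPES:
        raise ValueError(f"unsupported descriptor type: {value!r}")
    is_plain = value.startswith(PLAIN_PREFIX)
    language = value[len(PLAIN_PREFIX):] if is_plain else value
    return language, is_plain


print(split_descriptor_type("PLAIN_CWL"))
print(split_descriptor_type("WDL"))
```

Plain types return the bare descriptor, while the non-plain types return the descriptor wrapped with metadata, so a client would branch on the second element.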
I wouldn't necessarily call this tests - a test is an assertion that something is, or isn't. A specification is a build recipe for a thing.
There's actually both here. Build recipes are returned by '/tools/{id}/versions/{version_id}/containerfile' and tests are returned by '/tools/{id}/versions/{version_id}/{type}/tests'. The idea with tests is that they're supposed to be sets of parameters that run a workflow successfully. This is used well in the GA4GH-DREAM challenge, where people exchange dockerized workflows to be run and see how many platforms successfully run them. (It's entirely possible that an implementation of this API doesn't have these, in which case it would just return an empty array.)
Do you want to stick to the standard of a Dockerfile?
Caught above, I changed most references to a containerfile but missed this one.
Ok, this is getting longish, so breaking out the overall bit into a separate comment.
So overall:
the registry url to download
Sure, do you have an example for me to put as an example description? We normally just pull the image ( https://github.com/ga4gh/tool-registry-schemas/blob/develop/src/main/resources/swagger/ga4gh-tool-discovery.yaml#L381 would be docker pull quay.io/seqware/seqware_full/1.1 as an example)
the uri (name) of the image
ok
a link to the build recipe (Singularity.* file)
This was meant to be ToolContainerFile https://github.com/ga4gh/tool-registry-schemas/blob/generalize/src/main/resources/swagger/ga4gh-tool-discovery.yaml#L512 which can store a URL to a recipe and/or the recipe directly.
one or more container entrypoints (e.g., see Scientific Filesystem https://vsoch.github.io/scif)
We tend to have the CWL or WDL documents describe the entrypoint. Can the scientific filesystem be used as a descriptor?
inspect metadata (labels, definition file, runscript, environment)
some containers are intended to work with host drivers via the --nv flag
what about the image format? E.g., Singularity started as ext3, now is squashfs, which means not writable.
I think maybe you have more experience than I with these options and can maybe make a PR or similar for these?
Hopefully that's a good start. I'll try to incorporate some of the more straight-forward changes directly.
Great comments! Regarding --nv: maybe it would be an idea to add a general custom-flag entry for containers?
As a side note: does ga4gh have a provision for nvidia-docker? Though maybe not (yet) a large issue in the context of ga4gh.
For nvidia-docker, probably not (yet?). Since this API was intended to share the container image (what to run) and the descriptor (how to run it), this sounds like we would currently punt to the workflow language. i.e. I don't know how custom hints like that are dealt with in languages like CWL, might be a question for @mr-c or @tetron
(I know there was some work on things like udocker. Is there a CWL Requirement for that, like the http://www.commonwl.org/v1.0/CommandLineTool.html#DockerRequirement for Docker?)
Edit to add: should probably have mentioned at the top that if you wish to play with a live copy of the non-generalized v1 version to see what it might look like in Docker-land, you can play around with https://dockstore.org:8443/static/swagger-ui/index.html#/GA4GH
FYI, was reviewing changes for CWL 1.1 and I did run into https://github.com/common-workflow-language/common-workflow-language/issues/587 and https://github.com/common-workflow-language/common-workflow-language/issues/374 which seem to be the way CWL tackles GPU workflows
Thanks for the heads-up, @rishidev. The above commit has been merged, but there is probably more good feedback above as well. However, in the spirit of the Monday call to move things along, I'd encourage PRs for discussion of the items I didn't get to. Thanks!
Following up on https://github.com/ga4gh/dockstore/issues/1049
Let's make sure that the tool registry schema is compatible with Singularity. In particular I think this means the following, although @vsoch or @pimpim may have additional pointers:
1) References to a "Dockerfile" or a "Docker image" can be generalized, maybe to "container specification" and "container image". In the comments it can be explained that examples are a Dockerfile or a Singularity Recipe http://singularity.lbl.gov/docs-recipes
2) A once-over can be done to make sure that IDs are generic enough to accommodate their container naming convention, which looks like
<registry>/<namespace>/<container>:<digest>
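To check whether an ID field can carry that convention, a small sketch of splitting such a name into its four parts may help. `ContainerRef` and `parse_container_ref` are hypothetical names, the example reference is made up, and the sketch assumes the digest itself contains no ":" character.

```python
from typing import NamedTuple


class ContainerRef(NamedTuple):
    registry: str
    namespace: str
    container: str
    digest: str


def parse_container_ref(ref: str) -> ContainerRef:
    """Split '<registry>/<namespace>/<container>:<digest>' into its parts."""
    # Assumption: the digest contains no ':' so the last colon is the separator.
    path, sep, digest = ref.rpartition(":")
    parts = path.split("/")
    if not sep or not digest or len(parts) != 3 or not all(parts):
        raise ValueError(f"not a <registry>/<namespace>/<container>:<digest> name: {ref!r}")
    return ContainerRef(parts[0], parts[1], parts[2], digest)


ref = parse_container_ref("registry.example.org/vsoch/hello-world:abc123")
print(ref)
```

If the schema's `id` is an opaque string, a name like this fits as-is. The split only matters if the registry, organization, and name filters are expected to be derived from it.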