ga4gh / tool-registry-service-schemas

APIs for discovering genomics tools, their metadata and their containers
Apache License 2.0

Singularity compatibility #27

Closed denis-yuen closed 5 years ago

denis-yuen commented 6 years ago

Following up on https://github.com/ga4gh/dockstore/issues/1049

Let's make sure that the tool registry schema is compatible with Singularity. In particular, I think this means the following, although @vsoch or @pimpim may have additional pointers:

1) References to a "Dockerfile" or a "Docker image" can be generalized, maybe to "container specification" and "container image". In the comments it can be explained that examples are a Dockerfile or a Singularity Recipe http://singularity.lbl.gov/docs-recipes

2) A once-over can be done to make sure that IDs are generic enough to accommodate their container naming convention, which looks like <registry>/<namespace>/<container>:<digest>
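To sanity-check point 2, here is a minimal sketch of splitting such a name into its parts. The `ContainerRef` type and `parse_container_ref` function are made up for illustration and are not part of the schema. The sketch also assumes an optional `docker://` or `shub://` style prefix, and it treats an `@` in the last segment as a Docker-style digest separator.

```python
from typing import NamedTuple, Optional


class ContainerRef(NamedTuple):
    """Hypothetical holder for the parts of a container name."""
    scheme: Optional[str]   # e.g. "docker" or "shub"; None when absent
    registry: str
    namespace: str
    container: str
    version: Optional[str]  # tag or digest; None when absent


def parse_container_ref(ref: str) -> ContainerRef:
    """Split <registry>/<namespace>/<container>:<digest> into its parts."""
    scheme = None
    if "://" in ref:
        scheme, ref = ref.split("://", 1)
    parts = ref.split("/")
    if len(parts) < 3:
        raise ValueError(
            f"expected <registry>/<namespace>/<container>, got {ref!r}"
        )
    registry = parts[0]
    namespace = "/".join(parts[1:-1])
    last = parts[-1]
    # Assumption: "@" introduces a content digest, otherwise ":" introduces a tag.
    if "@" in last:
        container, version = last.split("@", 1)
    elif ":" in last:
        container, version = last.split(":", 1)
    else:
        container, version = last, None
    return ContainerRef(scheme, registry, namespace, container, version)
```

For example, `parse_container_ref("quay.io/seqware/seqware_full:1.1")` gives registry `quay.io`, namespace `seqware`, container `seqware_full` and version `1.1`. Names with fewer than three slash-separated parts are rejected here, which may be too strict for some registries.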


vsoch commented 6 years ago

let me know how I can help @denis-yuen !

denis-yuen commented 6 years ago

@vsoch thanks! I've created an update to the schema on a branch that tries to generalize away from Docker to containers in general https://github.com/ga4gh/tool-registry-schemas/blob/generalize/src/main/resources/swagger/ga4gh-tool-discovery.yaml Please take a look and see if anything is fundamentally incompatible with the use of Singularity images or Singularity Hub.

Not sure if you've worked with Swagger, but I like to use the swagger editor http://editor.swagger.io/#/ (File -> Import URL)

Please ask if you have any questions or if the descriptions are unclear!

ps-account commented 6 years ago

Awesome, @denis-yuen ! ( I used swagger for the first time, using the raw link worked! https://raw.githubusercontent.com/ga4gh/tool-registry-schemas/generalize/src/main/resources/swagger/ga4gh-tool-discovery.yaml )

One suggestion for consistency: change the "dockerfile" boolean in ToolVersion to "containerfile". Currently:

      dockerfile:
        type: boolean
        description: Reports if this tool has a dockerfile available.

vsoch commented 6 years ago

I know (and use) Swagger for sure! Here is my complete feedback on the specification, and a few general points:

https://singularityhub.github.io/sregistry-cli

swagger: '2.0'
info:
  title: GA4GH Tool Discovery API
  description: >-
    Proposed API for GA4GH tool repositories. A tool consists of a set of container images that are paired with a set of documents (examples include CWL or WDL) that describe how to use those images and a set of specifications for those images (examples are Dockerfiles or Singularity recipes) that describe how to reproduce those images in the future. We use the following terminology: a "container image" describes a container as stored at rest on a filesystem; a "tool" describes one of the triples as described above. In practice, examples of "tools" include CWL CommandLineTools, CWL Workflows, WDL workflows, and Nextflow workflows that reference containers in formats such as Docker or Singularity.

Acronyms tend to only be known by a small number. What about a name that goes more for the Google culture (name it what it is!) If this is a container discovery API, why not call it that.

  version: 2.0.0
produces:
  - application/json
  - text/plain
basePath: /api/ga4gh/v2

was there a version 1 already?

tags:
  - name: GA4GH
    description: A set of resources proposed as a common standard for tool repositories
paths:

Just wondering, what are tags for? Just metadata? If it's some kind of swagger search feature, it might make sense to extend this list a bit so people interested in containers can find it.

  '/tools/{id}':
    get:
      summary: 'List one specific tool, acts as an anchor for self references'
      description: >-
        This endpoint returns one specific tool (which has ToolVersions nested inside it)
      tags:
        - GA4GH
      parameters:
        - name: id
          in: path
          required: true
          type: string
          description: >-
            A unique identifier of the tool, scoped to this registry, for
            example `123456`

I might just be using this wrongly, but I think the id here would correspond to the tool uri (e.g., for Singularity we would say docker:// or shub://). But on second thought, id might be more appropriate because it's very general.

      responses:
        '200':
          description: A tool.
          schema:
            $ref: '#/definitions/Tool'

So you don't need to list the other potential responses (e.g., 404 not found, 400/401 for Unauthorized and Authentication needed)? I would imagine that some tools are not publicly available. I'll make this comment once, and just take note to think about it for all the other API calls.

  '/tools/{id}/versions':
    get:
      summary: List versions of a tool
      description: Returns all versions of the specified tool
      tags:
        - GA4GH
      parameters:
        - name: id
          in: path
          required: true
          type: string
          description: >-
            A unique identifier of the tool, scoped to this registry, for
            example `123456`
      responses:
        '200':
          description: An array of tool versions
          schema:
            type: array
            items:
              $ref: '#/definitions/ToolVersion'
  '/tools/{id}/versions/{version_id}':
    get:
      summary: 'List one specific tool version, acts as an anchor for self references'
      description: This endpoint returns one specific tool version
      tags:
        - GA4GH
      parameters:
        - name: id
          in: path
          required: true
          type: string
          description: >-
            A unique identifier of the tool, scoped to this registry, for
            example `123456`
        - name: version_id
          in: path
          required: true
          type: string
          description: >-
            An identifier of the tool version, scoped to this registry, for
            example `v1`
      responses:
        '200':
          description: A tool version.
          schema:
            $ref: '#/definitions/ToolVersion'

Maybe each version of the tool would have an associated url or uri? For example, sregistry client (if following this schema) would have a direct link to the Github repo tag, or release on pypi.

  /tools:
    get:
      summary: List all tools
      description: >
        This endpoint returns all tools available or a filtered subset using
        metadata query parameters.
      tags:
        - GA4GH
      parameters:
        - name: id
          type: string
          in: query
          description: >-
            A unique identifier of the tool, scoped to this registry, for
            example `123456`
        - name: registry
          in: query
          type: string
          description: The image registry that contains the image.

oh interesting, so this means that one registry could have more than one tool, but one tool would need a separate record for each registry?

        - name: organization
          in: query
          type: string
          description: The organization in the registry that published the image.

Do you think organization is a useful field to have?

        - name: name
          in: query
          type: string
          description: The name of the image.
        - name: toolname

If you have something like Docker with layers, does this actually reference the manifest/digest for layers to make up an image?

          in: query
          type: string
          description: The name of the tool.
        - name: description
          in: query
          type: string
          description: The description of the tool.
        - name: author
          in: query
          type: string
          description: >-
            The author of the tool (TODO a thought occurs, are we assuming that
            the author of the CWL and the image are the same?).
        - $ref: '#/parameters/offset'
        - $ref: '#/parameters/limit'

What about an external link / file with a list of authors? Or a list of authors here?

      responses:
        '200':
          description: An array of Tools that match the filter.
          schema:
            type: array
            items:
              $ref: '#/definitions/Tool'
          headers:
            next_page:
              description: >-
                A URL that can be used to reach the next page based on the
                current offset and page record limit
              type: string
            last_page:
              description: >-
                A URL that can be used to reach the last page based on the
                current page record limit
              type: string
            current_offset:
              description: The current start index of the paging used for this result
              type: string
            current_limit:
              description: The current page record limit used for this result
              type: integer

Don't forget the almighty selfLink!
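For what it's worth, a client consuming those paging headers might look roughly like the sketch below. It assumes the server leaves out `next_page` on the last page, which the spec above does not actually state. `iter_tools` and the `Fetcher` type are hypothetical names, and the HTTP call is passed in as a function so the sketch stays independent of any particular HTTP library.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# A fetcher takes a URL and returns (response headers, list of tool records).
Fetcher = Callable[[str], Tuple[Dict[str, str], List[dict]]]


def iter_tools(first_url: str, fetch: Fetcher) -> Iterator[dict]:
    """Yield every tool, following next_page until it is absent or repeats."""
    url = first_url
    seen = set()
    while url and url not in seen:
        seen.add(url)  # guard against a server that loops back to an old page
        headers, tools = fetch(url)
        yield from tools
        url = headers.get("next_page")
```

A self link in the headers would let the client record exactly which page it just read, which neither `current_offset` nor `current_limit` gives on its own.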

  '/tools/{id}/versions/{version_id}/{type}/descriptor':
    get:
      summary: Get the tool descriptor for the specified tool.
      description: Returns the descriptor for the specified tool (examples include WDL, CWL, or Nextflow documents).
      tags:
        - GA4GH
      parameters:
        - name: type
          required: true
          in: path
          description: >-
            The output type of the descriptor. If not specified it is up to the
            underlying implementation to determine which output type to return.
            Plain types return the bare descriptor while the "non-plain" types
            return a descriptor wrapped with metadata. Allowable values include
            "CWL", "WDL", "NFL", "PLAIN_CWL", "PLAIN_WDL", "PLAIN_NFL".

Why did the theme get very workflow heavy / oriented all of a sudden? I do a lot with Singularity / Docker, and even scientific workflows, but I can't say I've used any of those much :)

          type: string
        - name: id
          in: path
          description: >-
            A unique identifier of the tool, scoped to this registry, for
            example `123456`
          required: true
          type: string
        - name: version_id
          in: path
          required: true
          type: string
          description: >-
            An identifier of the tool version for this particular tool registry,
            for example `v1`
      responses:
        '200':
          description: The tool descriptor.
          schema:
            $ref: '#/definitions/ToolDescriptor'
        '404':
          description: The tool can not be output in the specified type.
          schema:
            $ref: '#/definitions/Error'
  '/tools/{id}/versions/{version_id}/{type}/descriptor/{relative_path}':
    get:
      summary: Get additional tool descriptor files relative to the main file
      description: >-
        Descriptors can often include imports that refer to additional descriptors. This returns additional descriptors for the specified tool in the same or other directories that can be reached as a relative path. This endpoint can be useful for workflow engine implementations like cwltool to programmatically download all the descriptors for a tool and run it
      tags:
        - GA4GH
      parameters:
        - name: type
          in: path
          required: true
          description: >-
            The output type of the descriptor. If not specified it is up to the
            underlying implementation to determine which output type to return.
            Plain types return the bare descriptor while the "non-plain" types
            return a descriptor wrapped with metadata. Allowable values are
            "CWL", "WDL", "NFL", "PLAIN_CWL", "PLAIN_WDL", "PLAIN_NFL".
          type: string
        - name: id
          in: path
          description: >-
            A unique identifier of the tool, scoped to this registry, for example `123456`
          required: true
          type: string
        - name: version_id
          in: path
          required: true
          type: string
          description: >-
            An identifier of the tool version for this particular tool registry, for example `v1`
        - name: relative_path
          in: path
          required: true
          type: string
          description: >-
            A relative path to the additional file (same directory or
            subdirectories), for example 'foo.cwl' would return a 'foo.cwl' from
            the same directory as the main descriptor. 'nestedDirectory/foo.cwl' would return the file 
            from a nested subdirectory
      responses:
        '200':
          description: The tool descriptor.
          schema:
            $ref: '#/definitions/ToolDescriptor'
        '404':
          description: The tool can not be output in the specified type.
          schema:
            $ref: '#/definitions/Error'

Interesting, I think it just swung in a direction of being an API for workflow tools? I think the API or spec should probably be agnostic to the actual kind/type of tool, meaning that you could allow for more of a registry, or more of a workflow thing. I might not have a good understanding of what you are describing - could you give me a few sentence summary (in easy to understand, baby dinosaur terms, haha).

  '/tools/{id}/versions/{version_id}/{type}/tests':
    get:
      summary: Get an array of test JSONs suitable for use with this descriptor type.
      tags:
        - GA4GH
      parameters:
        - name: type
          required: true
          in: path
          description: >-
            The output type of the descriptor. If not specified it is up to the
            underlying implementation to determine which output type to return.
            Plain types return the bare descriptor while the "non-plain" types
            return a descriptor wrapped with metadata. Allowable values are
            "CWL", "WDL", "NFL", "PLAIN_CWL", "PLAIN_WDL", and "PLAIN_NFL"
          type: string
        - name: id
          in: path
          description: >-
            A unique identifier of the tool, scoped to this registry, for
            example `123456`
          required: true
          type: string
        - name: version_id
          in: path
          required: true
          type: string
          description: >-
            An identifier of the tool version for this particular tool registry,
            for example `v1`
      responses:
        '200':
          description: The tool test JSON response.
          schema:
            type: array
            items:
              $ref: '#/definitions/ToolTests'
        '404':
          description: The tool can not be output in the specified type.
          schema:
            $ref: '#/definitions/Error'
  '/tools/{id}/versions/{version_id}/containerfile':
    get:
      summary: Get the container specification(s) for the specified image.
      description: Returns the container specifications(s) for the specified image. For example, a CWL CommandlineTool can be associated with one specification for a container, a CWL Workflow can be associated with multiple specifications for containers 
      tags:
        - GA4GH
      parameters:
        - name: id
          in: path
          description: >-
            A unique identifier of the tool, scoped to this registry, for
            example `123456`
          required: true
          type: string
        - name: version_id
          in: path
          required: true
          type: string
          description: >-
            An identifier of the tool version for this particular tool registry,
            for example `v1`
      responses:
        '200':
          description: The tool payload.
          schema:
            type: array
            items:
              $ref: '#/definitions/ToolContainerfile'
        '404':
          description: There are no container specifications for this tool
          schema:
            $ref: '#/definitions/Error'

I wouldn't necessarily call this tests - a test is an assertion that something is, or isn't. A specification is a build recipe for a thing.

  /metadata:
    get:
      summary: Return some metadata that is useful for describing this registry
      description: Return some metadata that is useful for describing this registry
      tags:
        - GA4GH
      responses:
        '200':
          description: A Metadata object describing this service.
          schema:
            $ref: '#/definitions/Metadata'

I like this one a lot :) This is where you would put things like labels, environment. It maps nicely to the Singularity inspect command. Docker has that too. Maybe call it inspect instead of metadata?

  /toolClasses:
    get:
      summary: List all tool types
      description: |
        This endpoint returns all tool-classes available
      tags:
        - GA4GH
      responses:
        '200':
          description: A list of potential tool classes.
          schema:
            type: array
            items:
              $ref: '#/definitions/ToolClass'

Are these classes variable, or pre-defined by the specification? What are they? Going to skip some here...

...
      signed:
        type: boolean
        description: Reports whether this tool has been signed.
      versions:
        description: A list of versions for this tool
        type: array
        items:
          $ref: '#/definitions/ToolVersion'

The one thing maybe missing here is the distinction between a version, and then something like a url/uri to always be able to get that version.


      dockerfile:
        type: boolean
        description: Reports if this tool has a dockerfile available.
      meta_version:
        type: string
        description: >-
          The version of this tool version in the registry. Iterates when fields
          like the description, author, etc. are updated.
      verified:
        type: boolean
        description: >-
          Reports whether this tool has been verified by a specific organization
          or individual
      verified_source:
        type: string
        description: >-
          Source of metadata that can support a verified tool, such as an email
          or URL

Do you want to stick to the standard of a Dockerfile? The Singularity equivalent is called Singularity. I think if a tool is defined, it should be up to the tool to define what its build recipe is called. But if you believe Docker is standard enough that most will at least be able to ask the question "Do we have a Dockerfile?" then this is probably ok.

  DescriptorType:
    type: string
    enum:
      - CWL
      - WDL
      - NFL

Again, I would question why the focus on workflow stuff! This is coming out really cool!

Overall, for compatibility with Singularity we would want to have places for:

A Singularity image can be started as an instance, and that means having a start script. So for now I would distinguish those two different kinds of Singularity images - just container executables, and container executables + instances.

Docker also has different manifests that are determined based on the user's wanted operating system and architecture (see the schemaVersion 2.0) so that might be something to take into account.
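To illustrate the manifest point, here is a minimal sketch of picking an image by platform. It assumes a Docker manifest list (schema version 2) where each entry under `manifests` carries a `digest` and a `platform` object with `os` and `architecture`. The function name and the digests in the example are made up.

```python
from typing import Optional


def pick_manifest_digest(manifest_list: dict, os_name: str, arch: str) -> Optional[str]:
    """Return the digest of the first manifest matching the OS and architecture."""
    for entry in manifest_list.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == arch:
            return entry.get("digest")
    return None


# Made-up example data; real digests are full sha256 hashes.
EXAMPLE_LIST = {
    "schemaVersion": 2,
    "manifests": [
        {"digest": "sha256:aaa", "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": "sha256:bbb", "platform": {"os": "linux", "architecture": "arm64"}},
    ],
}
```

So a single tool version could resolve to different images depending on who is asking, which the schema would need some way to express.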

denis-yuen commented 6 years ago

Hi,

Reading from top to bottom: @pimpim made the change to toolversion, missed that one!

@vsoch

In that this is a discovery (or other interface) for finding containers, it sounds a lot like the Singularity Registry Global client I've been working on ...

Cool, I wasn't aware of this project, it looks pretty interesting. The tool registry schema is an interface for finding workflows which happen to use containers, so there definitely seems like there could be some overlap in that a Singularity Registry Global client could look at this API to find containers to use.

was there a version 1 already?

Yup, but we needed to iterate to 2 since we had some input that version 1 did not match some recommendations to make it more protobuf/Javascript friendly that were unfortunately not backwards compatible.

Just wondering, what are tags for?

AFAIK, they're just used to organize endpoints in the swagger editor/ui. In other words, if your server implements a lot of endpoints for other purposes, you can group these ones together for easy viewing.

(Edit to add: re-reading, they also group generated client API methods and server endpoints when using swagger-codegen)

So you don't need to list the other potential responses (e.g., 404 not found, 400/401 for Unauthorized and Authentication needed)

To be thorough, we should probably enumerate more responses

so this means that one registry could have more than one tool, but one tool would need a separate record for each registry

So as an example, this API might have workflows that use Docker images from either quay.io or Docker Hub. Setting this to "quay.io" would return the workflows that use images from quay.io. Workflows could use images from both registries, and I would assume this would return them too (i.e. "or" rather than "and").
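A minimal sketch of that "or" behaviour, for discussion only: the `images` field on each tool record is hypothetical (the schema has no such list), and the registry is assumed to be everything before the first "/" of an image name.

```python
from typing import Iterable, List


def registry_of(image: str) -> str:
    """Take everything before the first "/" as the registry host."""
    return image.split("/", 1)[0]


def filter_by_registry(tools: Iterable[dict], registry: str) -> List[dict]:
    """Keep tools that use at least one image from the given registry."""
    return [
        tool
        for tool in tools
        if any(registry_of(img) == registry for img in tool.get("images", []))
    ]


# Made-up records: "c" mixes both registries and should still match quay.io.
EXAMPLE_TOOLS = [
    {"id": "a", "images": ["quay.io/org/x:1"]},
    {"id": "b", "images": ["registry.hub.docker.com/org/y:1"]},
    {"id": "c", "images": ["quay.io/org/z:2", "registry.hub.docker.com/org/w:1"]},
]
```

With this data, filtering on `quay.io` keeps tools `a` and `c`. Note that bare Docker Hub names such as `ubuntu:16.04` carry no registry prefix, so a real implementation would need a default for them.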

Do you think organization is a useful field to have?

I think so, for example filtering tools from https://quay.io/organization/epigenomicscrew or https://quay.io/organization/ga4gh-dream

If you have something like Docker with layers, does this actually reference the manifest/digest for layers to make up an image?

I'm guessing no. Do you have a use-case for sharing this?

What about an external link / file with a list of authors? Or a list of authors here?

I'm fine with either if you prefer one.

Why did the theme get very workflow heavy / oriented all of a sudden?

I tried to describe this in the intro, perhaps not as well as I could have. In short, for Dockstore we found a lot of value in describing containers in some way. The easy example is CWL (Common Workflow Language) which can describe a single container as a CommandLineTool or strung together in a Workflow. This provides a handy place to describe the parameters that a container takes, what input files it might use, output files, metadata, etc. So we sought out similarly minded folks and found the GA4GH (Global Alliance for Genomics and Health), specifically the Containers and Workflows task team (now called the Cloud Work Stream). Hopefully that explains both the history and the practical reason we approach it this way.

There's a bit of description under info -> description, which I probably need to beef up.

I wouldn't necessarily call this tests - a test is an assertion that something is, or isn't. A specification is a build recipe for a thing.

There's actually both here. Build recipes are returned by '/tools/{id}/versions/{version_id}/containerfile' and tests are returned by /tools/{id}/versions/{version_id}/{type}/tests. The idea with tests is that they're supposed to be sets of parameters that run a workflow successfully. This is used well in the GA4GH-DREAM challenge where people exchange dockerized workflows to be run and see how many platforms successfully run them (It's entirely possible that an implementation of this API doesn't have these in which case, it would just return an empty array)

Do you want to stick to the standard of a Dockerfile?

Caught above, I changed most references to a containerfile but missed this one.

Ok, this is getting longish, so breaking out the overall bit into a separate comment.

denis-yuen commented 6 years ago

So overall:

the registry url to download

Sure, do you have an example for me to put as an example description? We normally just pull the image ( https://github.com/ga4gh/tool-registry-schemas/blob/develop/src/main/resources/swagger/ga4gh-tool-discovery.yaml#L381 would be docker pull quay.io/seqware/seqware_full/1.1 as an example)

the uri (name) of the image

ok

a link to the build recipe (Singularity.* file)

This was meant to be ToolContainerFile https://github.com/ga4gh/tool-registry-schemas/blob/generalize/src/main/resources/swagger/ga4gh-tool-discovery.yaml#L512 which can store a URL to a recipe and/or the recipe directly.
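Roughly, I picture a record like the sketch below. The field names `containerfile` and `url` are my assumption here and should be checked against the linked definition, and the "at least one of the two" check is only my reading of "and/or".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolContainerfile:
    """Hypothetical client-side mirror of the schema definition."""
    containerfile: Optional[str] = None  # the recipe text itself
    url: Optional[str] = None            # a link to the recipe

    def __post_init__(self) -> None:
        # Assumption: a record with neither the recipe nor a link is useless.
        if self.containerfile is None and self.url is None:
            raise ValueError("need the recipe text, a URL to it, or both")


# A Singularity recipe carried inline, with no URL.
inline = ToolContainerfile(containerfile="Bootstrap: docker\nFrom: ubuntu:16.04\n")
```

Nothing in the record says whether the text is a Dockerfile or a Singularity recipe, so a type field may be worth considering.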

one or more container entrypoints (e.g., see Scientific Filesystem https://vsoch.github.io/scif)

We tend to have the CWL or WDL documents describe the entrypoint. Can the scientific filesystem be used as a descriptor?

inspect metadata (labels, definition file, runscript, environment)

some containers are intended to work with host drivers via the --nv flag

what about the image format? E.g., Singularity started as ext3, now is squashfs, which means not writable.

I think maybe you have more experience than I with these options and can maybe make a PR or similar for these?

Hopefully that's a good start. I'll try to incorporate some of the more straight-forward changes directly.

ps-account commented 6 years ago

Great comments! Considering the --nv: Maybe an idea to make a general custom flag entry for containers?

ps-account commented 6 years ago

As a side note: does ga4gh have a provision for nvidia-docker? Though maybe not (yet) a large issue in the context of ga4gh.

denis-yuen commented 6 years ago

For nvidia-docker, probably not (yet?). Since this API was intended to share the container image (what to run) and the descriptor (how to run it), this sounds like we would currently punt to the workflow language. i.e. I don't know how custom hints like that are dealt with in languages like CWL, might be a question for @mr-c or @tetron

(I know there was some work on things like udocker. Is there a CWL Requirement for that, like http://www.commonwl.org/v1.0/CommandLineTool.html#DockerRequirement for Docker?)

Edit to add: should probably have mentioned at the top that if you wish to play with a live copy of the non-generalized v1 version to see what it might look like in Docker-land, you can play around with https://dockstore.org:8443/static/swagger-ui/index.html#/GA4GH

denis-yuen commented 6 years ago

FYI, was reviewing changes for CWL 1.1 and I did run into https://github.com/common-workflow-language/common-workflow-language/issues/587 and https://github.com/common-workflow-language/common-workflow-language/issues/374 which seem to be the way CWL tackles GPU workflows

denis-yuen commented 5 years ago

Thanks for the heads-up @rishidev The above commit has been merged but there is probably good feedback above as well. However, I think in the spirit of the Monday call to move things along, I'd encourage PRs for discussion for the items I didn't get to. Thanks!