ResearchObject / workflow-run-crate

Workflow Run RO-Crate profile
https://www.researchobject.org/workflow-run-crate/
Apache License 2.0

CQ1 - Container image #9

Closed simleo closed 1 year ago

simleo commented 2 years ago

What container images (e.g., Docker) were used by the run?

mr-c commented 2 years ago

For containers, capturing the digest (checksum) of the actual container that was run should be the minimum, along with the host OS and its version (previous Linux kernel versions have had math bugs) and CPU info (basically the contents of /proc/cpuinfo).
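As a rough illustration of the host details mentioned above, here is a minimal Python sketch that collects the OS name, kernel release and, on Linux, the raw contents of /proc/cpuinfo. The function name and the dictionary keys are hypothetical and are not part of the profile.

```python
import platform
from pathlib import Path


def host_info():
    """Collect basic host OS and CPU details (key names are illustrative only)."""
    info = {
        "os": platform.system(),           # e.g. "Linux"
        "os_release": platform.release(),  # kernel release on Linux
        "machine": platform.machine(),     # e.g. "x86_64"
    }
    cpuinfo = Path("/proc/cpuinfo")
    # /proc/cpuinfo only exists on Linux, so it is skipped on other systems.
    if cpuinfo.is_file():
        info["cpuinfo"] = cpuinfo.read_text(errors="replace")
    return info


if __name__ == "__main__":
    for key, value in host_info().items():
        # Truncate long values (such as cpuinfo) for display.
        print(f"{key}: {value[:60]!r}")
```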

ilveroluca commented 2 years ago

A File seems too restrictive.

One should be able to reference a container image in any remote repository. Also, it seems handy to be able to define what type of container image it is (e.g., Singularity, Docker, etc.).

Both these requirements could be satisfied by using a full URI, where the scheme is used to identify the image type. This approach is also used by Snakemake.

simleo commented 2 years ago

Thanks @ilveroluca. Following your link, it looks like Snakemake, in turn, accepts what's supported by Singularity. So the spec could say something like "values for image SHOULD be in the format accepted by Singularity, e.g. docker://quay.io/calico/node". Values that respect the SHOULD could then be used by tooling that wants to enable reproducibility, since it would be able to actually pull the image. More "informational" (non-pullable) URLs, e.g. of a web page that describes the image, would still be useful for traceability.
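To make the pullable vs. informational distinction concrete, here is a small Python sketch that splits an image URI into scheme and reference. The set of schemes treated as pullable is an assumption for illustration (Singularity also accepts plain http(s) URLs, which cannot be told apart from web pages by scheme alone), and both function names are hypothetical.

```python
from urllib.parse import urlsplit

# Assumption: schemes treated as unambiguously "pullable" in this sketch.
PULL_SCHEMES = {"docker", "library", "shub", "oras"}


def split_image_uri(uri):
    """Return (scheme, reference) for a Singularity-style image URI."""
    parts = urlsplit(uri)
    return parts.scheme, parts.netloc + parts.path


def classify(uri):
    """Label a URI as 'pullable' or 'informational' based on its scheme only."""
    scheme, _ = split_image_uri(uri)
    return "pullable" if scheme in PULL_SCHEMES else "informational"


if __name__ == "__main__":
    print(split_image_uri("docker://quay.io/calico/node"))
    print(classify("docker://quay.io/calico/node"))
```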

dgarijo commented 2 years ago

The registry where the container is (e.g., Dockerhub, GitHub, etc.) is quite important here as well. I propose capturing it (in case a file is not used, just the id in that registry)

simleo commented 2 years ago

> The registry where the container is (e.g., Dockerhub, GitHub, etc.) is quite important here as well. I propose capturing it (in case a file is not used, just the id in that registry)

I think the idea discussed yesterday was to capture, using separate properties:

simleo commented 1 year ago

Since the question is what container images were used by the run, the source entity should be CreateAction. I've updated the issue description, which listed SoftwareApplication instead.

As for the property used to link to the image, reusing image does not feel quite right, since it's meant for pictures. We could define a containerImage property in ro-terms and make it point to a ContainerImage entity (this pattern is not uncommon in Schema.org, e.g. contactPoint pointing to ContactPoint). The question then would be how to structure ContainerImage.

For the image type we could use additionalType, and define DockerImage and SIFImage (https://github.com/apptainer/sif) in ro-terms for the values for now.

For the registry we should define a custom property, which could be registry and take textual values. This is implicitly "docker.io" when not specified in docker pull. Singularity (both SingularityCE and Apptainer) seems to allow pulling from an arbitrary http(s) URL: in this case the ContainerImage should probably have a url property instead, listing the full image URL.

Referring to the previous comment, the problem with the "organization" bit is that it's not always an organization. Taking the docker pull scheme and the Docker Hub as reference, that field could represent a user, or be missing in the case of an official image. In practice, for official images it defaults to "library", so that docker pull debian is short for docker pull docker.io/library/debian:latest. So it's probably better not to have such a field and to consider this part of the image name instead, which is consistent with the docker pull syntax: docker pull [OPTIONS] NAME[:TAG|@DIGEST].
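The defaulting described above can be sketched in Python as follows. This is a simplified approximation of how docker pull expands short names, not Docker's actual implementation; the function name is hypothetical, and the host-detection heuristic (a first component containing "." or ":", or equal to "localhost") is stated here as an assumption.

```python
def normalize_docker_name(name):
    """Expand a short image name into registry/name:tag form (simplified)."""
    first, sep, rest = name.partition("/")
    # Assumption: the first path component is a registry host only if it
    # looks like one (contains "." or ":", or is "localhost").
    if sep and ("." in first or ":" in first or first == "localhost"):
        registry, remainder = first, rest
    else:
        registry, remainder = "docker.io", name
    # Official Docker Hub images live under the implicit "library" namespace.
    if registry == "docker.io" and "/" not in remainder:
        remainder = "library/" + remainder
    # Add the default tag when neither a tag nor a digest is given.
    last_component = remainder.rsplit("/", 1)[-1]
    if ":" not in last_component and "@" not in remainder:
        remainder += ":latest"
    return f"{registry}/{remainder}"


if __name__ == "__main__":
    for short in ("debian", "biocontainers/samtools:v1.9-4-deb_cv1"):
        print(short, "->", normalize_docker_name(short))
```

Note how "library" and "biocontainers" end up in the same position of the expanded reference, which is why treating that segment as part of the name is simpler than a dedicated "organization" field.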

For the image name we can use name, mapping to text like "debian", "biocontainers/samtools", etc. Note that the terminology is not always consistent in the Docker docs: e.g., what is referred to as "name" in the docker pull docs is called "repository" in the docker images docs.

The tag needs a new custom property that we can call tag, with textual values.

For the digest, we already have sha256 in ro-terms.

Here is a possible example:

{
    "@id": "#cb04c897-eb92-4c53-8a38-bcc1a16fd650",
    "@type": "CreateAction",
    "instrument": {"@id": "bam2fastq.cwl"},
    ...
    "containerImage": {"@id": "#samtools-image"}
},
{
    "@id": "#samtools-image",
    "@type": "ContainerImage",
    "additionalType": "DockerImage",
    "registry": "docker.io",
    "name": "biocontainers/samtools",
    "tag": "v1.9-4-deb_cv1",
    "sha256": "da61624fda230e94867c9429ca1112e1e77c24e500b52dfc84eaf2f5820b4a2a"
}

ilveroluca commented 1 year ago

While I think your proposal for ContainerImage will work in practice, I have some considerations.

The only thing that really identifies the image is the checksum. On the other hand, it's possible that images are mirrored in multiple locations, or that over time they migrate across repositories. Tags can also be reused (while this is not a best practice, it can happen).

Also, I question the value added by splitting the image URL into its components (i.e., registry, name, tag).

I would therefore consider defining a ContainerImage that: 1) uses a "simple" URL to reference image locations; and 2) allows referencing secondary image locations. An example might look like this:

{
    "@id": "#samtools-image",
    "@type": "ContainerImage",
    "additionalType": "DockerImage",
    "sha256": "da61624fda230e94867c9429ca1112e1e77c24e500b52dfc84eaf2f5820b4a2a",
    "mainUrl": "https://docker.io/biocontainers/samtools:v1.9-4-deb_cv1",
    "alternativeUrls": [ "https://quay.io/repository/...." ]
}

simleo commented 1 year ago

One problem with "https://docker.io/biocontainers/samtools:v1.9-4-deb_cv1" is that it does not represent a resource on the web: it leads to a "page not found" if entered in a browser, and you cannot docker pull it ("invalid reference format"). What you can docker pull is:

The separate fields would allow the consumer to build the preferred pull syntax easily by joining the relevant parts, and also to perform more articulate queries (e.g., all images from quay.io).
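A minimal Python sketch of both uses, assuming entities shaped like the ContainerImage example earlier in the thread. The helper names are hypothetical, and whether the stored sha256 value is usable in the NAME@DIGEST form is an assumption, so pulling by digest is opt-in here.

```python
def pull_reference(image, by_digest=False):
    """Join registry, name and tag (or digest) into a docker pull reference."""
    # The registry is implicitly docker.io when not specified.
    ref = f"{image.get('registry', 'docker.io')}/{image['name']}"
    # Assumption: the sha256 value is a digest accepted by NAME@DIGEST.
    if by_digest and "sha256" in image:
        return f"{ref}@sha256:{image['sha256']}"
    if "tag" in image:
        return f"{ref}:{image['tag']}"
    return ref


def images_from(graph, registry):
    """Select all ContainerImage entities hosted on the given registry."""
    return [
        entity for entity in graph
        if entity.get("@type") == "ContainerImage"
        and entity.get("registry", "docker.io") == registry
    ]


if __name__ == "__main__":
    samtools = {
        "@id": "#samtools-image",
        "@type": "ContainerImage",
        "registry": "docker.io",
        "name": "biocontainers/samtools",
        "tag": "v1.9-4-deb_cv1",
    }
    print("docker pull", pull_reference(samtools))
    print(len(images_from([samtools], "quay.io")), "image(s) from quay.io")
```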

That's for Docker images at least, since Singularity allows pulling by URL.

rsirvent commented 1 year ago

Here is a possible example:

{
    "@id": "#cb04c897-eb92-4c53-8a38-bcc1a16fd650",
    "@type": "CreateAction",
    "instrument": {"@id": "bam2fastq.cwl"},
    ...
    "containerImage": {"@id": "#samtools-image"}
},
{
    "@id": "#samtools-image",
    "@type": "ContainerImage",
    "additionalType": "DockerImage",
    "registry": "docker.io",
    "name": "biocontainers/samtools",
    "tag": "v1.9-4-deb_cv1",
    "sha256": "da61624fda230e94867c9429ca1112e1e77c24e500b52dfc84eaf2f5820b4a2a"
}

I'm more in favor of this approach, since it describes the image in more detail, so you get richer metadata that can be used later.

simleo commented 1 year ago

@stain any thoughts on this one?

simleo commented 1 year ago

As noted by Stian, "additionalType": {"@id": "https://w3id.org/ro/terms/workflow-run#DockerImage"} is more correct.
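For reference, here is a short Python sketch that assembles the ContainerImage entity from the earlier example with additionalType expressed as an IRI reference. The helper function is hypothetical; the property names and values are taken from the discussion above.

```python
import json

# IRI quoted in the comment above.
DOCKER_IMAGE = "https://w3id.org/ro/terms/workflow-run#DockerImage"


def container_image_entity(entity_id, registry, name, tag, sha256):
    """Build a ContainerImage entity with additionalType as an @id reference."""
    return {
        "@id": entity_id,
        "@type": "ContainerImage",
        "additionalType": {"@id": DOCKER_IMAGE},
        "registry": registry,
        "name": name,
        "tag": tag,
        "sha256": sha256,
    }


if __name__ == "__main__":
    entity = container_image_entity(
        "#samtools-image",
        "docker.io",
        "biocontainers/samtools",
        "v1.9-4-deb_cv1",
        "da61624fda230e94867c9429ca1112e1e77c24e500b52dfc84eaf2f5820b4a2a",
    )
    print(json.dumps(entity, indent=4))
```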