Closed simleo closed 1 year ago
For containers, capturing the digest (checksum) of the actual container run should be the minimum, along with the host OS and its version (previous Linux kernel versions have had math bugs) and CPU info (basically the contents of /proc/cpuinfo).
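A minimal sketch of how a tool might gather the host details mentioned above, using only the Python standard library. The function name and the dictionary keys are hypothetical and not part of any proposed vocabulary; the digest of the container itself would have to come from the container runtime and is not covered here.

```python
import platform
from pathlib import Path


def capture_host_info(cpuinfo_path="/proc/cpuinfo"):
    """Collect basic host details to record alongside a container run.

    The key names below are illustrative only (an assumption of this sketch).
    """
    uname = platform.uname()
    info = {
        "os": uname.system,
        "kernel_release": uname.release,
        "kernel_version": uname.version,
        "machine": uname.machine,
    }
    cpuinfo = Path(cpuinfo_path)
    # /proc/cpuinfo only exists on Linux; elsewhere the field is left as None.
    info["cpuinfo"] = (
        cpuinfo.read_text(errors="replace") if cpuinfo.is_file() else None
    )
    return info
```

Calling capture_host_info() on a Linux host returns the raw /proc/cpuinfo text under "cpuinfo"; on other systems that entry is None.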
A File seems too restrictive.
One should be able to reference a container image in any remote repository. Also, it seems handy to be able to define what type of container image it is (e.g., Singularity, Docker, etc.).
Both these requirements could be satisfied by using a full URI, where the scheme is used to identify the image type. This approach is also used by Snakemake.
Thanks @ilveroluca. Following your link, it looks like Snakemake, in turn, accepts what's supported by Singularity. So the spec could say something like "values for image SHOULD be in the format accepted by Singularity, e.g. docker://quay.io/calico/node". Formats that respect the SHOULD could then be used by tooling that wants to enable reproducibility, with the ability to actually pull the image. More "informational" (non-pullable) URLs, e.g. of a web page that describes the image, would still be useful for traceability.
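As a rough illustration of the URI idea, a consumer could split the value into its scheme (the image type) and the part that follows. The function names are hypothetical, and treating only the docker scheme as mappable to docker pull is an assumption of this sketch rather than something the thread settles.

```python
def split_image_uri(uri):
    """Split a Singularity-style image URI into (scheme, reference)."""
    scheme, sep, reference = uri.partition("://")
    if not sep or not scheme or not reference:
        raise ValueError(f"not a scheme://reference URI: {uri!r}")
    return scheme.lower(), reference


def docker_pull_reference(uri):
    """Return the argument for `docker pull`, or None for other schemes.

    Assumption: only docker:// URIs map directly to a docker pull reference.
    """
    scheme, reference = split_image_uri(uri)
    return reference if scheme == "docker" else None
```

For the example from the comment, docker_pull_reference("docker://quay.io/calico/node") yields quay.io/calico/node, while an https URL of a descriptive web page yields None and stays purely informational.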
The registry where the container is (e.g., Dockerhub, GitHub, etc.) is quite important here as well. I propose capturing it (in case a file is not used, just the id in that registry)
I think the idea discussed yesterday was to capture, using separate properties:
JenkinsService, GithubService, TravisService)
REGISTRY/ORG/IMAGE:TAG scheme used by docker pull. E.g. "quay.io" rather than "https://quay.io/". An alternative is pointing to more articulate objects that would have a generic URL pointing to a descriptive web page and a specialized property for mapping to the appropriate field in the image pull command.
"sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855". Like for Registry, though, how to map to the pull syntax (e.g. docker pull ubuntu@sha256:82becede498899ec668628e7cb0ad87b6e1c371cb8a1e597d83a47fac21d6af3) should be made straightforward.
Since the question is what container images were used by the run, the source entity should be CreateAction. I've updated the issue description, which listed SoftwareApplication instead.
As for the property used to link to the image, reusing image does not feel quite right, since it's meant for pictures. We could define a containerImage property in ro-terms and make it point to a ContainerImage entity (this pattern is not uncommon in Schema.org, e.g. contactPoint pointing to ContactPoint). The question then would be how to structure ContainerImage.
For the image type we could use additionalType, and define DockerImage and SIFImage (https://github.com/apptainer/sif) in ro-terms as the values for now.
For the registry we should define a custom property, which could be registry and take textual values. This is implicitly "docker.io" when not specified in docker pull. Singularity (both SingularityCE and Apptainer) seems to allow pulling from an arbitrary http(s) URL: in this case the ContainerImage should probably have a url property instead, listing the full image URL.
Referring to the previous comment, the problem with the "organization" bit is that it's not always an organization. Keeping as reference the docker pull scheme and the Docker Hub, that field could represent a user, or be missing in the case of an official image. In practice, for official images it defaults to "library", so that docker pull debian is short for docker pull docker.io/library/debian:latest. So it's probably better not to have such a field and consider this part of the image name instead, which is consistent with the docker pull [OPTIONS] NAME[:TAG|@DIGEST] syntax.
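The defaults described above can be sketched as a small normalization routine. This is an approximation written for illustration, not Docker's own parser: the function name is hypothetical, and the rule used to decide whether the first path component is a registry (it contains a dot or a colon, or is "localhost") is an assumption of the sketch.

```python
def normalize_docker_reference(ref):
    """Expand a short `docker pull` NAME[:TAG|@DIGEST] to a fully qualified form."""
    # Split off a digest first, then look for a tag.
    name, _, digest = ref.partition("@")
    # A ':' after the last '/' separates the tag; an earlier ':' is a registry port.
    last_slash = name.rfind("/")
    colon = name.rfind(":")
    tag = None
    if colon > last_slash:
        name, tag = name[:colon], name[colon + 1:]
    first, sep, rest = name.partition("/")
    # Assumption: a first component with '.' or ':' (or "localhost") is a registry.
    if sep and ("." in first or ":" in first or first == "localhost"):
        registry, path = first, rest
    else:
        registry, path = "docker.io", name
    # Official Docker Hub images live under the implicit "library" namespace.
    if registry == "docker.io" and "/" not in path:
        path = "library/" + path
    if tag is None and not digest:
        tag = "latest"
    result = f"{registry}/{path}"
    if tag:
        result += f":{tag}"
    if digest:
        result += f"@{digest}"
    return result
```

With this sketch, "debian" expands to "docker.io/library/debian:latest", which matches the expansion given in the comment, and the "library" part ends up inside the image name rather than in a separate organization field.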
For the image name we can use name, mapping to text like "debian", "biocontainer/samtools", etc. Note that the terminology is not always consistent in the Docker docs: e.g., what is referred to as "name" in the docker pull docs is called "repository" in the docker images docs.
The tag needs a new custom property that we can call tag, with textual values.
For the digest, we already have sha256 in ro-terms.
Here is a possible example:
{
    "@id": "#cb04c897-eb92-4c53-8a38-bcc1a16fd650",
    "@type": "CreateAction",
    "instrument": {"@id": "bam2fastq.cwl"},
    ...
    "containerImage": {"@id": "#samtools-image"}
},
{
    "@id": "#samtools-image",
    "@type": "ContainerImage",
    "additionalType": "DockerImage",
    "registry": "docker.io",
    "name": "biocontainers/samtools",
    "tag": "v1.9-4-deb_cv1",
    "sha256": "da61624fda230e94867c9429ca1112e1e77c24e500b52dfc84eaf2f5820b4a2a"
}
While I think your proposal for ContainerImage will work in practice, I have some considerations.
The only thing that really identifies the image is the checksum. On the other hand, it's possible that images are mirrored in multiple locations, or that over time they migrate across repositories. Tags can also be reused (while this is not a best practice, it can happen).
Also, I question the value added by splitting the image URL into its components (i.e., registry, name, tag).
I would therefore consider defining a ContainerImage that: 1) uses a "simple" URL to reference image locations; and 2) allows referencing secondary image locations. An example might look like this:
{
    "@id": "#samtools-image",
    "@type": "ContainerImage",
    "additionalType": "DockerImage",
    "sha256": "da61624fda230e94867c9429ca1112e1e77c24e500b52dfc84eaf2f5820b4a2a",
    "mainUrl": "https://docker.io/biocontainers/samtools:v1.9-4-deb_cv1",
    "alternativeUrls": [ "https://quay.io/repository/...." ]
}
One problem with "https://docker.io/biocontainers/samtools:v1.9-4-deb_cv1" is that it does not represent a resource on the web: it leads to a "page not found" if entered in a browser, and you cannot docker pull it ("invalid reference format"). What you can docker pull is:
docker.io/biocontainers/samtools:v1.9-4-deb_cv1
docker.io/biocontainers/samtools@sha256:da61624fda230e94867c9429ca1112e1e77c24e500b52dfc84eaf2f5820b4a2a
The separate fields would allow the consumer to build the preferred pull syntax easily by joining the relevant parts, and also to perform more articulate queries (e.g., all images from quay.io).
That's for Docker images at least, since Singularity allows pulling by URL.
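To make the "joining the relevant parts" argument concrete, here is a hedged sketch that works on a ContainerImage entity shaped like the proposed example (registry, name, tag, sha256). Both function names are hypothetical, and the property names are still under discussion in this thread.

```python
def pull_reference(image, prefer_digest=True):
    """Build a `docker pull` argument from a ContainerImage-like dict."""
    # Per the thread, the registry is implicitly docker.io when not given.
    base = f"{image.get('registry', 'docker.io')}/{image['name']}"
    digest = image.get("sha256")
    tag = image.get("tag")
    if prefer_digest and digest:
        return f"{base}@sha256:{digest}"
    if tag:
        return f"{base}:{tag}"
    return base


def images_from_registry(entities, registry):
    """Select ContainerImage entities hosted on the given registry."""
    return [
        e for e in entities
        if e.get("@type") == "ContainerImage"
        and e.get("registry", "docker.io") == registry
    ]
```

Applied to the samtools entity from the example, pull_reference produces the digest form listed above, and passing prefer_digest=False produces the tag form; images_from_registry(entities, "quay.io") is the kind of query the separate registry field makes trivial.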
Here is a possible example:
{
    "@id": "#cb04c897-eb92-4c53-8a38-bcc1a16fd650",
    "@type": "CreateAction",
    "instrument": {"@id": "bam2fastq.cwl"},
    ...
    "containerImage": {"@id": "#samtools-image"}
},
{
    "@id": "#samtools-image",
    "@type": "ContainerImage",
    "additionalType": "DockerImage",
    "registry": "docker.io",
    "name": "biocontainers/samtools",
    "tag": "v1.9-4-deb_cv1",
    "sha256": "da61624fda230e94867c9429ca1112e1e77c24e500b52dfc84eaf2f5820b4a2a"
}
I'm more in favor of this approach, since it describes the image in more detail, so you get richer metadata that can be used later.
@stain any thoughts on this one?
As noted by Stian, "additionalType": {"@id": "https://w3id.org/ro/terms/workflow-run#DockerImage"} is more correct.
What container images (e.g., Docker) were used by the run?
File if the image is a tarball from docker save