Improvements on "docker biocontainers" to bio.tools metadata sync

hmenager commented 10 months ago

(discussed with @mboudet today) There are a few flaws that need to be addressed in the CI process (as implemented in https://github.com/BioContainers/ci/blob/master/github-ci/src/biocontainersci/biotools.py) that updates the metadata in the RSEc each time a new pull request is merged into the biocontainers containers repository:

Unique biocontainers filenames

We need to generate unique filenames for the generated biocontainers metadata files, e.g. https://github.com/research-software-ecosystem/content/blob/master/data/fastqc/fastqc.biocontainers.yaml instead of data/fastqc/biocontainers.yaml. The new filename pattern is data/[bio.tools ID]/[biocontainers ID].biocontainers.yaml. This avoids collisions when multiple containers refer to the same software in bio.tools. With the current fixed filename, any new container wrapping a bio.tools entry that is already packaged in another container would end up replacing the contents of the previous file.
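A minimal sketch of the collision described above, assuming the paths are built from plain strings. The function names `old_metadata_path` and `new_metadata_path` are hypothetical and are not taken from biotools.py.

```python
from pathlib import PurePosixPath


def old_metadata_path(biotools_id: str) -> PurePosixPath:
    """Previous layout: one fixed filename per bio.tools entry."""
    return PurePosixPath("data") / biotools_id / "biocontainers.yaml"


def new_metadata_path(biotools_id: str, container_id: str) -> PurePosixPath:
    """Proposed layout: data/[bio.tools ID]/[biocontainers ID].biocontainers.yaml."""
    return PurePosixPath("data") / biotools_id / f"{container_id}.biocontainers.yaml"


# Two containers that both declare the bio.tools ID "cadd_phred".
containers = ["cadd", "cadd-scripts-with-envs"]
old_paths = {old_metadata_path("cadd_phred") for _ in containers}
new_paths = {new_metadata_path("cadd_phred", c) for c in containers}
print(len(old_paths), len(new_paths))  # 1 file before, 2 files after
```

With the old pattern both containers write to the same file; with the new one each container gets its own.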

Generate files locally

biocontainers metadata files should be generated, at least as an option, in a local copy of the git repository, instead of creating a pull request, for easier testing.

Batch files generation

It would be practical to enable generating/updating metadata files for all the containers available in the repository, instead of only one, crawling all Dockerfile files in a local checkout of BioContainers/containers, and generating/updating the *.biocontainers.yaml files of a local checkout of research-software-ecosystem/content.
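A rough sketch of the crawling step, assuming the local checkout of BioContainers/containers follows a [tool]/[version]/Dockerfile layout. Both that layout and the helper name `find_dockerfiles` are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterator, Tuple


def find_dockerfiles(containers_root: Path) -> Iterator[Tuple[str, str, Path]]:
    """Yield (tool, version, path) for every Dockerfile in the checkout."""
    for dockerfile in sorted(containers_root.rglob("Dockerfile")):
        parts = dockerfile.relative_to(containers_root).parts
        # Assumed layout: [tool]/[version]/Dockerfile; skip anything else.
        if len(parts) != 3:
            continue
        tool, version, _ = parts
        yield tool, version, dockerfile
```

A batch run could then loop over `find_dockerfiles(Path("containers"))` and write or update one *.biocontainers.yaml file per result in the local checkout of research-software-ecosystem/content.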

Review metadata mapping

Conduct an exhaustive metadata review, to check that all metadata (at least LABEL, FROM, MAINTAINER) are mapped to the yaml file.
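As a starting point for such a review, here is a simplified sketch that collects the FROM, MAINTAINER and LABEL instructions from a Dockerfile so they can be compared against the generated yaml. It is not the parser used by the CI, it only handles the key=value form of LABEL, and the sample Dockerfile content is invented.

```python
import shlex


def parse_dockerfile(text: str) -> dict:
    """Collect FROM, MAINTAINER and LABEL key/value pairs from Dockerfile text."""
    # Join backslash-continued lines so each instruction sits on one line.
    joined = text.replace("\\\n", " ")
    meta = {"from": [], "maintainer": None, "labels": {}}
    for raw in joined.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue
        instruction, rest = parts[0].upper(), parts[1].strip()
        if instruction == "FROM":
            meta["from"].append(rest)
        elif instruction == "MAINTAINER":
            meta["maintainer"] = rest
        elif instruction == "LABEL":
            # shlex handles quoted values such as key="some text".
            for token in shlex.split(rest):
                key, sep, value = token.partition("=")
                if sep:
                    meta["labels"][key] = value
    return meta


# Invented sample, for illustration only.
sample = (
    "FROM example/base:1.0\n"
    'LABEL software="fastqc" \\\n'
    '      about.summary="A quality control tool"\n'
    "MAINTAINER Jane Doe <jane@example.org>\n"
)
print(parse_dockerfile(sample))
```

Any key that shows up in this dictionary but not in the yaml file would be a candidate gap in the mapping.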

mboudet commented 9 months ago

Regarding the new 'filepath' (data/[bio.tools ID]/[biocontainers ID].biocontainers.yaml):

The 'bio.tools ID' is an optional part of the submitted dockerfile (as some tools do not have a biotool ID). How should we manage this situation?

(Currently, we already use the biotools.id in the path if available, else we default to the software name in the path of the provided dockerfile I believe).

mboudet commented 9 months ago

Also, regarding the 'biocontainer ID': what should we use? [Toolname][version] ?

(IE: diann:1.8.1_cv2 ? Or should we remove the _cv2, to make sure we update the tool yaml, and not create a new one?)

The cv1 / cv2 is linked to the 'biocontainer dockerfile version', and not the tool version itself (ex here). Should we have separate files?

As an example, with the 'cadd-with-script' PR, using the cadd biotool id (cadd_phred), we would have:

cadd_phred/cadd-scripts-with-envs_1.6.post1_cv1.yaml
cadd_phred/cadd-scripts-with-envs_1.6_cv1.yaml

Each update to the Dockerfile (for the same version of cadd), would add another file.

cadd_phred/cadd-scripts-with-envs_1.6.post1_cv2.yaml
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv1.yaml
cadd_phred/cadd-scripts-with-envs_1.6_cv1.yaml

And if we had a PR with cadd itself (instead of cadd-scripts-xxx), it would be

cadd_phred/cadd-scripts-with-envs_1.6.post1_cv2.yaml
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv1.yaml
cadd_phred/cadd-scripts-with-envs_1.6_cv1.yaml
cadd_phred/cadd_1.6.post1_cv1.yaml
cadd_phred/cadd_1.6_cv1.yaml

hmenager commented 2 months ago

Regarding the new 'filepath' (data/[bio.tools ID]/[biocontainers ID].biocontainers.yaml):

The 'bio.tools ID' is an optional part of the submitted dockerfile (as some tools do not have a biotool ID). How should we manage this situation?

(Currently, we already use the biotools.id in the path if available, else we default to the software name in the path of the provided dockerfile I believe).

So, the way the import works now (it didn't use to) is:

1) All containers are imported to imports/biocontainers/[biocontainers ID].biocontainers.yaml.
2) If they additionally have a biotools ID, they are also imported to data/[bio.tools ID]/[biocontainers ID].biocontainers.yaml.
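The two-step rule can be sketched as follows. This only illustrates the path logic and is not the actual import code; the function name `import_targets` is hypothetical.

```python
from pathlib import PurePosixPath
from typing import List, Optional


def import_targets(container_id: str, biotools_id: Optional[str] = None) -> List[PurePosixPath]:
    """Return every file that should be written for one container."""
    filename = f"{container_id}.biocontainers.yaml"
    # Step 1: every container is imported under imports/biocontainers/.
    targets = [PurePosixPath("imports/biocontainers") / filename]
    # Step 2: containers with a bio.tools ID are also imported under data/.
    if biotools_id:
        targets.append(PurePosixPath("data") / biotools_id / filename)
    return targets
```

A container without a bio.tools ID therefore still gets exactly one file, which answers the question about the optional ID.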

hmenager commented 2 months ago

Also, regarding the 'biocontainer ID': what should we use? [Toolname][version] ?

(IE: diann:1.8.1_cv2 ? Or should we remove the _cv2, to make sure we update the tool yaml, and not create a new one?)

The cv1 / cv2 is linked to the 'biocontainer dockerfile version', and not the tool version itself (ex here). Should we have separate files?

As an example, with the 'cadd-with-script' PR, using the cadd biotool id (cadd_phred), we would have:
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv1.yaml
cadd_phred/cadd-scripts-with-envs_1.6_cv1.yaml
Each update to the Dockerfile (for the same version of cadd), would add another file.
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv2.yaml
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv1.yaml
cadd_phred/cadd-scripts-with-envs_1.6_cv1.yaml
And if we had a PR with cadd itself (instead of cadd-scripts-xxx), it would be
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv2.yaml
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv1.yaml
cadd_phred/cadd-scripts-with-envs_1.6_cv1.yaml
cadd_phred/cadd_1.6.post1_cv1.yaml
cadd_phred/cadd_1.6_cv1.yaml

I would say that if cv2 replaces cv1 but keeps the same metadata and the same tool in the same version, we should use the same ID (e.g. cadd_phred/cadd_1.6.biocontainers.yaml).

mboudet commented 2 months ago

@hmenager To sum up:

In all cases, we add a file in imports/biocontainers/[biocontainers ID].biocontainers.yaml. If there is a biotool ID, we also add a file in data/[bio.tools ID]/[biocontainers ID].biocontainers.yaml.

Regarding the biocontainers ID, there are three ways we can do it:

1) Just the software name
2) Software name + software version
3) Software name + software version + Dockerfile version (usually cv1 or cv2)

The question is:

Do we need the history (either in the file itself, or as a separate version file)?
Or do we only need the last version?

GitHub already takes care of versioning, and all the different versions will be in https://github.com/BioContainers anyway. If we do need the history, 1) and 2) are probably going to look weird if the software metadata change.

It might be good to have a look at what exactly we want in terms of metadata content and formatting.
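To make the three ID options concrete, here is a small sketch that derives each candidate from a software name and a tag such as 1.8.1_cv2. It assumes the tag has the form [version] optionally followed by _cv[N], as in the examples in this thread. The function name `candidate_ids` and the dictionary keys are hypothetical.

```python
import re
from typing import Dict

# Assumed tag shape: [software version], optionally followed by _cv[N].
_TAG = re.compile(r"^(?P<version>.+?)(?:_(?P<cv>cv\d+))?$")


def candidate_ids(name: str, tag: str) -> Dict[str, str]:
    """Return the ID produced by each of the three options."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"cannot parse tag: {tag!r}")
    version, cv = match.group("version"), match.group("cv")
    name_version = f"{name}_{version}"
    return {
        "name": name,                                   # option 1
        "name_version": name_version,                   # option 2
        "name_version_cv": f"{name_version}_{cv}" if cv else name_version,  # option 3
    }


print(candidate_ids("cadd-scripts-with-envs", "1.6.post1_cv2"))
```

Only option 3 changes when the Dockerfile is updated for the same software version, which is what produces one extra file per cv bump in the cadd example.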

hmenager commented 1 month ago

For the biocontainers ID, we need just the software name, and we only need the last version! For the metadata, anything which is available and maintained (e.g. not in https://github.com/BioContainers/tools-metadata) is relevant. If it contains information about the software, or how it can be accessed with BioContainers, then it's valuable.

mboudet commented 1 month ago

Just as a reminder for myself: if we only need the last version, we need a way to skip the biotool part of the CI for some PRs, just in case someone makes a PR with an older version 🤔

(Since there are many ways of versioning that are difficult to parse.) Maybe just setting a label skip-biotool-pr.
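A possible shape for that check, assuming the CI already has the pull request's labels as a list of objects with a "name" field (as in a GitHub pull request payload). How the labels are fetched is left out, and the function name `should_sync_biotools` is hypothetical.

```python
from typing import Iterable, Mapping

# Label name proposed in this thread; not an existing convention.
SKIP_LABEL = "skip-biotool-pr"


def should_sync_biotools(pr_labels: Iterable[Mapping[str, str]]) -> bool:
    """Return False when the PR carries the skip label."""
    names = {label.get("name", "") for label in pr_labels}
    return SKIP_LABEL not in names
```

This avoids having to compare version strings: whoever submits or reviews an older version adds the label by hand.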
