Open hmenager opened 10 months ago
Regarding the new 'filepath' (data/[bio.tools ID]/[biocontainers ID].biocontainers.yaml):
The 'bio.tools ID' is an optional part of the submitted dockerfile (as some tools do not have a biotool ID). How should we manage this situation?
(Currently, we already use the biotools.id in the path if available, else we default to the software name in the path of the provided dockerfile I believe).
Also, regarding the 'biocontainer ID': what should we use? [Toolname][version]?
(E.g. diann:1.8.1_cv2? Or should we remove the _cv2, to make sure we update the tool yaml rather than create a new one?)
The cv1 / cv2 suffix is linked to the 'biocontainer dockerfile version', and not to the tool version itself (example here). Should we have separate files?
As an example, with the 'cadd-with-script' PR, using the cadd biotool id (cadd_phred), we would have:
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv1.yaml
cadd_phred/cadd-scripts-with-envs_1.6_cv1.yaml
Each update to the Dockerfile (for the same version of cadd), would add another file.
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv2.yaml
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv1.yaml
cadd_phred/cadd-scripts-with-envs_1.6_cv1.yaml
And if we had a PR with cadd itself (instead of cadd-scripts-xxx), it would be
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv2.yaml
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv1.yaml
cadd_phred/cadd-scripts-with-envs_1.6_cv1.yaml
cadd_phred/cadd_1.6.post1_cv1.yaml
cadd_phred/cadd_1.6_cv1.yaml
Regarding the new 'filepath' (data/[bio.tools ID]/[biocontainers ID].biocontainers.yaml):
The 'bio.tools ID' is an optional part of the submitted dockerfile (as some tools do not have a biotool ID). How should we manage this situation?
(Currently, we already use the biotools.id in the path if available, else we default to the software name in the path of the provided dockerfile I believe).
So, the way the import works now (it didn't use to) is:
1-all containers are imported to imports/biocontainers/[biocontainers ID].biocontainers.yaml
2-if additionally they have a biotools ID, they are also imported to data/[bio.tools ID]/[biocontainers ID].biocontainers.yaml
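The two import rules above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `metadata_paths` and its parameters are hypothetical and are not taken from `biotools.py`.

```python
from pathlib import PurePosixPath
from typing import List, Optional


def metadata_paths(biocontainers_id: str, biotools_id: Optional[str] = None) -> List[str]:
    """Return the repository-relative paths a container's metadata is written to.

    Hypothetical helper: every container goes to imports/biocontainers/,
    and containers with a bio.tools ID are also written under data/.
    """
    filename = f"{biocontainers_id}.biocontainers.yaml"
    # Rule 1: every container is imported here, with or without a bio.tools ID.
    paths = [PurePosixPath("imports", "biocontainers", filename)]
    # Rule 2: the data/ copy only exists when a bio.tools ID is available.
    if biotools_id:
        paths.append(PurePosixPath("data", biotools_id, filename))
    return [str(path) for path in paths]
```

For example, `metadata_paths("cadd", "cadd_phred")` yields both destinations, while `metadata_paths("diann")` yields only the `imports/biocontainers/` one.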
Also, regarding the 'biocontainer ID': what should we use? [Toolname][version] ?
(IE: diann:1.8.1_cv2 ? Or should we remove the _cv2, to make sure we update the tool yaml, and not create a new one?)
The cv1 / cv2 is linked to the 'biocontainer dockerfile version', and not the tool version itself (ex here). Should we have separate files?
As an example, with the 'cadd-with-script' PR, using the cadd biotool id (cadd_phred), we would have:
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv1.yaml
cadd_phred/cadd-scripts-with-envs_1.6_cv1.yaml
Each update to the Dockerfile (for the same version of cadd), would add another file.
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv2.yaml
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv1.yaml
cadd_phred/cadd-scripts-with-envs_1.6_cv1.yaml
And if we had a PR with cadd itself (instead of cadd-scripts-xxx), it would be
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv2.yaml
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv1.yaml
cadd_phred/cadd-scripts-with-envs_1.6_cv1.yaml
cadd_phred/cadd_1.6.post1_cv1.yaml
cadd_phred/cadd_1.6_cv1.yaml
I would say that if cv2 replaces cv1 but keeps the same metadata and the same tool in the same version, we should use the same ID (e.g. cadd_phred/cadd_1.6.biocontainers.yaml).
@hmenager To sum up:
In all cases, we add a file in imports/biocontainers/[biocontainers ID].biocontainers.yaml
If there is a biotool ID, we also add a file in data/[bio.tools ID]/[biocontainers ID].biocontainers.yaml
Regarding the biocontainers ID, there are three ways we can do it:
1) Just the software name
2) Software name + software version
3) Software name + software version + Dockerfile version (usually cv1 or cv2)
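To make the three options concrete, here is a minimal sketch that derives each candidate ID from a container name and tag. The helper names (`split_version`, `biocontainers_id`) and the `_` separator are assumptions chosen to match the example filenames in this thread. They are not part of the existing CI code.

```python
import re
from typing import Optional, Tuple

# Assumption: the Dockerfile revision is a trailing "_cv<N>" suffix on the tag.
_CV_SUFFIX = re.compile(r"_cv\d+$")


def split_version(tag: str) -> Tuple[str, Optional[str]]:
    """Split a tag such as '1.8.1_cv2' into ('1.8.1', 'cv2')."""
    match = _CV_SUFFIX.search(tag)
    if match is None:
        return tag, None
    return tag[: match.start()], match.group(0)[1:]


def biocontainers_id(name: str, tag: str, scheme: int) -> str:
    """Build the ID under scheme 1, 2 or 3 from the list above."""
    version, cv = split_version(tag)
    if scheme == 1:  # just the software name
        return name
    if scheme == 2:  # software name + software version
        return f"{name}_{version}"
    if scheme == 3:  # name + software version + Dockerfile version
        return f"{name}_{version}_{cv}" if cv else f"{name}_{version}"
    raise ValueError(f"unknown scheme: {scheme}")
```

With `diann` and the tag `1.8.1_cv2`, schemes 1 to 3 give `diann`, `diann_1.8.1` and `diann_1.8.1_cv2`. Only scheme 3 produces a new file for every Dockerfile revision.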
The question is:
GitHub already takes care of versioning, and all the different versions will be in https://github.com/BioContainers anyway. If we do need the history, 1) and 2) are probably going to look weird if the software metadata changes.
It might be good to have a look at exactly what we want in terms of metadata content and formatting.
For the biocontainers ID, we need just the software name, and we only need the latest version. For the metadata, anything which is available and maintained (e.g. not in https://github.com/BioContainers/tools-metadata) is relevant. If it contains information about the software, or how it can be accessed with BioContainers, then it's valuable.
Just as a reminder for myself: if we only need the latest version, we need a way to skip the biotool part of the CI for some PRs, just in case someone makes a PR with an older version 🤔
(There are many versioning schemes, and they are difficult to parse, so detecting an older version automatically is unreliable.) Maybe just setting a skip-biotool-pr label would do.
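A label-based opt-out could look roughly like the sketch below. It assumes the CI already has the pull request's labels as a plain list of strings. How they are fetched is left out, and the function name is hypothetical.

```python
from typing import Iterable

# Label name proposed in this thread.
SKIP_LABEL = "skip-biotool-pr"


def should_update_biotools(pr_labels: Iterable[str]) -> bool:
    """Return False when the PR carries the skip label, True otherwise."""
    normalized = {label.strip().lower() for label in pr_labels}
    return SKIP_LABEL not in normalized
```

This sidesteps version parsing entirely: whoever merges a PR for an older version adds the label by hand.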
(discussed with @mboudet today) There are a few flaws that need to be addressed in the CI process (as implemented in https://github.com/BioContainers/ci/blob/master/github-ci/src/biocontainersci/biotools.py) that updates the metadata in the RSEc each time a new pull request is merged on the biocontainers containers repository:
Unique biocontainers filenames
We need to generate unique filenames for the biocontainers metadata files generated, e.g. instead of data/fastqc/biocontainers.yaml, https://github.com/research-software-ecosystem/content/blob/master/data/fastqc/fastqc.biocontainers.yaml. Here, the new filename pattern is data/[bio.tools ID]/[biocontainers ID].biocontainers.yaml. This will avoid collisions in case multiple containers refer to the same software in bio.tools, in which case any new container wrapping a bio.tools entry already packaged in another container would end up replacing the contents of the previous file.
Generate files locally
Biocontainers metadata files should be generated, at least as an option, in a local copy of the git repository, instead of creating a pull request, for easier testing.
Batch files generation
It would be practical to enable generating/updating metadata files for all the containers available in the repository, instead of only one, by crawling all Dockerfile files in a local checkout of BioContainers/containers, and generating/updating the *.biocontainers.yaml files of a local checkout of research-software-ecosystem/content.
Review metadata mapping
Conduct an exhaustive metadata review, to check that all metadata (at least LABEL, FROM, MAINTAINER) are mapped to the yaml file.
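As a starting point for both the batch generation and the metadata review, the sketch below crawls a local checkout for `Dockerfile` files and collects the raw `LABEL`, `FROM` and `MAINTAINER` values. It uses only the standard library, and all function names are hypothetical. It deliberately stops before the mapping to YAML keys, since that mapping is what the review is meant to establish. The line-based parsing is a simplification and is not a full Dockerfile parser.

```python
import re
from pathlib import Path
from typing import Dict, Iterator, List

# Instructions the review should cover at a minimum.
INSTRUCTION = re.compile(r"^\s*(LABEL|FROM|MAINTAINER)\s+(.*)$", re.IGNORECASE)


def parse_dockerfile(text: str) -> Dict[str, List[str]]:
    """Collect the raw arguments of LABEL, FROM and MAINTAINER instructions."""
    found: Dict[str, List[str]] = {"LABEL": [], "FROM": [], "MAINTAINER": []}
    # Join backslash line continuations so multi-line LABELs become one line.
    joined = re.sub(r"\\\r?\n", " ", text)
    for line in joined.splitlines():
        match = INSTRUCTION.match(line)
        if match:
            found[match.group(1).upper()].append(match.group(2).strip())
    return found


def iter_dockerfiles(checkout: Path) -> Iterator[Path]:
    """Yield every file named 'Dockerfile' below the checkout, in sorted order."""
    yield from sorted(checkout.rglob("Dockerfile"))


def collect_metadata(checkout: Path) -> Dict[str, Dict[str, List[str]]]:
    """Map each Dockerfile's path (relative to the checkout) to its metadata."""
    return {
        path.relative_to(checkout).as_posix(): parse_dockerfile(
            path.read_text(encoding="utf-8", errors="replace")
        )
        for path in iter_dockerfiles(checkout)
    }
```

Running `collect_metadata(Path("containers"))` on a local clone of BioContainers/containers gives one entry per Dockerfile, which can then be compared against the generated `*.biocontainers.yaml` files without opening a pull request.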