BioContainers / specs

BioContainers specifications
http://biocontainers.pro
Apache License 2.0

BioTools and BioContainers integration. #84

Closed ypriverol closed 3 years ago

ypriverol commented 6 years ago

For each BioContainers / BioConda package we should annotate the bio.tools identifier as follows:

LABEL BIOTOOLS="https://bio.tools/comet"

This will be available in the metadata, and the BioContainers API would be able to retrieve this information from bio.tools.

If we have materials that are in TESS, we should have a system such as the one discussed in https://github.com/BioContainers/specs/issues/78:

LABEL TESS="https://www.ebi.ac.uk/training/online/course/phenomenal-accessing-metabolomics-workflows-galaxy"
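A minimal sketch of how a consumer could read such labels back out of a Dockerfile. The function name `read_labels`, the regular expression, and the `FROM` line in the example are assumptions made for illustration; this is not the BioContainers API, and it only handles simple one-pair `LABEL KEY="value"` lines.

```python
import re

# Matches simple single-pair instructions such as: LABEL KEY="value"
# Multi-pair or line-continued LABEL instructions are not handled here.
LABEL_RE = re.compile(r'^\s*LABEL\s+([\w.\-]+)\s*=\s*"?([^"]*?)"?\s*$', re.IGNORECASE)


def read_labels(dockerfile_text):
    """Return a dict of label key -> value found in the Dockerfile text."""
    labels = {}
    for line in dockerfile_text.splitlines():
        match = LABEL_RE.match(line)
        if match:
            key, value = match.groups()
            labels[key] = value
    return labels


# Illustrative Dockerfile; the base image is a placeholder.
example = '''FROM ubuntu:22.04
LABEL BIOTOOLS="https://bio.tools/comet"
LABEL TESS="https://www.ebi.ac.uk/training/online/course/phenomenal-accessing-metabolomics-workflows-galaxy"
'''

labels = read_labels(example)
print(labels["BIOTOOLS"])
```

For images that are already built, the same labels are also exposed by `docker inspect`, so parsing the Dockerfile is only one possible route.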

ypriverol commented 6 years ago

You could also link to other initiatives such as OMICtools (https://omictools.com):

LABEL OMICSTOOLS="https://omictools.com/comet-3-tool"

ypriverol commented 6 years ago

@mr-c Recommendations: http://label-schema.org/rc1/

osallou commented 6 years ago

To be in line with other labels, shouldn't it be lowercase? (LABEL biotools=...)

ypriverol commented 6 years ago

Should the identifiers be encoded in the Dockerfile/Conda recipes? I will list the pros and cons so that we can make a decision and move this forward.

Pros:

  1. Each recipe will be self-described. This means that anyone can take those recipes and find the external reference information.
  2. Developers and the community can help with the annotation in an easier, more straightforward way.

Cons:

  1. Injecting an external source of information into the technical recipe has corresponding disadvantages:
     1.1. Updates to the identifiers will trigger updates in our recipes, which is not good practice because this is not a technical change.
     1.2. Identifiers and external sources tend to change more often than other things, for two main reasons: a) resources change their identifier schema; b) resources disappear or new resources are added, so new identifiers need to be added.

  2. Most of the recipes will be post-annotated. That is, we will first have the recipe, and only after the corresponding PubMed or bio.tools entry has been created will we annotate our recipe.

I have been looking into other resources such as Bioconductor, which has a similar problem: they have the description of the package and then an annotation where they add the corresponding information, such as publications and external URLs. My recommendation, for now, is to have a central place in BioContainers where we can store this information persistently in GitHub, for example:

biocontainers_bioconda_id external_id

Here is a full example (https://github.com/BioContainers/biotools-bioconda-ids/blob/master/mapping.csv). When a package is created, we ask the contributor to add the package to this file and open a PR. This is similar to what we did at the very beginning with mulled. With this approach, we can even update this table without updating the recipe and the corresponding image. We can also call for contributors to update this metadata through PRs.
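As a rough sketch of how such a central table could be consumed, the snippet below loads a two-column mapping into a dictionary. The column order, the absence of a header row, the use of "null" for missing ids, and the sample rows (including the made-up package `examplepkg`) are all assumptions for illustration; the real mapping.csv may be laid out differently.

```python
import csv
import io


def load_mapping(csv_text):
    """Map each package id to its external id; empty or 'null' values become None."""
    mapping = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        package_id, external_id = row[0].strip(), row[1].strip()
        mapping[package_id] = None if external_id in ("", "null") else external_id
    return mapping


# Hypothetical sample rows: biocontainers_bioconda_id, external_id
sample = "abyss,biotools:abyss\nexamplepkg,null\n"

mapping = load_mapping(sample)
print(mapping["abyss"])
```

Because the table lives outside the recipes, a lookup like this can pick up corrected ids without rebuilding any image.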

Call for comments: @osallou @bgruening @prvst @BioContainers/contributors @johanneskoester @jmchilton

bgruening commented 6 years ago

Inject an external source of information into the technical recipe with the corresponding disadvantages on that: 1.1. Updates in the identifiers will trigger updates in our recipes, which actually is not a good practice because this is not a technical change.

I don't think we need to rebuild the package in such cases iff the consumers of such metadata rely on the yaml files inside the github repo.

1.2. Identifiers and external sources tend to change more often than other things, for two main reasons: a) resources change their identifier schema; b) resources disappear or new resources are added, so new identifiers need to be added.

I hope that this does not happen that often. The idea of an ID is that it stays stable, like a DOI.

Most of the recipes will be post-annotated. That means that we will have first the recipe and after a process of creation of pubmed or bio.tools, then we will annotate our recipe.

Why is this a con with regard to the question?

As I said, Bioconda will most likely start annotating tools with DOIs in the main meta.yaml file very soon, so we can jump on this and add bio.tools IDs as well.

osallou commented 6 years ago

bio.tools ids and others should indeed be fixed over time, with only a few exceptions. That is the goal of ids. Managing another file through separate PRs may lead to the file being updated only in a few cases, with people forgetting to do so.


hmenager commented 6 years ago

On IDs, I agree with @bgruening and @osallou: bio.tools identifiers are now persistent IDs. If they were not persistent, it would be better not to provide any ;)

As for the syntax through labels, maybe a distinction between links and ids would be nice, like

LABEL tool_id="bio.tools:comet" (or "https://bio.tools/comet" if URL is preferred)
LABEL training_link="https://www.ebi.ac.uk/training/online/course/phenomenal-accessing-metabolomics-workflows-galaxy"

The point is that I am not sure to what extent TESS commits to persistent URLs, for instance, whereas bio.tools surely does.

fjrmoreews commented 6 years ago

On IDs too: when an author puts a new Dockerfile in BioContainers:

If the tool is not referenced in bio.tools, do they need to create an entry with an ID in bio.tools before or after?

What happens if this is not done? Do we accept submissions without an ID? Do we need to add it for the user?

Cheers,

Francois


ypriverol commented 6 years ago

We have to define two things here:

1 - How we want to persist the bio.tools id. Different approaches can be implemented:
   a) hard-code the ids with some structure in the Bioconda/BioContainers recipes;
   b) add an extra file next to the recipe called identifiers.yml (or similar) where we can encode the identifiers;
   c) keep a separate file where we perform the match between both lists.

2 - What we do if the information is not available. We shouldn't force both open source communities to wait for bio.tools to release the tool. However, bio.tools could think of a way for us to request the creation of a tool entry on demand afterwards. Probably @hmenager has an idea of how to do this.

Regards ...

We should now focus the discussion on how to solve problem one: where to put the identifiers, and how, in both initiatives (BioContainers Dockerfiles and BioConda recipes).

bgruening commented 6 years ago

1a for me, and I don't care much about 2 for the moment. Time will tell if people will update their own recipes.

fjrmoreews commented 6 years ago

I think 1.a is better (simplicity).


joncison commented 6 years ago

Chaps, I'll just chip in on a couple of points:

  1. bio.tools toolIDs are indeed persistent. Great effort was made over the summer to clean up the supplied tool names on which they're based.
  2. With that said, there are still a few cases (that @hmenager and I are thinking about) which need to be cleaned up: these are cases where online services (e.g. in Galaxy instances) over standalone tools were registered with tool IDs that look like "mytool_mygalaxy". In cases where the tool exists in its own right, these need cleaning up such that an entry with ID "mytool" gets registered (with links to all the places one can access it). For purposes of mapping / annotating BioContainers, just bear this in mind.
  3. If a tool has been containerised and listed in BioContainers, then we need that tool to be registered in bio.tools. Fortunately, as soon as we complete the migration to biotoolsSchema 3.0.0 (https://github.com/bio-tools/biotoolsSchema/tree/master/versions/biotools-3.0.0-rc) it will be very easy to provide such coverage, as we'll only need tool name, description and homepage URL.
  4. To achieve 3, we need two things:
     4.1 coverage of existing containerised tools: this should be easy once we complete the BioContainers-to-bio.tools mapping and see what's missing (a dump of appropriate metadata from BioContainers will help);
     4.2 some sustainable mechanism for alerting bio.tools when a new tool (i.e. one which cannot be mapped) is added to BioContainers, ideally creating the boilerplate entry in bio.tools. This needs a bit of thought.
  5. You could also consider the short (CURIE) form of the bio.tools toolIDs, e.g. biotools:signalp, if you want something more concise than the URL.
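To make the relation between the short (CURIE) form and the URL form concrete, here is a small sketch that converts between the two. The helper names and the assumption that every tool page lives directly under `https://bio.tools/` are illustrative, not something bio.tools guarantees in this thread.

```python
BIOTOOLS_PREFIX = "biotools:"
BIOTOOLS_BASE_URL = "https://bio.tools/"


def curie_to_url(curie):
    """Expand a compact id such as 'biotools:signalp' into a bio.tools URL."""
    if not curie.startswith(BIOTOOLS_PREFIX):
        raise ValueError(f"not a biotools CURIE: {curie!r}")
    return BIOTOOLS_BASE_URL + curie[len(BIOTOOLS_PREFIX):]


def url_to_curie(url):
    """Compress a bio.tools URL (http or https) into its compact form."""
    for base in ("https://bio.tools/", "http://bio.tools/"):
        if url.startswith(base):
            return BIOTOOLS_PREFIX + url[len(base):].strip("/")
    raise ValueError(f"not a bio.tools URL: {url!r}")


print(curie_to_url("biotools:signalp"))
print(url_to_curie("http://bio.tools/abyss"))
```

Storing only the compact form and expanding it on demand keeps the annotation stable even if the URL scheme changes.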

All for now - happy hacking! :)

ypriverol commented 6 years ago

Hi all:

As we all agree, we go for option 1a: each recipe will be self-described. This means that anyone can take those recipes and find the external reference information. That is, we will put the bio.tools id inside each recipe in Bioconda / BioContainers. Please let me know with a +1 on this comment if you agree. @osallou @bgruening @prvst @BioContainers/contributors @johanneskoester @jmchilton @joncison @fjrmoreews

osallou commented 6 years ago

+1


prvst commented 6 years ago

@ypriverol +1

ypriverol commented 6 years ago

Thanks everyone for your votes on the previous comment. We have decided to encode the bio.tools identifiers inside the Conda recipe and the BioContainers Dockerfile. Here are the different options for encoding them:

Dockerfile Recipe

LABEL extra.identifier = biotools:abyss
LABEL extra.identifier =  doi:10.1021/ac303239g
LABEL extra.identifier = http://bio.tools/abyss
LABEL extra.identifier = https://pubs.acs.org/doi/10.1021/ac303239g

BioConda Recipe

extra:
  identifiers:
    - biotools:abyss
    - doi:10.1021/ac303239g
    - pmid:23448308

extra:
  identifiers:
    biotools:
      - http://bio.tools/abyss
      - https://pubs.acs.org/doi/10.1021/ac303239g

It is important to define how this will be represented on both sides. I have linked this comment to an issue in the Bioconda community: https://github.com/bioconda/bioconda-recipes/issues/7699.

joncison commented 6 years ago

Would it hurt to give both?

extra:
  identifiers:
     biotools:
         - biotools:abyss
         - https://bio.tools/abyss
         ...

ypriverol commented 6 years ago

It will not hurt, but we should standardize this so that readers can read the files and interpret them. One option could be:

extra: 
   identifiers: 
      - biotools:abyss
      - http://bio.tools/abyss

With this approach, people will know that you can have both compact identifiers and complete URLs.

osallou commented 6 years ago

An id is indeed better than a URL, which may change over time...

osallou commented 6 years ago

LABEL keys must be unique, however; you cannot define multiple ones like:

LABEL extra.identifier = biotools:abyss
LABEL extra.identifier =  doi:10.1021/ac303239g

so it should be like:

LABEL extra.identifier.biotools=abyss
LABEL extra.identifier.doi=10.1021/ac303239g

joncison commented 6 years ago

In which case you could (if you want both flavours), I guess, have:

LABEL extra.identifier.biotools=abyss
LABEL extra.identifier.biotoolsurl=https://bio.tools/abyss

ypriverol commented 6 years ago

@osallou In a Dockerfile you can define a label multiple times, as far as I know. So this is allowed:

LABEL extra.identifier = biotools:abyss
LABEL extra.identifier =  doi:10.1021/ac303239g

This only says that we have two identifiers for the tool. We can explore other options, like comma-separated values:

LABEL extra.identifier = biotools:abyss, doi:10.1021/ac303239g

@joncison

I don't like encoding the identifier's domain in the label name of the property, because this opens the door to many errors.

osallou commented 6 years ago

A label is a key-value pair, stored as a string. You can specify multiple labels for an object, but each key-value pair must be unique within an object. If the same key is given multiple values, the most-recently-written value overwrites all previous values. ( https://docs.docker.com/config/labels-custom-metadata/)
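The overwrite rule quoted above can be modelled in a few lines. This is only a sketch of the documented behaviour, not Docker itself: the function `effective_labels` is a hypothetical helper that applies key/value pairs in order, the way repeated `LABEL` instructions with the same key would be resolved.

```python
def effective_labels(pairs):
    """Collapse (key, value) pairs in order; later values replace earlier ones."""
    labels = {}
    for key, value in pairs:
        labels[key] = value
    return labels


# Same key twice: only the most recently written value survives.
repeated = [
    ("extra.identifier", "biotools:abyss"),
    ("extra.identifier", "doi:10.1021/ac303239g"),
]

# One key per identifier domain: both values are kept.
distinct = [
    ("extra.identifier.biotools", "abyss"),
    ("extra.identifier.doi", "10.1021/ac303239g"),
]

print(effective_labels(repeated))
print(effective_labels(distinct))
```

With the repeated key, the bio.tools id is silently lost, which is why a distinct key per identifier domain is needed if both ids must be retrievable.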


ypriverol commented 6 years ago

Thanks @osallou for sharing this. I have been reading the documentation, and it looks like the option for Dockerfiles is the following:

LABEL extra.identifier.biotools=abyss
LABEL extra.identifier.doi=10.1021/ac303239g

bgruening commented 6 years ago

I like the:

extra: 
   identifiers: 
      - biotools:abyss
      - doi:10.1021/ac303239g
      - http://bio.tools/abyss

ypriverol commented 6 years ago

Hi all, the final decision for the external identifiers is the following:

The Bioconda recipe would look like:

extra: 
   identifiers: 
      - biotools:abyss
      - doi:10.1021/ac303239g
      - http://bio.tools/abyss

and the Dockerfile would look like the following:

LABEL extra.identifier.biotools=abyss
LABEL extra.identifier.doi=10.1021/ac303239g

Thanks to everyone for their contribution to this discussion.
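As a sketch of how the two agreed representations relate, the function below turns a Bioconda-style `identifiers` list into the corresponding Dockerfile `LABEL` lines. The function name and the policy choices (skipping full URLs, keeping only the first identifier per prefix because label keys must be unique) are assumptions for illustration, not part of the decision above.

```python
def identifiers_to_labels(identifiers):
    """Convert 'prefix:value' identifiers into LABEL extra.identifier.<prefix>=<value> lines."""
    lines = []
    seen_prefixes = set()
    for identifier in identifiers:
        if identifier.startswith(("http://", "https://")):
            continue  # full URLs carry no prefix to use as a label key
        prefix, _, value = identifier.partition(":")
        if not value or prefix in seen_prefixes:
            continue  # malformed entry, or key already used (keys must be unique)
        seen_prefixes.add(prefix)
        lines.append(f"LABEL extra.identifier.{prefix}={value}")
    return lines


recipe_identifiers = [
    "biotools:abyss",
    "doi:10.1021/ac303239g",
    "http://bio.tools/abyss",
]

for line in identifiers_to_labels(recipe_identifiers):
    print(line)
```

Run on the recipe example above, this prints the two `LABEL` lines from the Dockerfile example; the URL entry is dropped because it has no compact prefix.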

dansondergaard commented 6 years ago

Thank you for the talk last Monday! I don't know where this discussion moved to, so let me know if you want me to repost this somewhere else.

The ELIXIR Aarhus team has merged our own mapping with the mapping linked to in the beginning of this issue. It is shared here:

https://docs.google.com/spreadsheets/d/1kSBnt6CKG53mqsltTzA-gCFdEuBSqf-UQwQoukD3lwM/edit?usp=sharing

As @joncison requested, the mapping also contains entries for tools which have a Bioconda package, but not an entry in bio.tools.

You guys should be able to edit the mapping, so we can collaborate on it. Or do you want to do it in some other way? @ypriverol, @hmenager?

ypriverol commented 6 years ago

@dansondergaard :

As we agreed, I opened a PR with the Bioconda group to agree with them on the structure: https://github.com/bioconda/bioconda-recipes/pull/7940. The PR is now under consideration, and I hope we can reach an agreement and get the PR accepted by the end of the week.

About the mapping list: first, thanks a lot for this great work. @hmenager and I have created a Git repo (https://github.com/BioContainers/biotools-bioconda-ids) with the list of containers and the corresponding tool in bio.tools, if available.

These are the files:

My idea is to build a simple script, once the BioConda team accepts the structure I have proposed, to annotate all the tools in the matching list. I already have the code in place for that.

What do you think, @dansondergaard?

dansondergaard commented 6 years ago

@ypriverol Sounds good. I know about the repo (since we used it to merge your mapping with ours to obtain a more complete mapping).

However, we'd like to keep working on the mapping, possibly in collaboration with you guys, which I think may be easier to do in Google Docs (otherwise there'll probably be a lot of merge conflicts). But if you prefer to collaborate via Git that's fine too :-)

ypriverol commented 6 years ago

Hi @dansondergaard, it would be great if everything were on GitHub. For example, yesterday Dimitri opened a PR with some updates after manually curating entries against the bio.tools website. We can merge both lists here on GitHub and add an extra column, verification_status. If you feel comfortable with this idea, please make a PR and I will accept it.

dansondergaard commented 6 years ago

Working on GitHub is fine. We'll attempt a PR soon.

Just another question. In your mapping, what does "null" mean? Does it mean:

  1. that you weren't able to automatically map the bioconda package name to a bio.tools identifier?
  2. that you manually verified that the bioconda package does not have a corresponding entry in bio.tools?
  3. that either (1) or (2) is the case?

Our worry is that we're going to do a lot of extra manual work checking "null" entries, if you guys already did it. If we could distinguish "verified not in bio.tools" and "did not map automatically", @joncison would also be able to create the missing bio.tools entries.
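One way to make that distinction explicit is a three-valued status instead of a bare "null". The sketch below is hypothetical: the enum values, the `classify` helper and the `manually_checked` flag are illustrative names, not the actual contents of the mapping or of a verification_status column.

```python
from enum import Enum


class MappingStatus(Enum):
    MAPPED = "mapped"                    # a bio.tools id was found
    VERIFIED_ABSENT = "verified_absent"  # manually checked: no bio.tools entry exists
    UNCHECKED = "unchecked"              # automatic mapping failed, nobody has checked


def classify(external_id, manually_checked):
    """Derive a status from the mapped id and whether a curator has looked at it."""
    if external_id not in (None, "", "null"):
        return MappingStatus.MAPPED
    if manually_checked:
        return MappingStatus.VERIFIED_ABSENT
    return MappingStatus.UNCHECKED


print(classify("biotools:abyss", manually_checked=False).value)
print(classify("null", manually_checked=True).value)
print(classify("null", manually_checked=False).value)
```

Only the verified-absent rows would then need new bio.tools entries, while the unchecked rows are the ones that still require manual curation.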

joncison commented 6 years ago

yup ... above would make life much easier. @hansioan and I can help out with 1. and 2. above: heads-up Hans, there will be a curation task in due course, although we'll be expecting to get at least some boilerplate metadata from the BioContainers side.