Question on use of OTU related terms

GenomicsStandardsConsortium / mixs

“Minimum Information about any (X) Sequence” (MIxS) specification

https://w3id.org/mixs

Creative Commons Zero v1.0 Universal


Question on use of OTU related terms #603

Open LynnDelgat opened 1 year ago

LynnDelgat commented 1 year ago

When trying to figure out how to document metabarcoding data, I came across some things that were not entirely clear to me. Any insight into this question is much appreciated, and perhaps the term definitions could be clarified if needed.

The standard contains 3 different fields related to OTUs/clustering:

otu_class_appr: Cutoffs and approach used when clustering “species-level” OTUs.
otu_seq_comp_appr: Tool and thresholds used to compare sequences when computing "species-level" OTUs
otu_db: Reference database (i.e. sequences not generated as part of the current study) used to cluster new genomes in "species-level" OTUs, if any

I am not sure how otu_seq_comp_appr differs from otu_class_appr. Both terms seem to be about the cutoffs and approach used to cluster OTUs (assuming approach = tool, thresholds = cutoffs, and clustering OTUs = computing OTUs). The only real difference between the 2 definitions seems to be the addition of "used to compare sequences" for otu_seq_comp_appr, but it is not clear to me what is meant by this.

Does it mean comparing reads to an external reference when clustering? In that case this term would be used for (closed/open-)reference OTU clustering, while otu_class_appr would be used for de novo (and also open-reference) OTU clustering. If so, would "Tool and thresholds used to compare reads to a reference database when computing "species-level" OTUs" be a good definition for this term (perhaps also adding "de novo" to the otu_class_appr definition)? And if this is the case, I assume otu_seq_comp_appr should always be used together with a field specifying the reference db? Could the definition of otu_db then be changed from "genomes" to "reads", or something else more broadly applicable?

Or alternatively, is otu_seq_comp_appr meant to define the tool and thresholds used to taxonomically annotate sequences? If so, should the definition be updated to something like "Tool and thresholds used to compare sequences when taxonomically annotating" or "Tool and thresholds used for taxonomic annotation" (which could then apply to both OTUs and ASVs)?

Also, these 3 terms are found in the MIUVIG checklist; it would make sense to me to add them to the MIMARKSSurvey checklist (and perhaps MIMARKSSpecimen?) as well.

only1chunts commented 1 year ago

Thank you for your suggestions, I am away at the moment, but someone will add some labels to this ticket to ensure we discuss and resolve this query at our next monthly call (you are also welcome to attend, I can share details later). The resolution will be added here.


LynnDelgat commented 1 year ago

@only1chunts Thank you. I noticed no labels were added yet, but I don't know when the next monthly call is? (would be happy to attend if I can)

only1chunts commented 1 year ago

Regularly scheduled calls take place on the 4th TUESDAY of each month, at 8AM PT/11AM ET/4PM BST/5PM CET. (the next one is due 26th Sept). We send out notifications of meetings (including zoom link) to the CIG google group, you can join the group here: https://groups.google.com/u/2/g/gensc-cig

only1chunts commented 1 year ago

FAO CIG members (inc @lschriml @ramonawalls @mslarae13) :

These are the current details of the 3 terms in question:

Structured comment name	Item (rdfs:label)	Definition	Expected value	Value syntax	Example	MIXS ID
otu_class_appr	OTU classification approach	Cutoffs and approach used when clustering “species-level” OTUs. Note that results from standard 95% ANI / 85% AF clustering should be provided alongside OTUS defined from another set of thresholds, even if the latter are the ones primarily used during the analysis	cutoffs and method used	{ANI cutoff};{AF cutoff};{clustering method}	95% ANI;85% AF; greedy incremental clustering	MIXS:0000085
otu_seq_comp_appr	OTU sequence comparison approach	Tool and thresholds used to compare sequences when computing "species-level" OTUs	software name, version and relevant parameters	{software};{version};{parameters}	blastn;2.6.0+;e-value cutoff: 0.001	MIXS:0000086
otu_db	OTU database	Reference database (i.e. sequences not generated as part of the current study) used to cluster new genomes in "species-level" OTUs, if any	database and version	{database};{version}	NCBI Viral RefSeq;83	MIXS:0000087

Possible actions:

1 - to align with other terms we should consider changing the name to include "method" instead of "approach"
2 - definitions need to be more generic (and understandable to everyone) and suitable for use in other checklists
3 - where acronyms are used they should be spelt out (what is ANI or AF?)
4 - consider which other checklists should include these as expected or conditional mandatory (MIMARKS?)

Just out of curiosity, @turbomam are you able to provide some numbers of how many times each of these 3 terms have been used in the BioSamples database please?

only1chunts commented 1 year ago

Notes from CIG call 26 Sep 2023

LinkML can include aliases, so changing a name or adding an alternative name is possible going forward
One of the originators was Simon Roux at LMB, so we should loop them in to get input on the matter

turbomam commented 1 year ago

I don't see any appearances of

otu_class_appr
otu_seq_comp_appr
otu_db

in the harmonized_names and attribute_names from NCBI Biosample, downloaded July 2023.
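For anyone wanting to repeat this kind of check, here is a minimal sketch of how the three terms could be counted in a BioSample XML dump. It assumes each attribute is an `<Attribute>` element carrying `attribute_name` and (optionally) `harmonized_name` XML attributes inside a `<BioSample>` element; the inline sample record and the `count_terms` helper are made up for illustration and are not the query actually used above.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

TERMS = {"otu_class_appr", "otu_seq_comp_appr", "otu_db"}

def count_terms(xml_source) -> Counter:
    """Count attributes whose attribute_name or harmonized_name is one of TERMS."""
    counts = Counter({term: 0 for term in TERMS})
    # iterparse streams the file, so a large dump never has to fit in memory.
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "Attribute":
            names = {elem.get("attribute_name"), elem.get("harmonized_name")}
            for term in TERMS & names:
                counts[term] += 1
        elif elem.tag == "BioSample":
            elem.clear()  # drop the finished record to keep memory use flat
    return counts

# Made-up record, only to show the assumed layout.
sample = io.BytesIO(
    b'<BioSampleSet><BioSample accession="EXAMPLE"><Attributes>'
    b'<Attribute attribute_name="otu_db" harmonized_name="otu_db">NCBI Viral RefSeq;83</Attribute>'
    b'<Attribute attribute_name="depth">5</Attribute>'
    b'</Attributes></BioSample></BioSampleSet>'
)
counts = count_terms(sample)
print(dict(counts))
```

In a real run, `xml_source` would be an open file handle on the downloaded dump rather than the in-memory sample.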

only1chunts commented 11 months ago

Email communication from @simroux (one of the original creators of the MIUVIG checklist):

Thanks for getting in touch, and yes I was part of the MIUViG team so happy to help if I can here. The way we defined these three terms was (to my understanding) as follows:

otu_seq_comp_appr: This is for the specific tool used when performing the all-vs-all sequence comparison to build OTUs. For MIUViG, vOTUs are based on genome-wide comparison, so people can use e.g. blastn to compare their genome, or they can use e.g. MUMMER, LAST, etc. Each of these tools comes with its own cutoffs / parameters, which will have an impact on the downstream process, so that's what we intended to capture here.

otu_class_appr: This is for the tool and cutoff used when processing the results of these pairwise comparisons to build vOTUs, i.e. which clustering algorithm / logic, and what cutoffs were used to define a vOTU.
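To make the two-step split concrete, here is a minimal sketch in which the pairwise results (what otu_seq_comp_appr would document) are taken as given and a greedy incremental clustering (what otu_class_appr would document) is applied on top. The `greedy_cluster` function, the toy genomes and the pairwise values are all hypothetical; the 95% ANI / 85% AF defaults come from the term's example value, and real tools may implement the clustering differently.

```python
from typing import Dict, List, Tuple

# Hypothetical all-vs-all output: (query, target) -> (ANI %, aligned fraction %).
Pairwise = Dict[Tuple[str, str], Tuple[float, float]]

def greedy_cluster(
    lengths: Dict[str, int],
    hits: Pairwise,
    ani_cutoff: float = 95.0,
    af_cutoff: float = 85.0,
) -> Dict[str, List[str]]:
    """Assign each sequence to the first representative it matches."""
    clusters: Dict[str, List[str]] = {}
    # Longest sequences are processed first so they become representatives.
    for seq in sorted(lengths, key=lambda s: (-lengths[s], s)):
        for rep in clusters:
            ani, af = hits.get((seq, rep), (0.0, 0.0))
            if ani >= ani_cutoff and af >= af_cutoff:
                clusters[rep].append(seq)
                break
        else:
            # No representative passed both cutoffs: start a new cluster.
            clusters[seq] = [seq]
    return clusters

lengths = {"g1": 50000, "g2": 48000, "g3": 30000}
hits = {
    ("g2", "g1"): (97.2, 91.0),  # passes both cutoffs
    ("g3", "g1"): (96.0, 40.0),  # ANI passes, aligned fraction does not
    ("g3", "g2"): (80.0, 90.0),
}
clusters = greedy_cluster(lengths, hits)
print(clusters)  # {'g1': ['g1', 'g2'], 'g3': ['g3']}
```

Changing either the comparison tool (which produces `hits`) or the cutoffs and clustering logic can change the resulting OTUs, which is why the two are recorded separately.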

I'm not sure how these apply to amplicon work, I suspect the latter could be reused but the former does not really apply because, as far as I understand, you don't really need to perform all-vs-all alignment for amplicons (since you know you are working with a defined sequence, you can compute identity percentage directly?).

Now to be perfectly honest, I don't know that these terms have been used, these were defined by the MIUViG working group as "it would be really nice if people included this information", but I think I remember these are optional, so not sure how many people filled this in. I would also argue these are relatively "minor" terms, i.e. if they were to be modified, it would not fundamentally change the relevance or usage of the MIUViG checklist.

Given those comments together with the numbers (or lack thereof!) provided by @turbomam above (thanks), I think it's safe to make changes to these terms in whatever way we see fit to enable them to be understood and used by a wider audience.

From my understanding of it:

otu_seq_comp_appr refers to the initial identification of the virus OTUs from the set of assembled OTUs
otu_class_appr refers to the methods used to cluster those identified virus OTUs.

So I agree otu_class_appr is directly applicable to amplicon sequencing and could be included in the MIMARKS checklists, but given there is no need to identify the amplicon (as only the amplicons will have been sequenced), otu_seq_comp_appr doesn't seem to apply to MIMARKS. However, if someone were to go looking for, say, 16S rRNA amplicons within a metagenome, then it could be applicable to MIMS (maybe?). Similarly, the otu_db term could be applicable to MIMARKS studies, with examples of GreenGenes, SILVA, etc.

LynnDelgat commented 10 months ago

@only1chunts Thank you very much for the clarification! So if I try to summarize from the metabarcoding/MIMARKS standpoint:

otu_seq_comp_appr: not applicable
otu_class_appr: clustering tool and parameters
otu_db: reference database used for clustering (so only applicable to (closed/open-)reference OTU clustering, not de novo OTU clustering)

If my understanding above is correct, I would propose to adapt the definitions to make the terms more generally applicable. A possible suggestion:

| Structured comment name | Current definition | Proposed definition | Note |
| -- | -- | -- | -- |
| otu_class_appr | Cutoffs and approach used when clustering “species-level” OTUs. Note that results from standard 95% ANI / 85% AF clustering should be provided alongside OTUS defined from another set of thresholds, even if the latter are the ones primarily used during the analysis | Cutoffs and approach used when clustering reads. | "Cutoffs and approach" could also be replaced by "tool and parameters" or by "method", whichever is most appropriate to align with other terms. |
| otu_db | Reference database (i.e. sequences not generated as part of the current study) used to cluster new genomes in "species-level" OTUs, if any | Reference database (i.e. sequences not generated as part of the current study) used to cluster reads, if any | |

Let me know if this makes sense, and if I need to create a separate issue for the adaptation of the definition of these terms.

sformel-usgs commented 2 months ago

Sorry for the long delay on following up on this! I agree that simple tweaks should be made so that this issue can be closed, but that we need to consider generalizing these terms to include denoising, as discussed in #746. I'm not sure if a separate issue is needed to officially adapt the definitions, but I'm in favor of the definitions proposed by @LynnDelgat above. Additional context below.

From an eDNA/metabarcoding community perspective, these terms might not be represented in the NCBI Biosamples, but they are heavily discussed as identical terms in the DwC extension, and the subsequent GBIF/OBIS guidance on that extension. That being said, making changes "to enable them to be understood and used by a wider audience", is aligned with the eDNA community discussion.

The definitions suggested by @LynnDelgat above work, but it's important to know that the term otu_seq_comp_appr is in use for metabarcoding through the DwC extension and GBIF/OBIS guidance. It's unclear to what degree this is analogous to tax_class, so that might be another point of confusion here. @miwa582 Perhaps this could be addressed in the new eDNA checklist development?

These are the terms, examples, and descriptions included in the GBIF/OBIS guidance. I'm taking an action for myself to get the extension and guidance matching, since the guidance seems to ignore the viral component in the definitions and doesn't acknowledge the tax_class field.

Term	Example	Description
otu_class_appr	"dada2; 1.14.0; ASV"	Approach/algorithm and clustering level (if relevant) when defining OTUs or ASVs
otu_seq_comp_appr	"blastn;2.6.0+;e-value cutoff: 0.001"	Tool and thresholds used to assign "species-level" names to OTUs or ASVs
otu_db	"Genbank nr;221", "UNITE;8.2"	Reference database (i.e. sequences not generated as part of the current study) used to assigning taxonomy to OTUs or ASVs

turbomam commented 2 months ago

Thanks @sformel-usgs. I appreciate the holistic perspective.

I am opposed to terms that take semi-colon composed values like those examples, unless the submitter provides validation patterns. I can help with that but can't do it on my own.
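As a starting point for that discussion, here is a rough sketch of what such validation patterns might look like, derived only from the "Value syntax" column quoted earlier in this thread. The regular expressions and the `is_valid` helper are assumptions for illustration, not patterns that exist in MIxS today.

```python
import re

# Hypothetical patterns, one per term, mirroring the documented value syntax.
PATTERNS = {
    # {ANI cutoff};{AF cutoff};{clustering method}
    "otu_class_appr": re.compile(r"\d+(\.\d+)?% ANI;\s*\d+(\.\d+)?% AF;\s*\S.*"),
    # {software};{version};{parameters}
    "otu_seq_comp_appr": re.compile(r"[^;]+;\s*[^;]+;\s*\S.*"),
    # {database};{version}
    "otu_db": re.compile(r"[^;]+;\s*[^;]+"),
}

def is_valid(term: str, value: str) -> bool:
    """Return True when the whole value matches the pattern for the term."""
    return PATTERNS[term].fullmatch(value) is not None

# The MIxS examples pass.
print(is_valid("otu_class_appr", "95% ANI;85% AF; greedy incremental clustering"))
print(is_valid("otu_seq_comp_appr", "blastn;2.6.0+;e-value cutoff: 0.001"))
print(is_valid("otu_db", "NCBI Viral RefSeq;83"))
# The GBIF/OBIS example for otu_class_appr does not fit the MIxS syntax.
print(is_valid("otu_class_appr", "dada2; 1.14.0; ASV"))
```

The last call returns False, which illustrates how the two sets of guidance currently disagree about the shape of the value, not only about its definition.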

Or we could just go on the record and flag some MIxS terms as being "not intended for searching or filtering". Meaning they might be useful to somebody who already found samples through some other search strategy, and they may add some additional understanding on top of that, but they are not intended as a first pass filter.

LynnDelgat commented 2 months ago

@sformel-usgs Thanks for your input! Indeed, the GBIF guidelines seem to not match the MIxS definitions in certain cases, which is what triggered me to open this issue so that we can hopefully align them. My idea was to suggest a change in the mapping of the GBIF guidelines, so that the GBIF recommendations would match the MIxS definitions, once the definitions of the terms were clarified here. (However, I am not sure how to deal with the change in use of otu_seq_comp_appr that would result from that.)

sformel-usgs commented 2 months ago

@turbomam I understand your preference to avoid the semi-colon composed values. I always imagined these terms were a bit of a stopgap to get people to record something and I think prescribing more structure would be an improvement. Given @LynnDelgat's comments above, #746, the eDNA checklist development group, and my own experience, it seems like the time is ripe to tackle breaking these terms into a structured set of terms with input from both the GSC and the TDWG communities.

But first, we should resolve the discussion about definitions. @LynnDelgat I think your plan is a good one, and I don't think we need to worry about the change in use of otu_seq_comp_appr. Since the definition won't be changing, and it doesn't appear in the NCBI BioSamples, the only ripples would be including the term in checklists other than MIUViG and making sure that the guidance is aligned between the GSC and TDWG communities. I don't think the GBIF guidance is a misuse of the term beyond ignoring that it was restricted to viral genomes.

So, like you suggested above, we should:

Get the definitions revised (assuming others agree with @LynnDelgat's proposed definitions)
Ask the TDWG folks to revise the definitions in the extension.
Work with GBIF/TDWG to align the GBIF guidance.
Discuss revising the prescribed semi-colon separated values, possibly by decomposing the term into a structured set of terms.
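To illustrate the last point, here is a small sketch of how one semi-colon composed value could be decomposed into separate fields while still round-tripping to the current string form. The `SeqCompApproach` class and its field names (`software`, `version`, `parameters`) are hypothetical and simply follow the existing `{software};{version};{parameters}` syntax; they are not proposed term names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeqCompApproach:
    """Hypothetical structured form of an otu_seq_comp_appr value."""
    software: str
    version: str
    parameters: str

    @classmethod
    def from_legacy(cls, value: str) -> "SeqCompApproach":
        # Split on the first two semicolons only, so that the parameters
        # part may itself contain semicolons.
        parts = [part.strip() for part in value.split(";", 2)]
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected software;version;parameters, got {value!r}")
        return cls(*parts)

    def to_legacy(self) -> str:
        # Rebuild the semi-colon composed string used by the current term.
        return ";".join((self.software, self.version, self.parameters))

approach = SeqCompApproach.from_legacy("blastn;2.6.0+;e-value cutoff: 0.001")
print(approach.software, approach.version, approach.parameters, sep=" | ")
print(approach.to_legacy())
```

Keeping a conversion back to the legacy string would let existing records and the current guidance coexist with a structured set of terms during any transition.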

What do you think?