Open LynnDelgat opened 1 year ago
Thank you for your suggestions. I am away at the moment, but someone will add labels to this ticket so that we discuss and resolve this query at our next monthly call (you are also welcome to attend; I can share details later). The resolution will be added here.
On Thu, 10 Aug 2023, 20:23 LynnDelgat, @.***> wrote:
When trying to figure out how to document metabarcoding data, I came across some things that were not clear to me. Any insight into this question is much appreciated, and perhaps the term definitions could be clarified if needed.
The standard contains 3 different fields related to OTUs/clustering:
- otu_class_appr: Cutoffs and approach used when clustering “species-level” OTUs.
- otu_seq_comp_appr: Tool and thresholds used to compare sequences when computing "species-level" OTUs
- otu_db: Reference database (i.e. sequences not generated as part of the current study) used to cluster new genomes in "species-level" OTUs, if any
I am not sure how otu_seq_comp_appr differs from otu_class_appr. Both terms seem to be about the cutoffs and approach used to cluster OTUs (assuming approach = tool, thresholds = cutoffs, and clustering OTUs = computing OTUs). The only real difference between the 2 definitions seems to be the addition of "used to compare sequences" for otu_seq_comp_appr, but it is not clear to me what is meant by this.
Does it mean comparing reads to an external reference when clustering? That is, would this term be used in (closed/open-)reference OTU clustering, while otu_class_appr would be used in de novo OTU clustering (and also open-reference)? If so, would 'Tool and thresholds used to compare reads to a reference database when computing "species-level" OTUs' be a good definition for this term (perhaps also adding "de novo" to the otu_class_appr definition)? And if this is the case, I assume otu_seq_comp_appr should always be used together with a field specifying the reference db? Could the definition of otu_db then be changed from "genomes" to "reads", or something else more broadly applicable?
Or alternatively, is otu_seq_comp_appr meant to define the tool and thresholds used to taxonomically annotate sequences? If so, should the definition be updated to something like "Tool and thresholds used to compare sequences when taxonomically annotating" or "Tool and thresholds used for taxonomic annotation" (which could then apply to both OTUs and ASVs)?
Also, these 3 terms are found in the MIUVIG checklist; it would make sense to me to add them to the MIMARKSSurvey checklist (and perhaps MIMARKSSpecimen?) as well.
@only1chunts Thank you. I noticed no labels have been added yet, and I don't know when the next monthly call is. I would be happy to attend if I can.
Regularly scheduled calls take place on the 4th TUESDAY of each month, at 8AM PT/11AM ET/4PM BST/5PM CET (the next one is due 26th Sept). We send out notifications of meetings (including the Zoom link) to the CIG Google group; you can join the group here: https://groups.google.com/u/2/g/gensc-cig
FAO CIG members (inc @lschriml @ramonawalls @mslarae13) :
These are the current details of the 3 terms in question:
Structured comment name | Item (rdfs:label) | Definition | Expected value | Value syntax | Example | MIXS ID |
---|---|---|---|---|---|---|
otu_class_appr | OTU classification approach | Cutoffs and approach used when clustering “species-level” OTUs. Note that results from standard 95% ANI / 85% AF clustering should be provided alongside OTUS defined from another set of thresholds, even if the latter are the ones primarily used during the analysis | cutoffs and method used | {ANI cutoff};{AF cutoff};{clustering method} | 95% ANI;85% AF; greedy incremental clustering | MIXS:0000085 |
otu_seq_comp_appr | OTU sequence comparison approach | Tool and thresholds used to compare sequences when computing "species-level" OTUs | software name, version and relevant parameters | {software};{version};{parameters} | blastn;2.6.0+;e-value cutoff: 0.001 | MIXS:0000086 |
otu_db | OTU database | Reference database (i.e. sequences not generated as part of the current study) used to cluster new genomes in "species-level" OTUs, if any | database and version | {database};{version} | NCBI Viral RefSeq;83 | MIXS:0000087 |
1 - to align with other terms we should consider changing the name to include "method" instead of "approach"
2 - definitions need to be more generic (and understandable to everyone) and suitable for use in other checklists
3 - where acronyms are used they should be spelt out (what is ANI or AF?)
4 - consider which other checklists should include these as expected or conditional mandatory (MIMARKS?)
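As a rough illustration of how the "Value syntax" column above is meant to be read, here is a minimal Python sketch that splits a semicolon-delimited value into its named parts. The `VALUE_SYNTAX` mapping and the `parse_value` helper are hypothetical names made up for this example (they are not part of MIxS or any existing tool); the field names and example values are copied from the table.

```python
# Hypothetical mapping, transcribed from the "Value syntax" column of the table.
VALUE_SYNTAX = {
    "otu_class_appr": ["ANI cutoff", "AF cutoff", "clustering method"],
    "otu_seq_comp_appr": ["software", "version", "parameters"],
    "otu_db": ["database", "version"],
}


def parse_value(term, value):
    """Split a semicolon-delimited value into the parts named by its syntax."""
    fields = VALUE_SYNTAX[term]
    # Split at most len(fields) - 1 times, so that any extra semicolons
    # stay inside the last field (e.g. a longer parameter string).
    parts = [part.strip() for part in value.split(";", len(fields) - 1)]
    if len(parts) != len(fields) or any(not part for part in parts):
        raise ValueError(f"{term}: expected {len(fields)} non-empty parts, got {value!r}")
    return dict(zip(fields, parts))


# Example values taken from the "Example" column of the table.
class_appr = parse_value("otu_class_appr", "95% ANI;85% AF; greedy incremental clustering")
seq_comp = parse_value("otu_seq_comp_appr", "blastn;2.6.0+;e-value cutoff: 0.001")
otu_db = parse_value("otu_db", "NCBI Viral RefSeq;83")
```

Seen this way, the current syntax for otu_class_appr hard-codes ANI and AF cutoffs, which is one concrete reason the term would need a more generic syntax before it could be reused for amplicon data.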
Just out of curiosity, @turbomam are you able to provide some numbers on how many times each of these 3 terms has been used in the BioSamples database please?
Notes from CIG call 26 Sep 2023
I don't see any appearances of otu_class_appr, otu_seq_comp_appr, or otu_db in the harmonized_names and attribute_names from NCBI Biosample, downloaded July 2023.
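For anyone who wants to repeat this kind of check, a minimal sketch follows. It is not necessarily how the numbers above were obtained. It assumes the BioSample download is XML in which each `BioSample` element contains `Attribute` elements carrying `attribute_name` and (optionally) `harmonized_name` attributes; the function name `count_term_usage`, the inline sample record, and its accession are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

TERMS = {"otu_class_appr", "otu_seq_comp_appr", "otu_db"}


def count_term_usage(xml_source, terms=TERMS):
    """Count Attribute elements whose attribute_name or harmonized_name is one of the terms."""
    counts = Counter({term: 0 for term in terms})
    # iterparse streams the file, so a large dump does not need to fit in memory.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "Attribute":
            names = {elem.get("harmonized_name"), elem.get("attribute_name")}
            for term in terms & names:
                counts[term] += 1
        elif elem.tag == "BioSample":
            elem.clear()  # drop the finished record's children to free memory
    return counts


# Made-up record (assumed layout, placeholder accession) to exercise the function.
sample = b"""<BioSampleSet>
  <BioSample accession="SAMN00000000">
    <Attributes>
      <Attribute attribute_name="otu_db" harmonized_name="otu_db">NCBI Viral RefSeq;83</Attribute>
      <Attribute attribute_name="depth">10</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>"""

counts = count_term_usage(io.BytesIO(sample))
```

`count_term_usage` also accepts a file path or an open (for example gzip-decompressed) file object, since `ET.iterparse` takes either.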
Email communication from @simroux (one of the original creators of the MIUVIG checklist):
Thanks for getting in touch, and yes I was part of the MIUViG team so happy to help if I can here. The way we defined these three terms was (to my understanding) as follows:
- otu_seq_comp_appr: This is for the specific tool used when performing the all-vs-all sequence comparison to build OTUs. For MIUViG, vOTUs are based on genome-wide comparison, so people can use e.g. blastn to compare their genomes, or they can use e.g. MUMMER, LAST, etc. Each of these tools comes with its own cutoffs / parameters, which will have an impact on the downstream process, so that's what we intended to capture here
- otu_class_appr: This is for the tool and cutoff used when processing the results of these pairwise comparisons to build vOTUs, i.e. which clustering algorithm / logic, and what cutoffs were used to define a vOTU.
I'm not sure how these apply to amplicon work, I suspect the latter could be reused but the former does not really apply because, as far as I understand, you don't really need to perform all-vs-all alignment for amplicons (since you know you are working with a defined sequence, you can compute identity percentage directly ?).
Now to be perfectly honest, I don't know that these terms have been used, these were defined by the MIUViG working group as "it would be really nice if people included this information", but I think I remember these are optional, so not sure how many people filled this in. I would also argue these are relatively "minor" terms, i.e. if they were to be modified, it would not fundamentally change the relevance or usage of the MIUViG checklist.
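To make the two stages described in this email concrete, here is a minimal Python sketch. It assumes the all-vs-all comparison (the otu_seq_comp_appr stage, e.g. blastn) has already produced an ANI (average nucleotide identity) and AF (aligned fraction) value per sequence pair, and it shows only the second stage (otu_class_appr): a simple greedy incremental clustering with the 95% ANI / 85% AF cutoffs from the table. The function name, the input layout, and the toy numbers are assumptions for illustration, not the MIUViG reference procedure.

```python
def greedy_cluster(lengths, pairwise, ani_cutoff=95.0, af_cutoff=85.0):
    """Greedy incremental clustering over precomputed pairwise comparisons.

    lengths:  {sequence_id: sequence_length}
    pairwise: {(query_id, representative_id): (ani_percent, af_percent)}
    Returns {representative_id: [member_ids]}.
    """
    representatives = []
    clusters = {}
    # Process the longest sequences first, so they become the representatives.
    for seq in sorted(lengths, key=lambda s: (-lengths[s], s)):
        for rep in representatives:
            ani, af = pairwise.get((seq, rep), (0.0, 0.0))
            if ani >= ani_cutoff and af >= af_cutoff:
                clusters[rep].append(seq)  # join the first matching cluster
                break
        else:
            # No existing representative is close enough: start a new cluster.
            representatives.append(seq)
            clusters[seq] = [seq]
    return clusters


# Toy input (made-up numbers): "b" matches "a" on both cutoffs, while "c"
# aligns to "a" over too small a fraction and so starts its own cluster.
lengths = {"a": 5000, "b": 4800, "c": 3000}
pairwise = {
    ("b", "a"): (97.0, 90.0),
    ("c", "a"): (96.0, 60.0),
    ("c", "b"): (99.0, 95.0),
}
clusters = greedy_cluster(lengths, pairwise)
```

Note that "c" is never compared against "b", because "b" is not a representative. That is a property of the greedy approach, and it is why both the comparison tool and the clustering logic affect the final OTUs.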
Given those comments, together with the numbers (or lack thereof!) provided by @turbomam above (thanks), I think it's safe to change these terms in whatever way we see fit so that they can be understood and used by a wider audience.
From my understanding of it:
- otu_seq_comp_appr refers to the initial identification of the virus OTUs from the set of assembled OTUs
- otu_class_appr refers to the methods used to cluster those identified virus OTUs.
So I agree otu_class_appr is directly applicable to amplicon sequencing and could be included in the MIMARKS checklists. Given there is no need to identify the amplicon (as only the amplicons will have been sequenced), otu_seq_comp_appr doesn't seem to apply to MIMARKS. However, if someone were to go looking for, say, 16S rRNA amplicons within a metagenome, then it could be applicable to MIMS (maybe?).
Similarly, the otu_db term could be applicable to MIMARKS studies, with examples of GreenGenes, SILVA etc
@only1chunts Thank you very much for the clarification! So if I try to summarize from the metabarcoding/MIMARKS standpoint:
- otu_seq_comp_appr: not applicable
- otu_class_appr: clustering tool and parameters
- otu_db: reference database used for clustering (so only applicable to (closed/open-)reference OTU clustering, not de novo OTU clustering)
If my understanding above is correct, I would propose to adapt the definitions to make the terms more generally applicable. A possible suggestion:
Structured comment name | Current definition | Proposed definition | Note |
---|---|---|---|
otu_class_appr | Cutoffs and approach used when clustering “species-level” OTUs. Note that results from standard 95% ANI / 85% AF clustering should be provided alongside OTUS defined from another set of thresholds, even if the latter are the ones primarily used during the analysis | Cutoffs and approach used when clustering reads. | "Cutoffs and approach" could also be replaced by "tool and parameters" or by "method", whichever is most appropriate to align with other terms. |
otu_db | Reference database (i.e. sequences not generated as part of the current study) used to cluster new genomes in "species-level" OTUs, if any | Reference database (i.e. sequences not generated as part of the current study) used to cluster reads, if any | |