GenomicsStandardsConsortium / mixs

“Minimum Information about any (X) Sequence” (MIxS) specification
https://w3id.org/mixs
Creative Commons Zero v1.0 Universal

Add term for denoising method #746

Open LynnDelgat opened 9 months ago

LynnDelgat commented 9 months ago

New term details

Term name - Denoising approach
Structured comment name - denoising_appr
Definition - Tool and parameters used to denoise sequence reads.
Expected value - algorithm name, version and relevant parameters
Value syntax - {algorithm name and version};{parameter name 1:parameter value 1, parameter name 2:parameter value 2,...}
Example - UNOISE3;alpha:2
Preferred unit - NA
Extension(s) - MimarksS, MimarksC
Relationship to other MIXS terms - NA
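
For illustration, a minimal sketch of how a consumer might split a value written in the proposed syntax into a tool string and a parameter mapping. The function name `parse_denoising_appr` is hypothetical and not part of MIxS, and the sketch assumes that `;`, `,` and `:` never occur inside names or values.

```python
def parse_denoising_appr(value: str) -> tuple[str, dict[str, str]]:
    """Split '{algorithm name and version};{name:value, name:value,...}'.

    Hypothetical helper, not part of MIxS. Assumes the delimiters
    ';', ',' and ':' do not occur inside names or values.
    """
    # Everything before the first ';' is the algorithm name and version.
    tool, _, params_text = value.partition(";")
    params: dict[str, str] = {}
    for pair in params_text.split(","):
        if not pair.strip():
            continue  # no parameters given, or a trailing comma
        name, sep, val = pair.partition(":")
        if not sep:
            raise ValueError(f"parameter without a value: {pair!r}")
        params[name.strip()] = val.strip()
    return tool.strip(), params


# The example from the proposal:
print(parse_denoising_appr("UNOISE3;alpha:2"))  # ('UNOISE3', {'alpha': '2'})
```

Because the tool string and parameter names are free text, two submitters can still write the same method differently, so a parser like this recovers structure but not comparability.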

Additional context - Denoising is a widely used method in (meta)barcoding and is an essential step in the bioinformatics processing pipeline. A specific term to document this step is currently missing. More information can also be found in this GitHub issue: https://github.com/gbif/doc-publishing-dna-derived-data/issues/147

LynnDelgat commented 3 months ago

Alternatively, "otu_class_appr" could be made more inclusive (see also https://github.com/GenomicsStandardsConsortium/mixs/issues/603), so that it covers not only clustering methods but also denoising methods (e.g. "Tool and parameters used when clustering and/or denoising reads."). In that case, it would be useful to have a term where the broad category of the method (clustering vs. denoising) can be indicated with a controlled vocabulary, so that users can easily distinguish whether a clustering or denoising method was used. (A disadvantage of also using "otu_class_appr" for denoising methods, however, is that the name would be quite misleading, since it contains "otu".)

sformel-usgs commented 2 months ago

As I think about data re-use, I'm with @LynnDelgat on this one. We should create a new term that captures both methods of grouping reads, without the misleading 'otu' prefix.

The point of this information is to understand how the reads were grouped so that a representative sequence could be analyzed. No doubt there will be new methods in the future, and I think the best way forward is a term that acts as a coarse filter (the controlled vocab mentioned above) and then additional terms that capture the nuanced variation of the methods.

turbomam commented 2 months ago

@LynnDelgat can you please give some more examples of valid and invalid values for this proposed denoising_appr term?

turbomam commented 2 months ago

I am concerned about the ... part of the parameter name & value pattern, and the fact that there are no constraints on the 'algorithm name and version' or the 'parameter name'.

Do we aspire that different submitters using this new term will populate it in the same way, to enable meaningful searches and groupings?

LynnDelgat commented 2 months ago

@turbomam To enable meaningful searches and groupings, it would probably be easier to split each component into separate fields, but I suggested this term in analogy with other existing terms, which all seem to group software, version and parameters in one field. From a personal viewpoint, this field (or otu_class_appr, if we decide to make that one more inclusive) is meant to document provenance/methodology, and it will always be difficult to run meaningful searches on it, since people could write the algorithm name or parameter names however they like (in the absence of controlled vocabularies for them). For a field to filter on, one describing the broad category of the method with a controlled vocabulary would suffice for our intended use. However, other people might of course need more detailed searches. I'll try to be clearer so that a constraining pattern could be added if needed (though I am not sure I am the best person to determine this):

Revised value syntax - {algorithm name};{versionnumber};({parameter name 1:parameter value 1,parameter name 2:parameter value 2,...,parameter name n:parameter value n}|{"default parameters"})

Examples:

So the proposed pattern would be something like: any character any number of times (min. 1), then ";", then any character any number of times (min. 1), then ";", and then either "default parameters" or one or more repetitions of: any character any number of times (min. 1), then ":", then any character any number of times (min. 1), with "," separating the repetitions. But I am not sure whether that is too restrictive, because if a data provider is only willing or able to provide the algorithm name, we would still like to be able to capture that. So probably these should also be allowed:
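
For illustration, the strict pattern described in prose in this comment can be sketched as a regular expression. This is one possible reading, not an agreed MIxS pattern: the names `DENOISING_APPR` and `is_valid` are hypothetical, "any character" is narrowed to "any character except the delimiters" so that the match is unambiguous, and the tool names and version strings in the usage lines are made up. The relaxed forms (for example, algorithm name only) are not covered, because the thread does not spell them out.

```python
import re

# Assumption: the delimiters ';', ':' and ',' may not appear inside
# names or values. The prose says "any character"; excluding the
# delimiters is a narrowing made here to keep the match unambiguous.
PAIR = r"[^;:,]+:[^;:,]+"  # one "parameter name:parameter value" pair

# {algorithm name};{version};("default parameters" | pair(,pair)*)
DENOISING_APPR = re.compile(
    rf"[^;]+;[^;]+;(?:default parameters|{PAIR}(?:,{PAIR})*)"
)


def is_valid(value: str) -> bool:
    """Return True if the whole string matches the strict pattern."""
    return DENOISING_APPR.fullmatch(value) is not None


# Made-up strings for illustration only:
print(is_valid("ExampleTool;1.0;alpha:2,minsize:8"))    # True
print(is_valid("ExampleTool;1.0;default parameters"))   # True
print(is_valid("ExampleTool;1.0;alpha"))                # False
print(is_valid("UNOISE3;alpha:2"))                      # False
```

The last line shows that the example from the original proposal, which has no separate version field, would be rejected by the revised three-part syntax.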

turbomam commented 2 months ago

Thanks @LynnDelgat for the additional valid examples. I agree that there is a strong precedent for pseudo-specifications, like you provided, in MIxS, and I appreciate your effort at consistency. And I don't think you are accountable for solving these problems. Ultimately, I'm responding to this proposal for other technical implementers to review.

Having said that, I see this kind of specification as one of MIxS' greatest weaknesses. These terms are not at all machine actionable, and in my experience they aren't very useful for human review either. I can give some examples from the INSDC Biosample records if you want.

So, if you're interested in discussing this more, my next questions would be