gbif / doc-publishing-dna-derived-data

This guide shows how to publish DNA-derived spatiotemporal biodiversity data and make it discoverable through national and global biodiversity data discovery platforms. Based on experiences from Australia, Norway, Sweden, UNITE, and GBIF.
https://doi.org/10.35035/doc-vf1a-nr22

Taxonomic annotation context #197

Closed by pieterprovoost 10 months ago

pieterprovoost commented 10 months ago

Most, if not all, taxonomic annotation algorithms require setting a number of arbitrary thresholds (alignment length, percent identity, confidence level, etc.), so we thought it could be useful to give users more context by adding more detailed algorithm results to identificationRemarks. This could include, for example, confidence scores at all taxonomic levels from RDP classifier, or percent identity for multiple hits from VSEARCH. The current example in Table 2 has only a single confidence score for the selected identification. Would it be appropriate to add this information to identificationRemarks like this, or are there better options?

Identification based on RDP classifier with confidence cutoff 0.8, VSEARCH against MIDORI_UNIQ_GB246_CO1 for verification. RDP classifier confidence: sk:Eukaryota 1.0, k:Metazoa 1.0, p:Chordata 1.0, c:Actinopteri 1.0, o:Cypriniformes 1.0, f:Cyprinidae 1.0, g:Cyprinus 1.0, s:Cyprinus_carpio 0.88; VSEARCH identity: MK843656 s:Cyprinus_carpio 100.0, MK843657 s:Cyprinus_carpio 100.0, MF805658 s:Cyprinus_rubrofuscus 100.0, KJ135626 s:Pseudorasbora_parva 100.0
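A remark in this shape could be assembled programmatically from parsed classifier output. The following is only a minimal sketch: the function name `format_identification_remarks` and the tuple layouts are hypothetical, and it assumes the RDP classifier and VSEARCH results have already been parsed into Python lists. It does not call either tool.

```python
def format_identification_remarks(method_note, rdp_confidence, vsearch_hits):
    """Build an identificationRemarks string (hypothetical helper).

    method_note    -- free-text description of the method, without trailing period
    rdp_confidence -- list of (rank_prefix, taxon, confidence) tuples
    vsearch_hits   -- list of (accession, rank_prefix, taxon, percent_identity) tuples
    """
    # One "rank:taxon confidence" entry per taxonomic level.
    rdp_part = ", ".join(
        f"{rank}:{taxon} {conf}" for rank, taxon, conf in rdp_confidence
    )
    # One "accession rank:taxon identity" entry per database hit.
    vsearch_part = ", ".join(
        f"{acc} {rank}:{taxon} {pid}" for acc, rank, taxon, pid in vsearch_hits
    )
    return (
        f"{method_note}. "
        f"RDP classifier confidence: {rdp_part}; "
        f"VSEARCH identity: {vsearch_part}"
    )


# Values copied from the example remark above.
rdp = [
    ("sk", "Eukaryota", 1.0), ("k", "Metazoa", 1.0), ("p", "Chordata", 1.0),
    ("c", "Actinopteri", 1.0), ("o", "Cypriniformes", 1.0),
    ("f", "Cyprinidae", 1.0), ("g", "Cyprinus", 1.0),
    ("s", "Cyprinus_carpio", 0.88),
]
hits = [
    ("MK843656", "s", "Cyprinus_carpio", 100.0),
    ("MK843657", "s", "Cyprinus_carpio", 100.0),
    ("MF805658", "s", "Cyprinus_rubrofuscus", 100.0),
    ("KJ135626", "s", "Pseudorasbora_parva", 100.0),
]
remarks = format_identification_remarks(
    "Identification based on RDP classifier with confidence cutoff 0.8, "
    "VSEARCH against MIDORI_UNIQ_GB246_CO1 for verification",
    rdp,
    hits,
)
print(remarks)
```

Keeping a fixed `label: entries; label: entries` layout would let a downstream user split the remark back into per-rank scores and per-hit identities, although identificationRemarks remains a free-text field.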

tobiasgf commented 10 months ago

Yes, it is a good idea to give a richer example like the one you show there. Some publishers already do this, as in this dataset (occurrence).