AVR-biosecurity-bioinformatics / mimir

A Nextflow pipeline to curate DNA reference databases for metabarcoding

Utilising 'non-standard' ranks in BOLD and GenBank #1

Open jackscanlan opened 1 week ago

jackscanlan commented 1 week ago

Currently, the pipeline standardises taxonomy using NCBI's rankedlineage.dmp taxdump file, which only has the ranks kingdom, phylum, class, order, family, genus and species. However, the full taxonomic lineages of most NCBI sequences contain ranks outside of these 7, so some taxonomic information is being lost.

Additionally, BOLD's taxonomy system includes subfamily, tribe and subspecies as ranks for some sequences, and some of those sequences have their lowest assignment at subfamily or tribe. Fitting these into the 7 ranks would mean a sequence identified down to tribe could only be assigned to family level, which is two ranks higher and discards the more specific identification.

In theory you could just increase the number of ranks in the database. However, many sequences are unassigned at, for example, tribe, so those gaps would need to be filled somehow. You could either:

  1. create a dummy taxon for the highest assignment level below the rank: e.g. tribe = NA, genus = Drosophila becomes --> tribe = tribe_Drosophila, genus = Drosophila
  2. create an 'unassigned' label for all sequences at a rank that is missing: e.g. all sequences without a tribe assignment get tribe = unassigned regardless of their higher taxonomy
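
To make the two options concrete, here is a minimal Python sketch. It is not pipeline code: the extended rank list, the `fill_missing_ranks` function and the dict-based lineage format are all hypothetical, chosen only for illustration. It also assumes that only gaps above the lowest assigned rank are filled, so a sequence identified only to tribe keeps genus and species empty.

```python
from typing import Dict, Optional

# Hypothetical extended rank order: the 7 standard ranks plus subfamily and tribe.
RANKS = ["kingdom", "phylum", "class", "order", "family",
         "subfamily", "tribe", "genus", "species"]

Lineage = Dict[str, Optional[str]]


def fill_missing_ranks(lineage: Lineage, strategy: str = "dummy") -> Lineage:
    """Fill gaps in a lineage using option 1 ("dummy") or option 2 ("unassigned").

    Ranks below the lowest assigned rank are left as None (an assumption of
    this sketch, not something the pipeline currently does).
    """
    if strategy not in ("dummy", "unassigned"):
        raise ValueError(f"unknown strategy: {strategy!r}")

    filled: Lineage = {rank: lineage.get(rank) for rank in RANKS}
    lower_name: Optional[str] = None  # nearest assigned taxon below the current rank

    # Walk from species up to kingdom so each gap can see what sits below it.
    for rank in reversed(RANKS):
        name = filled[rank]
        if name is not None:
            lower_name = name
        elif lower_name is not None:
            if strategy == "dummy":
                filled[rank] = f"{rank}_{lower_name}"
            else:
                filled[rank] = "unassigned"
    return filled


fly = {
    "family": "Drosophilidae",
    "genus": "Drosophila",
    "species": "Drosophila melanogaster",
}
print(fill_missing_ranks(fly, "dummy")["tribe"])       # tribe_Drosophila
print(fill_missing_ranks(fly, "unassigned")["tribe"])  # unassigned
```

Note that option 1 keeps each dummy taxon nested under a single parent lineage, whereas option 2 puts the same 'unassigned' label under many different families, which may matter for classifiers that expect each taxon name to have one parent.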

Both solutions raise their own issues for taxonomic assignment. Can IDTAXA or other assignment methods handle missing ranks or dummy taxa? This is unclear, although it has been discussed in passing: https://github.com/benjjneb/dada2/issues/853

jackscanlan commented 1 week ago

For now, the pipeline will remove additional ranks to fit the NCBI rankedlineage.dmp ranks. But this is something to look at in the future.
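
As a rough illustration of what removing the additional ranks means for a record, here is a short Python sketch. The function names and the dict-based lineage are hypothetical and do not reflect how the pipeline actually stores or processes taxonomy; the example record is made up for illustration.

```python
from typing import Dict, Optional, Sequence

# The 7 ranks named above as those used for standardisation.
STANDARD_RANKS = ("kingdom", "phylum", "class", "order",
                  "family", "genus", "species")

# Hypothetical extended order that also includes subfamily and tribe.
EXTENDED_RANKS = ("kingdom", "phylum", "class", "order", "family",
                  "subfamily", "tribe", "genus", "species")

Lineage = Dict[str, Optional[str]]


def collapse_to_standard_ranks(lineage: Lineage) -> Lineage:
    """Keep only the 7 standard ranks; any other rank is dropped."""
    return {rank: lineage.get(rank) for rank in STANDARD_RANKS}


def lowest_assigned_rank(lineage: Lineage, ranks: Sequence[str]) -> Optional[str]:
    """Return the most specific rank in `ranks` that has a name, if any."""
    for rank in reversed(ranks):
        if lineage.get(rank) is not None:
            return rank
    return None


# Illustrative record identified only down to tribe.
record = {
    "family": "Drosophilidae",
    "subfamily": "Drosophilinae",
    "tribe": "Drosophilini",
}
collapsed = collapse_to_standard_ranks(record)
print(lowest_assigned_rank(record, EXTENDED_RANKS))     # tribe
print(lowest_assigned_rank(collapsed, STANDARD_RANKS))  # family
```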