AVR-biosecurity-bioinformatics / mimir

A Nextflow pipeline to curate DNA barcode reference databases for metabarcoding analyses

Internal sequence inputs #2

Open jackscanlan opened 2 weeks ago

jackscanlan commented 10 hours ago

Thoughts on how to format and match taxonomy of internal sequences.

Input options:

  1. [accession]|[internal taxid];[lineage string], e.g. ABC|123;Kingdom;Phylum;Class;Order;Family;Genus;Species
  2. [accession]|[NCBI taxid];[lineage string], e.g. ABC|NCBI:123;Kingdom;Phylum;Class;Order;Family;Genus;Species
  3. [accession]|[NCBI taxid], e.g. ABC|NCBI:123
  4. [accession]|[parent NCBI taxid];[shortened lineage string], e.g. ABC|PARENT:123;NewGenus;NewSpecies (in this example, the parent taxid is at family level)
  5. Any other format -- considered invalid
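
To make the five input options concrete, here is a rough Python sketch of how a header could be classified. It is not pipeline code: the function name `classify_header` and the constant `FULL_LINEAGE_RANKS` are hypothetical, and it assumes that taxids are purely numeric and that a full lineage string always has exactly seven ranks (Kingdom through Species), as in the examples above.

```python
import re

# Assumption: a full lineage string has exactly seven ranks (Kingdom..Species).
FULL_LINEAGE_RANKS = 7

# accession | optional NCBI:/PARENT: prefix, numeric taxid, optional ;lineage
_HEADER = re.compile(
    r"^(?P<accession>[^|;]+)\|"
    r"(?:(?P<prefix>NCBI|PARENT):)?"
    r"(?P<taxid>\d+)"
    r"(?:;(?P<lineage>.+))?$"
)


def classify_header(header):
    """Return the input option (1-5) that a sequence header falls under."""
    match = _HEADER.match(header)
    if match is None:
        return 5  # does not fit the accession|taxid[;lineage] shape at all

    prefix = match.group("prefix")
    lineage = match.group("lineage")
    ranks = lineage.split(";") if lineage else []
    if not all(ranks):
        return 5  # empty rank somewhere in the lineage string

    full = len(ranks) == FULL_LINEAGE_RANKS
    if prefix is None and full:
        return 1  # internal taxid with full lineage
    if prefix == "NCBI" and full:
        return 2  # NCBI taxid with full lineage
    if prefix == "NCBI" and not ranks:
        return 3  # NCBI taxid only
    if prefix == "PARENT" and 0 < len(ranks) < FULL_LINEAGE_RANKS:
        return 4  # parent NCBI taxid with shortened lineage
    return 5  # anything else is invalid


print(classify_header("ABC|NCBI:123"))                        # 3
print(classify_header("ABC|PARENT:123;NewGenus;NewSpecies"))  # 4
```

Note that under these assumptions an internal taxid without a lineage string (e.g. ABC|123) falls under option 5, since there is nothing to look the lineage up against.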

Reformatting/parsing (numbers correspond to the input options above):

  1. Taxid type added: ABC|123;Kingdom;Phylum;Class;Order;Family;Genus;Species >>> ABC|INTERNAL:123;Kingdom;Phylum;Class;Order;Family;Genus;Species
  2. Lineage string is checked against the NCBI taxonomy and replaced if incorrect (the replacement is recorded)
  3. Lineage string added directly from NCBI taxonomy
  4. Lineage string from the parent taxid is combined with new lineage information; taxid is retained in existing format
  5. Sequences are removed -- a pipeline parameter determines whether this raises a terminating error or records a warning
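
The reformatting rules could look roughly like the Python sketch below. Everything named here is hypothetical (`reformat_header`, the `ncbi_lineages` lookup, the `invalid_is_error` flag standing in for the pipeline parameter). It assumes the NCBI taxonomy is available as a dict mapping a taxid string to its list of ranks, that a complete lineage has seven ranks, and that an NCBI or parent taxid missing from the lookup is treated as invalid, which the rules above do not specify.

```python
def reformat_header(header, ncbi_lineages, invalid_is_error=False):
    """Return (reformatted header or None if removed, list of recorded notes)."""
    notes = []
    accession, bar, rest = header.partition("|")
    taxid_field, _, lineage_str = rest.partition(";")
    ranks = lineage_str.split(";") if lineage_str else []
    prefix, colon, taxid = taxid_field.rpartition(":")

    result = None
    if not accession or not bar or not taxid.isdigit() or not all(ranks):
        pass  # malformed header, handled as option 5 below
    elif not colon and len(ranks) == 7:
        # Option 1: mark the unprefixed taxid as internal
        result = f"{accession}|INTERNAL:{taxid};{lineage_str}"
    elif prefix == "NCBI" and taxid in ncbi_lineages and len(ranks) in (0, 7):
        # Options 2 and 3: the NCBI lineage is authoritative
        expected = ncbi_lineages[taxid]
        if ranks and ranks != expected:
            notes.append(f"{accession}: lineage replaced with NCBI lineage")
        result = f"{accession}|NCBI:{taxid};" + ";".join(expected)
    elif prefix == "PARENT" and taxid in ncbi_lineages and ranks:
        # Option 4: parent lineage plus the new lower ranks, taxid kept as-is
        combined = ncbi_lineages[taxid] + ranks
        if len(combined) == 7:
            result = f"{accession}|PARENT:{taxid};" + ";".join(combined)

    if result is None:
        # Option 5: remove the sequence, as an error or a recorded warning
        if invalid_is_error:
            raise ValueError(f"invalid sequence header: {header}")
        notes.append(f"removed invalid header: {header}")
    return result, notes


# Toy lookup: taxid 456 is a family-level parent (five ranks)
lookup = {"456": ["Kingdom", "Phylum", "Class", "Order", "Family"]}
print(reformat_header("ABC|PARENT:456;NewGenus;NewSpecies", lookup)[0])
# ABC|PARENT:456;Kingdom;Phylum;Class;Order;Family;NewGenus;NewSpecies
```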

This means there are four valid taxid types for the entire pipeline: