Fix BLAST results protein to taxonomic accession assignment

KwanLab / Autometa

Autometa: Automated Extraction of Genomes from Shotgun Metagenomes


Currently, the documentation instructs users to download prot.accession2taxid.gz, and the code downloads the same file, but it does not contain all of the nr accessions. Proteins that are not found in prot.accession2taxid.gz are assigned to root, which results in contigs becoming unclassified. The current workaround is to use prot.accession2taxid.FULL.gz instead of prot.accession2taxid.gz. However, the code still needs to be changed to handle missing accessions. Per our meeting today, these should probably be assigned None and then dropped before being handed over to the lowest common ancestor (LCA) step.

Assignment to root that needs to be changed: https://github.com/KwanLab/Autometa/blob/baf61c04dddf5b33bb825dba2841de1e38dffefe/autometa/taxonomy/ncbi.py#L453-L457

    # discard root taxid from set of query's taxids before #L370
    qseqid_taxids.discard(root_taxid)
    # or something like this?
    from autometa.taxonomy.database import TaxonomyDatabase
    qseqid_taxids.discard(TaxonomyDatabase.UNCLASSIFIED_TAXID)
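The proposal above (assign None to accessions that are missing from the accession2taxid file, then drop them before LCA) could be sketched roughly as follows. This is only an illustration, not Autometa's actual code: the function names `assign_taxids` and `taxids_for_lca`, the in-memory `acc2taxid` dict, and the example accessions are all hypothetical.

```python
from typing import Dict, Iterable, Optional, Set


def assign_taxids(
    sseqids: Iterable[str], acc2taxid: Dict[str, int]
) -> Dict[str, Optional[int]]:
    """Map each BLAST subject accession to its taxid.

    Accessions absent from acc2taxid map to None instead of root
    (hypothetical helper, not part of Autometa).
    """
    return {sseqid: acc2taxid.get(sseqid) for sseqid in sseqids}


def taxids_for_lca(assignments: Dict[str, Optional[int]]) -> Set[int]:
    """Drop unassigned (None) accessions so they never reach LCA."""
    return {taxid for taxid in assignments.values() if taxid is not None}


# Made-up accessions and taxids, for illustration only.
acc2taxid = {"WP_000000001.1": 562, "WP_000000002.1": 561}
hits = ["WP_000000001.1", "WP_000000002.1", "XP_999999999.1"]

assignments = assign_taxids(hits, acc2taxid)
qseqid_taxids = taxids_for_lca(assignments)
```

With this approach, a hit whose accession is missing no longer pulls the query's LCA up to root. A query whose hits are all missing ends up with an empty set, so the caller would need to decide how to handle that case (for example, by skipping LCA for that query).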

Fix BLAST results protein to taxonomic accession assignment #317