KwanLab / Autometa

Autometa: Automated Extraction of Genomes from Shotgun Metagenomes
https://autometa.readthedocs.io

Fix BLAST results protein to taxonomic accession assignment #317

Open chasemc opened 1 year ago

chasemc commented 1 year ago

Currently, the documentation instructs users to download prot.accession2taxid.gz, and the code downloads the same file, which doesn't contain all of the nr accessions. Proteins that aren't found in prot.accession2taxid.gz are assigned to root, which results in contigs becoming unclassified. For now, this can be worked around by using prot.accession2taxid.FULL.gz instead of prot.accession2taxid.gz, as shown below. But the code needs to be changed to handle missing accessions. Per our meeting today, these should probably be assigned None and then dropped before being handed over to LCA.

image

Assignment to root that needs to be changed: https://github.com/KwanLab/Autometa/blob/baf61c04dddf5b33bb825dba2841de1e38dffefe/autometa/taxonomy/ncbi.py#L453-L457
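
To make the proposed behavior concrete, here is a minimal sketch of the "assign None, then drop before LCA" idea. It is not Autometa's actual code: the function names `accessions_to_taxids` and `drop_missing`, the in-memory dictionary standing in for prot.accession2taxid.gz, and the accession strings are all hypothetical and used only for illustration.

```python
from typing import Dict, Iterable, Optional


def accessions_to_taxids(
    accessions: Iterable[str], acc2taxid: Dict[str, int]
) -> Dict[str, Optional[int]]:
    """Map each accession to its taxid, or to None when it is absent from the lookup.

    Hypothetical helper: missing accessions are marked None instead of root.
    """
    return {acc: acc2taxid.get(acc) for acc in accessions}


def drop_missing(hits: Dict[str, Optional[int]]) -> Dict[str, int]:
    """Remove accessions with no taxid so they never reach the LCA step."""
    return {acc: taxid for acc, taxid in hits.items() if taxid is not None}


# Toy lookup table standing in for prot.accession2taxid.gz (made-up accessions).
acc2taxid = {"ACC_0001.1": 562, "ACC_0002.1": 1280}

# "ACC_9999.1" is deliberately absent from the lookup table.
hits = accessions_to_taxids(["ACC_0001.1", "ACC_9999.1", "ACC_0002.1"], acc2taxid)
known = drop_missing(hits)
```

With this approach a missing accession never contributes a taxid at all, so it cannot pull a query's LCA up to root.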

evanroyrees commented 1 year ago

This could be handled in multiple places: either when parsing taxids for NCBI specifically, or at the LCA step. Any preference?

https://github.com/KwanLab/Autometa/blob/baf61c04dddf5b33bb825dba2841de1e38dffefe/autometa/taxonomy/lca.py#L365-L399

```python
from autometa.taxonomy.database import TaxonomyDatabase

# Option 1: discard the root taxid from the set of the query's taxids before #L370
qseqid_taxids.discard(root_taxid)
# Option 2: or something like this?
qseqid_taxids.discard(TaxonomyDatabase.UNCLASSIFIED_TAXID)
```

https://github.com/KwanLab/Autometa/blob/baf61c04dddf5b33bb825dba2841de1e38dffefe/autometa/taxonomy/lca.py#L370
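
For the LCA-step option, one detail worth spelling out is what happens when every taxid for a query is the placeholder, so the set is empty after discarding it. The sketch below is a standalone illustration and not the code in lca.py: `taxids_for_lca` and its `sentinel` parameter are hypothetical names, and it assumes the placeholder is the NCBI root taxid (1).

```python
from typing import Optional, Set

ROOT_TAXID = 1  # NCBI taxonomy root


def taxids_for_lca(
    qseqid_taxids: Set[int], sentinel: int = ROOT_TAXID
) -> Optional[Set[int]]:
    """Return the query's taxids without the sentinel, or None if nothing is left.

    Hypothetical helper: a None result means the query should be skipped
    rather than passed to the LCA computation.
    """
    cleaned = set(qseqid_taxids)  # copy so the caller's set is not mutated
    cleaned.discard(sentinel)
    return cleaned or None


mixed = taxids_for_lca({1, 562, 1280})
only_root = taxids_for_lca({1})
```

Note that if missing accessions have already been assigned root upstream, discarding root here cannot distinguish them from hits that genuinely map to root, which is one argument for marking them None at parse time instead.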