Open chasemc opened 1 year ago
This could be handled in more than one place: either when parsing taxids for NCBI, or at the LCA step. Any preference?
# discard root taxid from set of query's taxids before #L370
qseqid_taxids.discard(root_taxid)
# or something like this?
from autometa.taxonomy.database import TaxonomyDatabase
qseqid_taxids.discard(TaxonomyDatabase.UNCLASSIFIED_TAXID)
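For the LCA-step option, a minimal standalone sketch of the idea might look like the following. It does not use Autometa's own classes: `taxids_without_root` is a hypothetical helper, and `ROOT_TAXID = 1` assumes the NCBI taxonomy root.

```python
from typing import Iterable, Set

# Assumption: NCBI taxonomy root has taxid 1.
ROOT_TAXID = 1


def taxids_without_root(taxids: Iterable[int], root_taxid: int = ROOT_TAXID) -> Set[int]:
    """Return a copy of `taxids` with the root taxid removed (hypothetical helper)."""
    filtered = set(taxids)
    # set.discard is a no-op when the element is absent, so no membership check is needed.
    filtered.discard(root_taxid)
    return filtered


# A query whose hits include a root placeholder alongside real taxids.
filtered_taxids = taxids_without_root({1, 561, 562})
```

Note that a query whose hits all map to root ends up with an empty set, so whatever calls LCA would need to decide how to treat that case.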
Currently the documentation instructs users to download, and the code downloads, prot.accession2taxid.gz, which doesn't have all of the nr accessions. Proteins that aren't found in prot.accession2taxid.gz are assigned to root, which results in contigs becoming unclassified. Currently this is ameliorated by using prot.accession2taxid.FULL.gz instead of prot.accession2taxid.gz. But the code needs to be changed to handle missing accessions. Per our meeting today, these should probably be assigned to None and then dropped before handing over to LCA.

Assignment to root that needs to be changed: https://github.com/KwanLab/Autometa/blob/baf61c04dddf5b33bb825dba2841de1e38dffefe/autometa/taxonomy/ncbi.py#L453-L457
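The "assign to None, then drop before LCA" proposal could be sketched roughly as below. This is only an illustration under stated assumptions, not the code in ncbi.py: `lookup_taxids` and `drop_missing` are hypothetical names, the accession-to-taxid table is modelled as a plain dict, and the accession strings are made up.

```python
from typing import Dict, Iterable, Optional, Set


def lookup_taxids(
    accessions: Iterable[str], acc2taxid: Dict[str, int]
) -> Dict[str, Optional[int]]:
    """Map each accession to its taxid, or to None when it is missing (hypothetical)."""
    return {acc: acc2taxid.get(acc) for acc in accessions}


def drop_missing(
    hits: Dict[str, Iterable[str]], acc2taxid: Dict[str, int]
) -> Dict[str, Set[int]]:
    """Keep only resolved taxids per query; skip queries with none left (hypothetical)."""
    resolved: Dict[str, Set[int]] = {}
    for qseqid, accessions in hits.items():
        found = lookup_taxids(accessions, acc2taxid).values()
        taxids = {taxid for taxid in found if taxid is not None}
        if taxids:  # queries with no resolvable accession are not passed to LCA
            resolved[qseqid] = taxids
    return resolved


# Made-up accessions and a toy lookup table for illustration only.
acc2taxid = {"ACC_A": 561, "ACC_B": 562}
hits = {"orf_1": ["ACC_A", "ACC_B", "ACC_MISSING"], "orf_2": ["ACC_MISSING"]}
lca_input = drop_missing(hits, acc2taxid)
```

With this shape, a missing accession never contributes a root taxid to the LCA input, so it cannot pull an otherwise well-supported query up to root.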