MGXlab / CAT_pack

CAT/BAT/RAT: tools for taxonomic classification of contigs and metagenome-assembled genomes (MAGs) and for taxonomic profiling of metagenomes
MIT License

No support below kingdom level for known E. coli contigs #92

Open mpgriesh opened 1 year ago

mpgriesh commented 1 year ago

Hi, I am running CAT to annotate contigs from metagenomes that are known to be E. coli by several other approaches. For example, they share greater than 99% average nucleotide identity with reference E. coli genomes, and BLASTx also identifies the vast majority of ORFs as E. coli. CAT does identify several other contigs in the metagenome with species-level support, and the db and taxonomy folders from CAT prepare seem to be correct, as no errors are encountered.

Despite this, across many samples, of 239 contigs expected to be E. coli, 9 are classified at the family level (Enterobacteriaceae) and the rest are classified as Bacteria.

When I look at the ORFs themselves after running add_names, I see something similar: the vast majority of individual ORFs receive no support below Bacteria.

I tried this with a lab E. coli reference genome as a single "contig" and it is classified as Bacteria.

I am using CAT v5.2.3 and the original version of the DB from 2021-01-07 as described in the repo. I do see several strains and species of E. coli in names.dmp, and their taxids are present in nodes.dmp.

bastiaanvonmeijenfeldt commented 10 months ago

Dear @mpgriesh,

Thanks for this report! It's something that we have noticed ourselves as well, and I think it stems from the fact that many E. coli sequences are misannotated in NCBI nr, as human for example. It may also be that foreign vectors within lab E. coli strains are annotated as E. coli... There is currently no easy solution for this that we can implement, except cleaning nr ourselves (we're thinking about how to do this automatically).

For now, could you try one of the latest databases (see https://tbb.bio.uu.nl/tina/CAT_prepare/)? NCBI is removing misannotations, so newer databases may be better, but then again they may also contain more misannotations. A more viable alternative may be to use the GTDB database instead of NCBI nr (we have implemented it in the latest version of CAT). GTDB does not have the nr misannotations, but of course it does have a smaller search space. You can also find the latest GTDB database formatted for CAT on https://tbb.bio.uu.nl/tina/CAT_prepare/.

Do keep me updated!

Best wishes,

Bastiaan

zxl124 commented 5 months ago

Hi, first of all, thanks for the awesome tool. I am using it in my metagenome/transcriptome pipeline, but I am having a similar problem. In one example, there was a viral contig whose proteins have hundreds of hits against the correct virus sequences in NR, but there was also one hit to an artificial construct whose fastaid2LCAtaxid is 1. This ruins the taxonomy assignment of that entire contig. The same happens for other contigs. I imagine other entries in the NR database, such as artificial sequences or metagenome sequences, would ruin taxonomy assignment in the same way.
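To make the failure mode concrete, here is a minimal toy sketch of a strict LCA over hit lineages. It is not CAT's actual code: the `lca` helper and all taxids except the root (1) are made up for illustration, and lineages are assumed to be listed root-first.

```python
def lca(lineages):
    """Return the deepest taxid shared by ALL lineages (each listed root-first)."""
    shared = None
    # zip stops at the shortest lineage, so a root-only hit caps the result at root.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared = ranks[0]
    return shared


# Hypothetical lineages: three hits to closely related viruses ...
virus_hits = [
    [1, 10, 100, 1000],
    [1, 10, 100, 1000],
    [1, 10, 100, 1001],
]
# ... and one hit to an artificial construct mapped to taxid 1 (root).
construct_hit = [1]

lca_without_construct = lca(virus_hits)
lca_with_construct = lca(virus_hits + [construct_hit])

print(lca_without_construct)  # 100
print(lca_with_construct)     # 1
```

A single root-level hit is enough to pull the ORF's LCA up to root, no matter how many hits agree on a deeper taxon.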

I know GTDB is curated and a potential solution, but it doesn't have the viruses and eukaryotes that I need. I think there may be two potential solutions.

  1. Create a blacklist of tax IDs that the find_LCA_for_ORF function would just ignore. This is easier to do, and maybe it's something you are already considering. In addition to tax ID 1, is there anything else you think can be ignored?
  2. Use a voting approach for find_LCA_for_ORF. Just as CAT assigns taxonomy to a contig by voting over the taxonomy of all its ORFs, you could assign taxonomy to an ORF based on the consensus of all its hits, using a fraction option similar to -f. In this case, the small number of "poisonous" proteins in the database wouldn't matter. This could make the program a lot slower, though.
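Roughly, the two options could look like the sketch below. This is only an illustration under my own assumptions, not CAT's implementation: `filter_hits`, `voted_taxid`, `ignored_taxids` and `fraction` are hypothetical names, lineages are listed root-first, every hit counts equally (no bit-score weighting), and all taxids except the root (1) are made up.

```python
from collections import Counter


def filter_hits(lineages, ignored_taxids=frozenset({1})):
    """Option 1: drop hits whose own taxid (last lineage element) is ignored."""
    return [lin for lin in lineages if lin and lin[-1] not in ignored_taxids]


def voted_taxid(lineages, fraction=0.5):
    """Option 2: deepest taxid supported by more than `fraction` of the hits."""
    if not lineages:
        return None
    counts = Counter()
    for lineage in lineages:
        for depth, taxid in enumerate(lineage):
            counts[(depth, taxid)] += 1
    threshold = fraction * len(lineages)
    supported = [key for key, n in counts.items() if n > threshold]
    if not supported:
        return None
    # With fraction >= 0.5 at most one taxid per depth can pass the threshold,
    # so the maximum (depth, taxid) pair is the deepest supported taxon.
    return max(supported)[1]


# Hypothetical hits: three viral hits plus one construct mapped to root.
hits = [
    [1, 10, 100, 1000],
    [1, 10, 100, 1000],
    [1, 10, 100, 1001],
    [1],
]

kept = filter_hits(hits)
print(len(kept))          # 3
print(voted_taxid(hits))  # 100
```

In this toy case both options recover taxid 100 instead of root. The voting version does one extra pass over every hit lineage per ORF, which is where the slowdown would come from.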