Open mpgriesh opened 1 year ago
Dear @mpgriesh,
Thanks for this report! It's something that we have noticed ourselves as well, and I think it stems from the fact that many E. coli sequences are misannotated in NCBI nr, as human for example. It may also be that foreign vectors within lab E. coli strains are annotated as E. coli... There is currently no easy solution for this that we can implement, except cleaning nr ourselves (we're thinking about how to do this automatically).
For now, could you try one of the latest databases (see https://tbb.bio.uu.nl/tina/CAT_prepare/)? NCBI is removing misannotations, so newer databases may be better, though they may also contain new misannotations. A more viable alternative may be to use the GTDB database instead of NCBI nr (we have implemented it in the latest version of CAT). GTDB does not have the nr misannotations, but of course it does have a smaller search space. You can also find the latest GTDB database formatted for CAT on https://tbb.bio.uu.nl/tina/CAT_prepare/.
Do keep me updated!
Best wishes,
Bastiaan
Hi, first of all, thanks for the awesome tool. I am using it in my metagenome/transcriptome pipeline, but I am having similar problems. In one example, there was a viral contig whose proteins have hundreds of hits against the correct virus sequences in NR, but there was also one artificial construct whose fastaid2LCAtaxid is 1. This ruins the taxonomy assignment of that entire contig. This also happens for other contigs. I imagine that similar entries in the NR database, like artificial sequences or metagenome sequences, would ruin taxonomy assignment in the same way.
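To illustrate why a single root-level hit can have this effect, here is a toy sketch of a plain lowest-common-ancestor calculation over hit lineages. It is not CAT's actual algorithm (CAT weighs hits by bit-score and applies its own thresholds), and the taxon names are hypothetical placeholders, but it shows how one hit that maps to root (taxid 1) can pull an otherwise consistent assignment all the way up.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages."""
    common = list(lineages[0])
    for lineage in lineages[1:]:
        shared = []
        for a, b in zip(common, lineage):
            if a != b:
                break
            shared.append(a)
        common = shared
    return common[-1] if common else None


# Hypothetical lineages, written from root to leaf.
virus_hits = [["root", "Viruses", "FamilyX", "VirusY"]] * 3
construct_hit = ["root"]  # an entry whose mapped taxid is 1 (root)

print(lowest_common_ancestor(virus_hits))
print(lowest_common_ancestor(virus_hits + [construct_hit]))
```

With only the viral hits the result is the virus itself; adding the single root-level hit moves the result to root.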
I know GTDB is curated and a potential solution, but it doesn't have the viruses and eukaryotes that I need. I think there may be two potential solutions.
-f. In this case, the smaller number of "poisonous" proteins in the database wouldn't matter. This could make the program a lot slower though.
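To get a feeling for how many such "poisonous" entries a database contains, one could count the accessions that map to root. The sketch below assumes that the fastaid2LCAtaxid file is tab-separated with the accession in the first column and the taxid in the second, without a header line; please check this against your own file. The accessions in the example are made up.

```python
import io

ROOT_TAXID = "1"


def count_root_mapped(lines):
    """Count mapping entries and collect accessions whose taxid is root."""
    total = 0
    root_accessions = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumed layout: accession <TAB> taxid
        accession, taxid = line.split("\t")[:2]
        total += 1
        if taxid == ROOT_TAXID:
            root_accessions.append(accession)
    return total, root_accessions


# In-memory stand-in for the real file; the accessions are hypothetical.
example = io.StringIO("ACC_0001\t10239\nACC_0002\t1\nACC_0003\t562\n")
total, at_root = count_root_mapped(example)
print(f"{len(at_root)} of {total} accessions map to root: {at_root}")
```

For a real database, pass an open file handle instead of the `io.StringIO` object.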
Hi, I am running CAT to annotate contigs from metagenomes that are known to be E. coli by several other approaches. For example, greater than 99% average nucleotide identity with reference E. coli genomes and BLASTx also identifies the vast majority of ORFs as E. coli. For me, CAT identifies several other contigs with species support in the metagenome, and the db and taxonomy folders from CAT prepare seem to be correct as no errors are encountered.
Despite this, across many samples, of 239 contigs expected to be E. coli, 9 are classified at the family level (Enterobacteriaceae) and the rest are classified as Bacteria.
When I look at the ORFs themselves following add_names, I see something similar as the vast majority of individual ORFs receive no support below Bacteria.
I tried this with a lab E. coli reference genome as a single "contig" and it is classified as Bacteria.
I am using CAT v5.2.3 and the original version of the DB from 2021-01-07 as described in the repo. I do see several strains and species of E. coli in names.dmp, and their taxids are present in nodes.dmp.
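For reference, the names.dmp / nodes.dmp check can be sketched roughly as follows. It relies on the standard NCBI taxdump layout (fields separated by tab, pipe, tab). The helper names are my own, and the example data is a truncated toy lineage rather than the full NCBI one.

```python
def parse_dmp_line(line):
    """Split one NCBI taxdump line into its fields."""
    return line.rstrip("\t|\n").split("\t|\t")


def find_taxids(names_lines, name):
    """Return taxids whose scientific name equals `name`."""
    taxids = set()
    for line in names_lines:
        fields = parse_dmp_line(line)
        if fields[1] == name and fields[3] == "scientific name":
            taxids.add(fields[0])
    return taxids


def load_parents(nodes_lines):
    """Map each taxid to (parent taxid, rank)."""
    parents = {}
    for line in nodes_lines:
        fields = parse_dmp_line(line)
        parents[fields[0]] = (fields[1], fields[2])
    return parents


def lineage(taxid, parents):
    """Walk from `taxid` up to root and return the taxids visited."""
    path = [taxid]
    while taxid in parents and parents[taxid][0] != taxid:
        taxid = parents[taxid][0]
        path.append(taxid)
    return path


# Toy data in taxdump format; the lineage is truncated for illustration.
names = [
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "543\t|\tEnterobacteriaceae\t|\t\t|\tscientific name\t|\n",
    "561\t|\tEscherichia\t|\t\t|\tscientific name\t|\n",
    "562\t|\tEscherichia coli\t|\t\t|\tscientific name\t|\n",
]
nodes = [
    "1\t|\t1\t|\tno rank\t|\n",
    "543\t|\t1\t|\tfamily\t|\n",
    "561\t|\t543\t|\tgenus\t|\n",
    "562\t|\t561\t|\tspecies\t|\n",
]

parents = load_parents(nodes)
for taxid in find_taxids(names, "Escherichia coli"):
    print(taxid, taxid in parents, lineage(taxid, parents))
```

With the real files, replace the two lists with open file handles for names.dmp and nodes.dmp from the taxonomy folder.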