genus & phylum taxonomic affiliation do not correspond

guifftc commented 7 years ago

Dear developers,

blobtools view -i refseq.blobDB.json

The command outputs a tabular file; column 6 reports the phylum-level taxonomic affiliation, based on the bitscore bestsum. When I change the rank with the "-r genus" option, some sequences get a conflicting affiliation: not simply undef, but a bacterial taxon at phylum level and a eukaryotic taxon at genus level.

The hit file was generated with DIAMOND against RefSeq, using an e-value of 1e-10 and the 10 best hits.

Thanks for your time and effort.

DRL commented 7 years ago

Hi guifftc,

it can happen that a sequence receives taxonomic annotations at different ranks that do not correspond to each other.

Imagine you have a human sequence that hits 3 subjects:

pig with 300 bitscore (Eukaryota, Chordata, Sus)
worm with 200 bitscore (Eukaryota, Nematoda, Caenorhabditis)
E. coli (contaminated with gorilla) with 400 bitscore (Bacteria, Proteobacteria, Escherichia)

This would result in:

Kingdom: Eukaryota (500)
Phylum: Proteobacteria (400)
Genus: Escherichia (400)

I hope that example helps illustrate why that happens. "undef" occurs when a subject doesn't have a particular taxonomic rank (which happens more often than one thinks).
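
To make the arithmetic in the example explicit, here is a minimal Python sketch of a bestsum-style tally. It is not blobtools' actual code: the function name `bestsum`, the `HITS` layout and the lowercase rank keys are assumptions made for illustration. It only shows that summing bitscores per taxon, independently at each rank, can pick a eukaryote at one rank and a bacterium at another.

```python
from collections import defaultdict

# Hits from the example above: (bitscore, lineage).
# The data layout and rank keys are illustrative, not blobtools' internal format.
HITS = [
    (300, {"kingdom": "Eukaryota", "phylum": "Chordata", "genus": "Sus"}),
    (200, {"kingdom": "Eukaryota", "phylum": "Nematoda", "genus": "Caenorhabditis"}),
    (400, {"kingdom": "Bacteria", "phylum": "Proteobacteria", "genus": "Escherichia"}),
]


def bestsum(hits, rank):
    """Sum bitscores per taxon at one rank; return (taxon, score) with the highest sum."""
    totals = defaultdict(float)
    for bitscore, lineage in hits:
        # A subject that lacks this rank is tallied under "undef".
        totals[lineage.get(rank, "undef")] += bitscore
    return max(totals.items(), key=lambda item: item[1])


for rank in ("kingdom", "phylum", "genus"):
    taxon, score = bestsum(HITS, rank)
    print(f"{rank}: {taxon} ({score:.0f})")
```

Each rank is decided on its own, so the two eukaryotic hits pool their scores at kingdom level (300 + 200) but compete separately at phylum and genus level, where the single 400-bitscore hit wins.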

If this is your problem and you are not happy with the result, I recommend you run "blobtools view -i refseq.blobDB.json --rank all --hits" to see why blobtools decided upon a particular taxonomy.

You could also play with the blobtools create parameter:

-m, --min_diff <FLOAT>          Minimal score difference between highest scoring
                                 taxonomies (otherwise "unresolved") [default: 0.0]
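
As a rough illustration of what such a threshold does, here is a minimal Python sketch based only on the help text above. The function name `assign_with_min_diff` is hypothetical, and the exact comparison blobtools uses (for example how ties are treated) is an assumption here. The scores are the phylum sums from the pig/worm/E. coli example.

```python
def assign_with_min_diff(totals, min_diff=0.0):
    """Return the top-scoring taxon, or "unresolved" if its lead is below min_diff.

    `totals` maps taxon name -> summed score and is assumed to be non-empty.
    Using a strict "<" comparison is an assumption, not confirmed blobtools behaviour.
    """
    ranked = sorted(totals.values(), reverse=True)
    if len(ranked) > 1 and ranked[0] - ranked[1] < min_diff:
        return "unresolved"
    return max(totals, key=totals.get)


# Phylum-level sums from the example: Proteobacteria leads Chordata by 100.
phylum_totals = {"Chordata": 300.0, "Nematoda": 200.0, "Proteobacteria": 400.0}

print(assign_with_min_diff(phylum_totals))         # lead of 100 passes the default 0.0
print(assign_with_min_diff(phylum_totals, 150.0))  # lead of 100 is below 150
```

With a threshold of 150 the sequence in the example would be reported as "unresolved" at phylum level instead of Proteobacteria.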

Please let me know if this was of help.

Cheers,

dom

guifftc commented 7 years ago

Thanks for the fast and clear reply,

I didn't realize that --rank all was so informative. If I understood correctly, a higher -m value, e.g. 10 or 50, increases the number of unresolved sequences and so reduces taxonomic affiliation errors. Do you have a suggestion for this value? Otherwise, I guess reporting the distribution of the difference between the two best bitscores across the whole assembly could help.

DRL commented 7 years ago

Hi,

sorry for not replying earlier. Sensible values for -m ultimately depend on your organism, on the type of sequence similarity search result you supply, and on the database it was run against. But yes, it will decrease the number of false positives.
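
One way to explore candidate values is to look at the gap between the two highest-scoring taxonomies for every sequence, as suggested earlier in the thread. The Python sketch below assumes you have already extracted per-sequence score sums yourself; the contig names and numbers are made up for illustration, and `top_two_margin` is a hypothetical helper, not part of blobtools.

```python
def top_two_margin(totals):
    """Difference between the best and second-best summed score (0.0 if no scores)."""
    ranked = sorted(totals.values(), reverse=True)
    if not ranked:
        return 0.0
    if len(ranked) == 1:
        return ranked[0]  # no competitor, so the lead is the full score
    return ranked[0] - ranked[1]


# Hypothetical per-sequence score sums at one rank (made-up data).
per_sequence_totals = {
    "contig_1": {"Chordata": 300.0, "Nematoda": 200.0, "Proteobacteria": 400.0},
    "contig_2": {"Nematoda": 950.0, "Proteobacteria": 40.0},
    "contig_3": {"Arthropoda": 120.0, "Nematoda": 115.0},
}

margins = {name: top_two_margin(t) for name, t in per_sequence_totals.items()}

# Count how many sequences would fall below each candidate threshold.
for threshold in (10, 50, 200):
    below = sum(1 for margin in margins.values() if margin < threshold)
    print(f"threshold {threshold}: {below} of {len(margins)} sequences below")
```

Plotting or tabulating these margins for a whole assembly shows how many assignments rest on a narrow lead, which can guide the choice of threshold for a given organism and database.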

DRL / blobtools

genus & phylum taxonomic affiliation do not correspond #51