Closed guifftc closed 7 years ago
Hi guifftc,
it can happen that a sequence receives taxonomic annotations at different ranks that do not correspond to each other.
Imagine this case. You have a human sequence that hits 3 subjects:
This would result in:
I hope that example helps illustrate why that happens. "undef" occurs when a subject doesn't have a particular taxonomic rank (which happens more often than one thinks).
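As a rough sketch of the mechanism described above, the snippet below sums bitscores per taxon at each rank and picks the highest total. The taxa and bitscores are invented for illustration (they are not the values from the example in this thread), and `bestsum` is a simplified stand-in for the rule, not blobtools' own code.

```python
from collections import defaultdict

# Hypothetical hits for one human sequence: (bitscore, lineage per rank).
# All names and scores are made up for illustration.
hits = [
    (300.0, {"phylum": "Chordata", "genus": "Homo"}),
    (200.0, {"phylum": "Proteobacteria", "genus": "Escherichia"}),
    (200.0, {"phylum": "Proteobacteria", "genus": "Salmonella"}),
]


def bestsum(hits, rank):
    """Sum bitscores per taxon at the given rank and return the top taxon."""
    totals = defaultdict(float)
    for bitscore, lineage in hits:
        # A subject lacking this rank is counted under "undef".
        totals[lineage.get(rank, "undef")] += bitscore
    return max(totals, key=totals.get)


print(bestsum(hits, "phylum"))  # Proteobacteria (200 + 200 beats 300)
print(bestsum(hits, "genus"))   # Homo (300 beats 200 and 200)
```

The two bacterial hits pool their scores at phylum rank but split across two genera, so the single human hit wins at genus rank.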
If this is your problem and you are not happy, I recommend you run:
blobtools view -i refseq.blobDB.json --rank all --hits
to see why blobtools decided upon a particular taxonomy.
You could also play with the blobtools create parameter:
-m, --min_diff <FLOAT> Minimal score difference between highest scoring
taxonomies (otherwise "unresolved") [default: 0.0]
Please let me know if this was of help.
Cheers,
dom
Thanks for the fast and clear reply,
I didn't realize that --rank all was so informative. If I understood properly, setting the -m option to e.g. 10 or 50 increases the number of unresolved sequences, and so reduces taxonomic affiliation errors. Do you have a suggestion for this value? Otherwise, I guess reporting the distribution of differences between the top two bitscores across the whole assembly could help.
Hi,
sorry for not replying earlier. Sensible values for -m ultimately depend on your organism, the type of sequence similarity search result you supply, and the database it was run against. But yes, it will decrease the number of false positives.
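To make the effect of -m concrete, here is a minimal sketch of a score-difference threshold, assuming per-taxon score totals are already available as a dictionary. The function names are hypothetical, and whether blobtools compares with "<" or "<=" is an assumption here; this is not its actual implementation.

```python
def top_two_gap(totals):
    """Difference between the two highest per-taxon scores (hypothetical helper)."""
    scores = sorted(totals.values(), reverse=True)
    if len(scores) < 2:
        return None  # only one taxon, so there is no gap to report
    return scores[0] - scores[1]


def assign(totals, min_diff=0.0):
    """Return the top-scoring taxon, or "unresolved" if the lead is too small.

    Assumption: a gap strictly below min_diff counts as unresolved.
    """
    gap = top_two_gap(totals)
    if gap is not None and gap < min_diff:
        return "unresolved"
    return max(totals, key=totals.get)


# Made-up totals for one sequence at phylum rank.
totals = {"Proteobacteria": 400.0, "Chordata": 300.0}

print(assign(totals))                 # Proteobacteria
print(assign(totals, min_diff=50))    # Proteobacteria (gap of 100 is enough)
print(assign(totals, min_diff=150))   # unresolved (gap of 100 is too small)
```

Collecting `top_two_gap` over all sequences of an assembly would give the kind of distribution suggested above for choosing a threshold.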
Dear developers,
blobtools view -i refseq.blobDB.json
The command outputs a tabular file. Column 6 reports the taxonomic affiliation at phylum level, based on the bitscore bestsum. When I change the rank with the "-r genus" option, some sequences get a conflicting affiliation: not simply undef, but bacteria at phylum rank and eukaryote at genus rank.
The hit file was produced with diamond against refseq, with e-value 1-10 and 10 best hits.
Thanks for your time and effort.