emepyc / Blast2lca

Calculates the lowest common ancestors of each query sequence in a Blast result
GNU General Public License v2.0

Results comprise mostly 'unknown' values #3

Open bede opened 9 years ago

bede commented 9 years ago

I would love to get Blast2lca working as it claims to do exactly what I need, and seems fast and nicely put together. However, the results I'm getting from it contain mostly unknown values, or are from the wrong kingdom! Perhaps I'm doing something horribly wrong...

I'm using BLAST+ 2.2.31 and running nucleotide queries with blastn against the NCBI Virus RefSeq nt database. I believe that all of the GIs used in this DB are contained within the master taxonomy – do tell me if I am mistaken.

Command:

blast2lca -savemem -dict gi_taxid_nucl.dmp -nodes taxdump/nodes.dmp -names taxdump/names.dmp -nprocs 12 -levels=superkingdom bat_fa/30.inc_orphans.clean.dedup.blast.txt > 30.lca

Example BLAST+ 2.2.31 output for one particular sequence (which gives an unknown LCA). I'm including this because I know the NCBI sometimes changes the format by accident.

M03615:8:000000000-AFM3J:1:1102:19068:15348/1   gi|9629357|ref|NC_001802.1| 96.67   180 6   0   1   180 2824    2645    3e-80   298
M03615:8:000000000-AFM3J:1:1102:19068:15348/1   gi|9626701|ref|NC_001482.1| 70.19   161 46  2   5   164 3058    2899    1e-10   68.0
M03615:8:000000000-AFM3J:1:1102:19068:15348/1   gi|27311166|ref|NC_004455.1|    72.84   81  22  0   4   84  3276    3196    9e-05   48.2
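To check that the subject IDs above are in the shape a GI-based lookup expects, the GI numbers can be pulled out of the second column. This is only a minimal sketch: it assumes the 12-column tabular layout shown above and subject IDs of the form `gi|<number>|ref|<accession>|`. The function name `extract_gis` is made up for illustration and is not part of Blast2lca.

```python
def extract_gis(blast_lines):
    """Collect GI numbers from the subject column of BLAST tabular lines."""
    gis = set()
    for line in blast_lines:
        fields = line.split()  # columns are separated by whitespace
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        parts = fields[1].split("|")
        # Expected subject ID shape: gi|9629357|ref|NC_001802.1|
        if len(parts) >= 2 and parts[0] == "gi" and parts[1].isdigit():
            gis.add(int(parts[1]))
    return gis


# The three hits quoted above, copied verbatim.
sample = [
    "M03615:8:000000000-AFM3J:1:1102:19068:15348/1   gi|9629357|ref|NC_001802.1| 96.67   180 6   0   1   180 2824    2645    3e-80   298",
    "M03615:8:000000000-AFM3J:1:1102:19068:15348/1   gi|9626701|ref|NC_001482.1| 70.19   161 46  2   5   164 3058    2899    1e-10   68.0",
    "M03615:8:000000000-AFM3J:1:1102:19068:15348/1   gi|27311166|ref|NC_004455.1|    72.84   81  22  0   4   84  3276    3196    9e-05   48.2",
]

print(sorted(extract_gis(sample)))  # prints [9626701, 9629357, 27311166]
```

Because the function accepts any iterable of lines, an open file handle for the full BLAST output could be passed in place of `sample`.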

However, more interestingly still, I get completely wrong results for a small fraction of results which are not unknown. Blast2lca output:

M03615:9:000000000-ADYY2:1:1106:23915:22782/2   Arthrobacter sp. MU2A-20    species Bacteria

...Yet the BLAST results are from a search against a viral subset of RefSeq. The GI of the sole blast result points to a hepatitis virus!
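One way to narrow this down would be to check whether the GIs from the BLAST hits appear in `gi_taxid_nucl.dmp` at all, and which taxid each one maps to. The sketch below assumes the dump has two whitespace-separated columns (GI, then taxid). The helper `find_missing_gis` is hypothetical, and the mapping lines in the example are placeholders rather than real entries from the dump.

```python
def find_missing_gis(gis, dict_lines):
    """Split the wanted GIs into those found in the mapping and those absent.

    Returns (found, missing), where found maps GI -> taxid.
    """
    wanted = set(gis)
    found = {}
    for line in dict_lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        gi = int(fields[0])
        if gi in wanted:
            found[gi] = int(fields[1])
    missing = wanted - set(found)
    return found, missing


# Placeholder mapping lines: the taxid values are NOT real, only illustrative.
toy_dict = ["9629357\t100\n", "5\t200\n"]

found, missing = find_missing_gis({9629357, 27311166}, toy_dict)
print("found:", found)
print("missing:", sorted(missing))
```

On real data, `dict_lines` could be the open `gi_taxid_nucl.dmp` file handle, so the file is streamed rather than loaded into memory. A GI that lands in `missing` would be one candidate explanation for an unknown result, while a GI that maps to an unexpected taxid would point at the mapping rather than at BLAST.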

Any wisdom regarding what might possibly be going on here would be gratefully received!

Thank you.

martin-steinegger commented 6 years ago

Might be solved by this pull request: https://github.com/emepyc/Blast2lca/pull/5