emepyc / Blast2lca

Calculates the lowest common ancestors of each query sequence in a Blast result
GNU General Public License v2.0

blastn -outfmt 6 #1

Closed nick-youngblut closed 10 years ago

nick-youngblut commented 10 years ago

I've run blast2lca with identical parameters on two different blast files produced from the same query nucleotide sequences: the first is blastall -m 8 output; the second is blastn -outfmt 6 output. I only get warnings when calling blast2lca with the -outfmt 6 blast output file. An example warning:

"""
2014/03/04 13:58:03 WARNING: Ignoring blast line: M02465:2:000000000-A5D51:1:1101:16673:1363/1 gi|407879691|emb|HE804045.1| 84.43 122 19 0 13 134 905808 905687 2e-24 121
"""

Can you tell me what is going on here?

Thanks. Nick

emepyc commented 10 years ago

If you have used identical parameters and the same query sequences, the output should be about the same, right? Even if there are differences, the format should be identical (both blastall -m 8 and blast+ -outfmt 6 should give you the same columns by default). Can you confirm this? Can you post some lines from each of the outputs here? Thanks
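For reference, the default columns shared by both formats are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. A minimal Go sketch of checking a line against that layout follows. It assumes tab-separated input, and `splitBlastLine` is a hypothetical helper written for illustration, not a function from blast2lca:

```go
package main

import (
	"fmt"
	"strings"
)

// Default column names for blastall -m 8 and blastn -outfmt 6.
var blastTabColumns = []string{
	"qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
	"qstart", "qend", "sstart", "send", "evalue", "bitscore",
}

// splitBlastLine is a hypothetical helper (not part of blast2lca).
// It splits a tab-separated line and checks that it has 12 columns.
func splitBlastLine(line string) ([]string, error) {
	fields := strings.Split(strings.TrimRight(line, "\r\n"), "\t")
	if len(fields) != len(blastTabColumns) {
		return nil, fmt.Errorf("expected %d columns, got %d",
			len(blastTabColumns), len(fields))
	}
	return fields, nil
}

func main() {
	// The line from the warning above, assuming tabs between fields.
	line := "M02465:2:000000000-A5D51:1:1101:16673:1363/1\t" +
		"gi|407879691|emb|HE804045.1|\t84.43\t122\t19\t0\t13\t134\t" +
		"905808\t905687\t2e-24\t121"

	fields, err := splitBlastLine(line)
	if err != nil {
		fmt.Println("bad line:", err)
		return
	}
	// Print each column name next to its value.
	for i, name := range blastTabColumns {
		fmt.Printf("%-8s %s\n", name, fields[i])
	}
}
```

Running this on a few lines from each file is a quick way to confirm that both outputs really have the same column count.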

nick-youngblut commented 10 years ago

The output files appear to have the same format, but the number of hits returned with default settings differs between 'blastall -p blastn' and 'blastn'. I tried looking at your code to figure out what was causing the warnings (I believe they come from line 205 in blastm8.go), but couldn't determine why your blast parser is throwing errors. I'm not getting the parseBlast errors. Here are the top lines of the blast input that I'm using:

==> blastall -p blastn -m 8 <==
M02465:2:000000000-A5D51:1:1101:13618:1497/1 gi|120604516|gb|CP000539.1| 95.83 24 1 0 110 133 2053097 2053120 7.8 40.1
M02465:2:000000000-A5D51:1:1101:13618:1497/1 gi|149465826|gb|AC205647.1| 92.86 28 2 0 88 115 72190 72163 7.8 40.1
M02465:2:000000000-A5D51:1:1101:13618:1497/1 gi|152026452|gb|CP000769.1| 100.00 22 0 0 54 75 4657772 4657793 0.50 44.1
M02465:2:000000000-A5D51:1:1101:13618:1497/1 gi|164423671|ref|XM_957612.2| 100.00 20 0 0 108 127 21 195 7.8 40.1
M02465:2:000000000-A5D51:1:1101:13618:1497/1 gi|170937689|emb|CU633749.1| 100.00 20 0 0 104 123 2939194 2939175 7.8 40.1
M02465:2:000000000-A5D51:1:1101:13618:1497/1 gi|190694918|gb|CP001074.1| 95.83 24 1 0 39 62 3069888 3069865 7.8 40.1
M02465:2:000000000-A5D51:1:1101:13618:1497/1 gi|199580146|gb|AC189450.2| 100.00 20 0 0 102 121 83673 83654 7.8 40.1
M02465:2:000000000-A5D51:1:1101:13618:1497/1 gi|219544946|gb|CP001338.1| 100.00 20 0 0 68 87 515019 515000 7.8 40.1
M02465:2:000000000-A5D51:1:1101:13618:1497/1 gi|221728669|gb|CP001392.1| 95.83 24 1 0 110 133 1870059 1870082 7.8 40.1
M02465:2:000000000-A5D51:1:1101:13618:1497/1 gi|226088597|dbj|AP009153.1| 90.62 32 3 0 160 191 1311140 1311109 7.8 40.1

==> blastn -outfmt 6 <==
M02465:2:000000000-A5D51:1:1101:14467:1474/1 gi|119534933|gb|CP000509.1| 83.81 210 30 4 26 233 1910472 1910679 6e-47 196
M02465:2:000000000-A5D51:1:1101:14467:1474/1 gi|283807292|gb|CP001736.1| 84.30 223 27 8 1 219 5382557 5382339 2e-51 211
M02465:2:000000000-A5D51:1:1101:14467:1474/1 gi|556031042|gb|CP006272.1| 91.36 162 12 2 73 233 8127386 8127226 4e-54 220
M02465:2:000000000-A5D51:1:1101:14851:1373/1 gi|161158851|emb|AM746676.1| 91.62 167 13 1 1 166 9726795 9726961 4e-57 230
M02465:2:000000000-A5D51:1:1101:14851:1373/1 gi|520999024|gb|CP003969.1| 91.67 168 11 3 1 167 11526749 11526914 4e-57 230
M02465:2:000000000-A5D51:1:1101:15007:1502/1 gi|110808925|gb|DQ823200.1| 96.15 234 9 0 1 234 1346 1113 4e-103 383
M02465:2:000000000-A5D51:1:1101:15007:1502/1 gi|117580706|gb|DQ906785.1| 96.10 231 9 0 4 234 1352 1122 2e-101 377
M02465:2:000000000-A5D51:1:1101:15007:1502/1 gi|117580719|gb|DQ906798.1| 97.40 231 6 0 4 234 1354 1124 2e-106 394
M02465:2:000000000-A5D51:1:1101:15007:1502/1 gi|117580775|gb|DQ906854.1| 95.30 234 11 0 1 234 1357 1124 9e-100 372
M02465:2:000000000-A5D51:1:1101:15007:1502/1 gi|119394484|gb|EF133415.1| 97.44 234 6 0 1 234 30 73 4e-108 399

All of the resulting lca classifications for '-outfmt 6' are 'unknown' or Bacteria, while for '-m 8' they are classified down to the species level.

Nick

nick-youngblut commented 10 years ago

The only difference that I could find between the blast output formats is that the bitscore values in '-m 8' output can have 0-1 spaces before the number, while in '-outfmt 6' output they can have 0-2 spaces before the number. All of the lines that gave a warning in blast2lca have 2 spaces before the bitscore.

In the '-outfmt 6' file, I converted all double spaces to single spaces and ran it through blast2lca. With the edited file, there were no warnings and the taxonomic classifications were the same as with '-m 8'.

So it appears that the bug is double spaces in the bitscore field of the '-outfmt 6' formatted blast output.

emepyc commented 10 years ago

Thanks for looking into this. I'm fixing this now

emepyc commented 10 years ago

I have just pushed a fix. I don't have a -outfmt 6 blast here. Could you please test the fix and report back? Thanks again for your help in tracking that down

nick-youngblut commented 10 years ago

I'm getting what appears to be an error when I try to update blast2lca (command: 'go get -u github.com/emepyc/Blast2lca/blast2lca'):

"""

github.com/emepyc/Blast2lca/blastm8

/opt/go/gocode/src/github.com/emepyc/Blast2lca/blastm8/blastm8.go:227: no new variables on left side of :=

"""

Nick

emepyc commented 10 years ago

Indeed. Now fixed
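For readers who hit the same compiler message: Go reports "no new variables on left side of :=" when a short variable declaration names only variables that already exist in the current scope. The sketch below is a generic illustration with a hypothetical `parseWithRetry` function. It does not reproduce whatever was on line 227 of blastm8.go, which the thread does not show:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseWithRetry is a hypothetical function used only to show the rule.
func parseWithRetry(field string) (float64, error) {
	// First use: := declares both score and err.
	score, err := strconv.ParseFloat(field, 64)
	if err == nil {
		return score, nil
	}

	// Writing `score, err := ...` here would not compile, because
	// both names are already declared in this scope:
	//   no new variables on left side of :=
	// Plain assignment with = reuses the existing variables.
	score, err = strconv.ParseFloat(strings.TrimSpace(field), 64)
	return score, err
}

func main() {
	fmt.Println(parseWithRetry("  121"))
}
```

The usual fix is to replace `:=` with `=` on the second statement, or to introduce at least one new name on the left side.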

nick-youngblut commented 10 years ago

It appears to be working just fine now. Thanks for the update.

Nick

emepyc commented 10 years ago

Thanks