bioperl / bioperl-live-redmine

Legacy tickets migrated from the OBF Redmine issue tracker: http://redmine.open-bio.org

BioPerl(Bio::SearchIO) has different behaviours in parsing blast and blast+ result #115

Open cjfields opened 8 years ago

cjfields commented 8 years ago

Author Name: yi xianfu (yi xianfu) Original Redmine Issue: 3265, https://redmine.open-bio.org/issues/3265 Original Date: 2011-07-16 Original Assignee: Bioperl Guts


Hi Developers,

While using BioPerl (Bio::SearchIO) to parse BLAST results, I found that it can parse the legacy BLAST (2.2.25) output but not the BLAST+ (2.2.25+) output (both in the default text format).

The legacy BLAST output is parsed properly, but parsing the BLAST+ output prints nothing except the header that my script prints itself.
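
For readers following along, the kind of script being described can be sketched roughly as below. This is not the attached test script, whose contents are not shown in this ticket; the sub names `format_row` and `print_hits` are made up for illustration, and the column layout is simply assumed from the header line quoted later in the thread.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::SearchIO;

# Hypothetical helper: join one report row with tabs.
sub format_row {
    my @fields = @_;
    return join("\t", @fields) . "\n";
}

# Hypothetical helper: print one row per HSP for every query in a report.
# $format is a Bio::SearchIO format name such as 'blast' or 'blastxml'.
sub print_hits {
    my ($file, $format) = @_;
    my $in = Bio::SearchIO->new(-file => $file, -format => $format);
    print format_row('#Query', 'Hit', 'Score', 'Bits', 'Evalue');
    while (my $result = $in->next_result) {
        while (my $hit = $result->next_hit) {
            while (my $hsp = $hit->next_hsp) {
                print format_row($result->query_name, $hit->name,
                                 $hsp->score, $hsp->bits, $hsp->evalue);
            }
        }
    }
}

# Run only when a report file and a format are given on the command line.
print_hits(@ARGV) if @ARGV == 2;
```

Saved under a name of your choice, it would be run with a report file and a format, for example `perl parser.pl test_blastn.txt blast`. If only the header line appears, the parser returned no results for that file.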


In addition, BioPerl’s wiki (http://www.bioperl.org/wiki/HOWTO%3aSearchIO, section “NCBI-BLAST parsing problems”) recommends the XML format. But I found that BioPerl also has problems parsing BLAST and BLAST+ results in XML format.

Parsing the legacy BLAST XML output returns only the first query, although I have three queries. Parsing the BLAST+ XML output has the same problem. In addition, it cannot get the query IDs properly: they are all named “Query_1”.

BTW: The same bug in the XML format has already been reported. Bug #3154: SearchIO:blastxml does not return correct query name for blastp (2.2.24), https://redmine.open-bio.org/issues/3154

All test files are attached. The FASTA file used to build the database is too big to attach, so I give the download location instead of the file itself. Please read the “README” first.


System: Ubuntu 10.10. Versions: Perl 5.10.1; BioPerl 1.6.901 (not sure). BTW: This problem has also been asked on BioStar (http://biostar.stackexchange.com/questions/10295/bioperl-has-different-behaviours-in-parsing-blast-and-blast-result).

Thanks, yixf

cjfields commented 8 years ago

Original Redmine Comment Author Name: Jason Stajich Original Date: 2011-07-16T01:20:15Z


I’m not aware that anyone has volunteered to test and develop improvements to this module to handle any changes in BLAST+, so I’m not surprised that it isn’t working…

cjfields commented 8 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2011-08-09T17:51:54Z


The main problem with relying on parsing text BLAST output is that NCBI reserves the right to make changes at any point to the format, potentially breaking any parser. IIRC the biopython parser also has this problem, and the general consensus is to use something more reliable (e.g. XML output).

cjfields commented 8 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2011-08-09T17:59:57Z


The test results are a bit different for me when run locally using bioperl-live and perl 5.14.0 (both text and XML parsing). In general, the output of the old program (legacy BLAST) parses fine:

[cjfields@pyrimidine1 old]$ perl parser.pl test_blastall.txt blast
#Query  Hit Score   Bits    Evalue
first   ref|NM_001130955.1| 21  42.1    2e-04
first   ref|NM_015318.3|    21  42.1    2e-04
second  ref|NM_001166295.1| 21  42.1    2e-04
second  ref|NM_001166294.1| 21  42.1    2e-04
second  ref|NM_001166417.1| 21  42.1    2e-04
second  ref|NM_001166293.1| 21  42.1    2e-04
second  ref|NM_014021.3|    21  42.1    2e-04
third   ref|NM_007249.4|    21  42.1    2e-04
[cjfields@pyrimidine1 old]$ perl parser.pl test_blastall.xml blastxml
#Query  Hit Score   Bits    Evalue
first   gi|195972856|ref|NM_001130955.1|    21  42.1223 0.000184141
first   gi|195972854|ref|NM_015318.3|   21  42.1223 0.000184141
second  gi|261878474|ref|NM_001166295.1|    21  42.1223 0.000184141
second  gi|261878472|ref|NM_001166294.1|    21  42.1223 0.000184141
second  gi|261878551|ref|NM_001166417.1|    21  42.1223 0.000184141
second  gi|261878470|ref|NM_001166293.1|    21  42.1223 0.000184141
second  gi|261878468|ref|NM_014021.3|   21  42.1223 0.000184141
third   gi|115392135|ref|NM_007249.4|   21  42.1223 0.000184141

The main difference is with newer output (I believe due to a switch in the tags used for XML query names). This is BLAST+:

[cjfields@pyrimidine1 new]$ perl parser.pl test_blastn.txt blast
#Query  Hit Score   Bits    Evalue
first   ref|NM_001130955.1| 42  39.2    6e-04
first   ref|NM_015318.3|    42  39.2    6e-04
second  ref|NM_001166295.1| 42  39.2    6e-04
second  ref|NM_001166294.1| 42  39.2    6e-04
second  ref|NM_001166417.1| 42  39.2    6e-04
second  ref|NM_001166293.1| 42  39.2    6e-04
second  ref|NM_014021.3|    42  39.2    6e-04
third   ref|NM_007249.4|    42  39.2    6e-04
[cjfields@pyrimidine1 new]$ perl parser.pl test_blastn.xml blastxml
#Query  Hit Score   Bits    Evalue
Query_1 gi|195972856|ref|NM_001130955.1|    42  39.1570490084919    0.000615407041092949
Query_1 gi|195972854|ref|NM_015318.3|   42  39.1570490084919    0.000615407041092949
Query_2 gi|261878474|ref|NM_001166295.1|    42  39.1570490084919    0.000615407041092949
Query_2 gi|261878472|ref|NM_001166294.1|    42  39.1570490084919    0.000615407041092949
Query_2 gi|261878551|ref|NM_001166417.1|    42  39.1570490084919    0.000615407041092949
Query_2 gi|261878470|ref|NM_001166293.1|    42  39.1570490084919    0.000615407041092949
Query_2 gi|261878468|ref|NM_014021.3|   42  39.1570490084919    0.000615407041092949
Query_3 gi|115392135|ref|NM_007249.4|   42  39.1570490084919    0.000615407041092949
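
As a possible workaround on the script side: BLAST+ XML writes a placeholder such as `Query_1` into `<Iteration_query-ID>` and keeps the original FASTA defline in `<Iteration_query-def>`. The sketch below falls back to the defline when the name looks like such a placeholder. `query_label` is a hypothetical helper, and whether `$result->query_description` actually carries the defline in a given BioPerl version is an assumption to check.

```perl
use strict;
use warnings;

# Hypothetical helper: choose a display name for a query.
# $name    : the name the parser reports (e.g. "Query_1" for BLAST+ XML)
# $defline : the original FASTA defline, if available (assumption: it may
#            be reachable through $result->query_description)
sub query_label {
    my ($name, $defline) = @_;
    # BLAST+ placeholder names look like "Query_<number>"; in that case
    # prefer the first word of the defline when there is one.
    if ($name =~ /^Query_\d+$/ && defined $defline && $defline =~ /^\s*(\S+)/) {
        return $1;
    }
    return $name;
}

print query_label('Query_1', 'first'), "\n";   # prints "first"
print query_label('second',  ''),      "\n";   # prints "second"
```

In a parsing loop this might be called as `query_label($result->query_name, $result->query_description)`.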

The E-values are formatted differently, but that’s not completely unexpected; they won’t be exactly alike due to formatting differences for text output.
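
When comparing text and XML results in a test, one option is to round both E-values to one significant digit before comparing. This is only a sketch: `round_evalue` and `same_evalue` are hypothetical helpers, and the approach assumes the text report printed the value with a single significant digit (as in `2e-04` above), which will not hold for every E-value.

```perl
use strict;
use warnings;

# Hypothetical helper: round an E-value to one significant digit,
# returned as a string in exponent notation.
sub round_evalue {
    my ($evalue) = @_;
    return sprintf('%.0e', $evalue);
}

# Hypothetical helper: true when two E-values agree after rounding.
# The comparison is numeric, so the exponent width does not matter.
sub same_evalue {
    my ($x, $y) = @_;
    return round_evalue($x) == round_evalue($y);
}

# Text value versus XML value for the same hit in the legacy BLAST runs.
print same_evalue('2e-04', 0.000184141) ? "match\n" : "differ\n";   # prints "match"
```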