Open cjfields opened 9 years ago
Original Redmine Comment Author Name: Jason Stajich Original Date: 2011-07-16T01:20:15Z
I’m not aware that anyone has volunteered to test and develop improvements to this module to handle any changes in BLAST+, so I’m not surprised that it isn’t working…
Original Redmine Comment Author Name: Chris Fields Original Date: 2011-08-09T17:51:54Z
The main problem with relying on parsing text BLAST output is that NCBI reserves the right to make changes at any point to the format, potentially breaking any parser. IIRC the biopython parser also has this problem, and the general consensus is to use something more reliable (e.g. XML output).
Original Redmine Comment Author Name: Chris Fields Original Date: 2011-08-09T17:59:57Z
The test results are a bit different for me, run locally using bioperl-live, perl 5.14.0 (text and XML parsing). In general the old parser output works fine (BLAST):
[cjfields@pyrimidine1 old]$ perl parser.pl test_blastall.txt blast
#Query Hit Score Bits Evalue
first ref|NM_001130955.1| 21 42.1 2e-04
first ref|NM_015318.3| 21 42.1 2e-04
second ref|NM_001166295.1| 21 42.1 2e-04
second ref|NM_001166294.1| 21 42.1 2e-04
second ref|NM_001166417.1| 21 42.1 2e-04
second ref|NM_001166293.1| 21 42.1 2e-04
second ref|NM_014021.3| 21 42.1 2e-04
third ref|NM_007249.4| 21 42.1 2e-04
[cjfields@pyrimidine1 old]$ perl parser.pl test_blastall.xml blastxml
#Query Hit Score Bits Evalue
first gi|195972856|ref|NM_001130955.1| 21 42.1223 0.000184141
first gi|195972854|ref|NM_015318.3| 21 42.1223 0.000184141
second gi|261878474|ref|NM_001166295.1| 21 42.1223 0.000184141
second gi|261878472|ref|NM_001166294.1| 21 42.1223 0.000184141
second gi|261878551|ref|NM_001166417.1| 21 42.1223 0.000184141
second gi|261878470|ref|NM_001166293.1| 21 42.1223 0.000184141
second gi|261878468|ref|NM_014021.3| 21 42.1223 0.000184141
third gi|115392135|ref|NM_007249.4| 21 42.1223 0.000184141
The main difference is with newer output (I believe due to a switch in the tags used for XML query names). This is BLAST+:
[cjfields@pyrimidine1 new]$ perl parser.pl test_blastn.txt blast
#Query Hit Score Bits Evalue
first ref|NM_001130955.1| 42 39.2 6e-04
first ref|NM_015318.3| 42 39.2 6e-04
second ref|NM_001166295.1| 42 39.2 6e-04
second ref|NM_001166294.1| 42 39.2 6e-04
second ref|NM_001166417.1| 42 39.2 6e-04
second ref|NM_001166293.1| 42 39.2 6e-04
second ref|NM_014021.3| 42 39.2 6e-04
third ref|NM_007249.4| 42 39.2 6e-04
[cjfields@pyrimidine1 new]$ perl parser.pl test_blastn.xml blastxml
#Query Hit Score Bits Evalue
Query_1 gi|195972856|ref|NM_001130955.1| 42 39.1570490084919 0.000615407041092949
Query_1 gi|195972854|ref|NM_015318.3| 42 39.1570490084919 0.000615407041092949
Query_2 gi|261878474|ref|NM_001166295.1| 42 39.1570490084919 0.000615407041092949
Query_2 gi|261878472|ref|NM_001166294.1| 42 39.1570490084919 0.000615407041092949
Query_2 gi|261878551|ref|NM_001166417.1| 42 39.1570490084919 0.000615407041092949
Query_2 gi|261878470|ref|NM_001166293.1| 42 39.1570490084919 0.000615407041092949
Query_2 gi|261878468|ref|NM_014021.3| 42 39.1570490084919 0.000615407041092949
Query_3 gi|115392135|ref|NM_007249.4| 42 39.1570490084919 0.000615407041092949
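A minimal sketch of one way to recover the query names from the BLAST+ XML, written in Python with only the standard library rather than with Bio::SearchIO. It assumes the report keeps the placeholder (Query_1, Query_2, ...) in Iteration_query-ID and the FASTA definition line in Iteration_query-def; the sample XML and the helper name query_names are made up for illustration and are not taken from the attached files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed BLAST+ XML report (not one of the attached files).
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-ID>Query_1</Iteration_query-ID>
      <Iteration_query-def>first</Iteration_query-def>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>2</Iteration_iter-num>
      <Iteration_query-ID>Query_2</Iteration_query-ID>
      <Iteration_query-def>second some free-text description</Iteration_query-def>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>3</Iteration_iter-num>
      <Iteration_query-ID>Query_3</Iteration_query-ID>
      <Iteration_query-def>third</Iteration_query-def>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def query_names(xml_text):
    """Return one name per <Iteration>, preferring the definition line."""
    root = ET.fromstring(xml_text)
    names = []
    for iteration in root.iter("Iteration"):
        query_def = iteration.findtext("Iteration_query-def", default="").strip()
        query_id = iteration.findtext("Iteration_query-ID", default="").strip()
        # Assumption: the first word of the definition line is the FASTA ID.
        # Fall back to the placeholder ID when no definition line is present.
        names.append(query_def.split()[0] if query_def else query_id)
    return names


print(query_names(SAMPLE_XML))
```

Looping over every Iteration element also gives one entry per query, so a report with three queries should yield three names rather than only the first.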
The E-values are formatted differently, but that’s not unexpected: the text report rounds them (e.g. 2e-04) while the XML keeps more digits (e.g. 0.000184141), so the two will never match exactly.
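As a rough illustration of that rounding, here is a small Python sketch (standard library only). It assumes the text report prints these E-values with one significant digit, which matches the listings above but is not something NCBI guarantees for every value; the helper names one_sig_digit and same_evalue are hypothetical.

```python
def one_sig_digit(evalue):
    """Format a number in scientific notation with one significant digit."""
    return "%.0e" % evalue


def same_evalue(text_value, xml_value):
    """True if the XML E-value rounds to the same string as the text E-value."""
    return one_sig_digit(float(xml_value)) == one_sig_digit(float(text_value))


# Values copied from the listings above.
print(one_sig_digit(0.000184141))
print(same_evalue("6e-04", "0.000615407041092949"))
```

Comparing after rounding, rather than comparing the raw strings, is one way to check that the text and XML parsers agree on the same hit.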
Author Name: yi xianfu (yi xianfu) Original Redmine Issue: 3265, https://redmine.open-bio.org/issues/3265 Original Date: 2011-07-16 Original Assignee: Bioperl Guts
Hi Developers,
When I used BioPerl (Bio::SearchIO) to parse BLAST results, I found that it can parse the blast (2.2.25) output but not the blast+ (2.2.25+) output (both were in the default format).
The blast result is parsed properly, but parsing the blast+ result produces no output (except the header I printed).
Besides, according to BioPerl’s wiki (http://www.bioperl.org/wiki/HOWTO%3aSearchIO) (NCBI-BLAST parsing problems), XML format is recommended. But I found that BioPerl has problems parsing blast or blast+ results in XML format.
Parsing the blast XML output returns the first query only, although I have three queries. Parsing the blast+ XML output has the same problem. In addition, it cannot get the query ID properly: they are all named “Query_1”.
BTW: The same bug about XML format has been reported. Bug #3154: SearchIO:blastxml does not return correct query name for blastp (2.2.24) https://redmine.open-bio.org/issues/3154
All test files have been attached. BTW, the FASTA file for building the database is too big, so I give the download website instead of the file itself. Please read the “README” first.
System: Ubuntu 10.10. Version: Perl-5.10.1; BioPerl-1.6.901 (not sure). BTW: The problem has been asked on BioStar (http://biostar.stackexchange.com/questions/10295/bioperl-has-different-behaviours-in-parsing-blast-and-blast-result).
Thanks, yixf