bioperl / bioperl-live-redmine

Legacy tickets migrated from the OBF Redmine issue tracker: http://redmine.open-bio.org

WU-BLAST XML support #45

Open cjfields opened 9 years ago

cjfields commented 9 years ago

Author Name: Jason Wood (Jason Wood)
Original Redmine Issue: 2686, https://redmine.open-bio.org/issues/2686
Original Date: 2008-11-25
Original Assignee: Chris Fields


The regular expressions in the _chunk_normalblast method of the Bio::SearchIO::blastxml package are too specific to match WU-BLAST 2.0 XML output (mformat=7 option). I have a patch against the current HEAD (rev 14987) that fixes this problem.

Output created with the xmlcompact option is not parsed because no line feeds are included. I was unable to figure out a patch that wouldn’t break anything, so I currently just preparse the file with the following code:

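    use IO::File;
    use Bio::SearchIO;

    my $fh  = IO::File->new($ARGV[0]) or die "Unable to open $ARGV[0]: $!";
    my $tfh = IO::File->new_tmpfile or die "Unable to open temp file: $!";
    while (my $line = <$fh>) {
        $line =~ s/></>\n</g;    # put each tag on its own line
        print $tfh $line;
    }
    $tfh->seek(0, 0);    # rewind so the parser reads from the start
    my $searchIO = Bio::SearchIO->new(-format => 'blastxml', -fh => $tfh);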

I am sure there is a simple fix for this, but I don’t know where best to look in the current code base.

cjfields commented 9 years ago

Original Redmine Comment
Author Name: Jason Wood
Original Date: 2008-11-25T16:46:52Z


Created an attachment (id=1087) Patch for WU-BLAST support

cjfields commented 9 years ago

Original Redmine Comment
Author Name: Chris Fields
Original Date: 2008-11-25T18:14:22Z


I committed the patch to svn. I can try adding the preprocessing step for xmlcompact data into the main loop; I just need some test data.

cjfields commented 9 years ago

Original Redmine Comment
Author Name: Jason Wood
Original Date: 2008-11-26T11:54:14Z


Created an attachment (id=1114) WU-BLAST xmlcompact example

This was generated on a Condor cluster using WU-BLAST 2.0 with the xmlcompact option. It is malformed XML, because the output of each process is concatenated into one file.

cjfields commented 9 years ago

Original Redmine Comment
Author Name: Chris Fields
Original Date: 2008-11-26T12:04:07Z


(In reply to comment #3)

Pushing to 1.6.x, though I’ll look into getting it done for 1.6 depending on how hard it will be to implement. Looks pretty scary!

cjfields commented 9 years ago

Original Redmine Comment
Author Name: Chris Fields
Original Date: 2008-11-26T13:10:51Z


I have added support for xmlcompact to blastxml in svn (pass '-xmlcompact => 1' as an additional parameter to the parser to activate preprocessing). It wasn’t too hard to add (it just runs a preprocessing step, then switches out the filehandles), but it may need more work.

Closing out.

cjfields commented 9 years ago

Original Redmine Comment
Author Name: Jason Wood
Original Date: 2008-12-08T15:45:26Z


There is still a bug to be figured out with the xmlcompact option. The swap regex runs out of memory / stack space when dealing with very long strings (much longer than the attached example). The only solution that I could figure out is a dirty dirty hack, not a proper fix. Instead of trying to run the swap on the entire $line, I break the $line into multiple 1000 char strings before running the swap regex. This method allows the regex engine to do its thing without running out of memory, but it adds quite a bit of unneeded complexity.

Perhaps some regex guru has a better way of doing this…

(Note: This code snippet replaces the code in my previous comment, I can produce a patch against the code in blastxml if this method is acceptable)

    use IO::File;
    use Bio::SearchIO;

    my $pattern     = '><';
    my $replacement = ">\n<";
    my $max_length  = 1000;

    my $fh  = IO::File->new($ARGV[0]) or die "Unable to open $ARGV[0]: $!";
    my $tfh = IO::File->new_tmpfile or die "Unable to open temp file: $!";
    foreach my $line (<$fh>) {
        my $length = length $line;
        for (my $i = 0; $i <= int($length / $max_length); $i++) {
            # substr stops at the end of the string, so the last piece may be shorter.
            # Caveat: a '><' pair that straddles a piece boundary is left unsplit.
            my $l = substr($line, $max_length * $i, $max_length);
            $l =~ s/$pattern/$replacement/g;
            print $tfh $l;
        }
    }
    $tfh->seek(0, 0);    # rewind so the parser reads from the start
    my $searchIO = Bio::SearchIO->new(-format => 'blastxml', -fh => $tfh);

cjfields commented 9 years ago

Original Redmine Comment
Author Name: Chris Fields
Original Date: 2008-12-08T15:51:13Z


(In reply to comment #6)

I think there is a way to do this using a closure and s///; I will look into it when I have time. It may not make it into 1.6, but it will be fixed for 1.6.x.
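
For comparison, here is a minimal sketch of a different technique from the closure idea above, and not what was committed to blastxml: set Perl's input record separator ($/) to '><' so that each read returns one tag-sized record instead of an entire line. This should keep memory use small even for very long lines, and because records always end exactly at a '><' pair, no pair can be split across a chunk boundary. The helper name split_compact_xml and the sample string are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): copy $in to $out, inserting a
# newline between every adjacent '><' pair without reading a whole line.
sub split_compact_xml {
    my ($in, $out) = @_;
    local $/ = '><';    # each read now ends at the next '><' (or at EOF)
    while (defined(my $record = <$in>)) {
        if (chomp $record) {
            # chomp removed the trailing '><'; write it back with a newline
            print {$out} $record, ">\n<";
        }
        else {
            # last record: it did not end with the separator
            print {$out} $record;
        }
    }
    return;
}

# Demo on in-memory filehandles with a made-up compact document.
my $compact = '<a><b>text</b><c/></a>';
my $split   = '';
open my $in,  '<', \$compact or die "Cannot open input: $!";
open my $out, '>', \$split   or die "Cannot open output: $!";
split_compact_xml($in, $out);
close $out;
print $split, "\n";
```

With real data, the same helper could be given the IO::File handles from the earlier snippets; the temporary filehandle would still need to be rewound with $tfh->seek(0, 0) before it is passed to Bio::SearchIO via -fh.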