bioperl / bioperl-live-redmine

Legacy tickets migrated from the OBF Redmine issue tracker: http://redmine.open-bio.org

Bio::SeqIO Interproscan XML parsing issue #151

Open cjfields opened 8 years ago

cjfields commented 8 years ago

Author Name: Ben L (Ben L) Original Redmine Issue: 3452, https://redmine.open-bio.org/issues/3452 Original Date: 2013-11-08


Please see this post:

http://lists.open-bio.org/pipermail/bioperl-l/2013-November/071273.html

Thank you very much.

Ben

cjfields commented 8 years ago

Original Redmine Comment Author Name: Francisco J. Ossandon Original Date: 2013-12-30T16:22:39Z


I wrote this by email, but I will also leave it here for the record. In summary, the parsing broke because there is a new InterProScan XML format. More below.

I’ve looked into it further, and now that the match error is fixed, I can see that no data is recovered from the parsing because the XML structure has changed completely from what BioPerl currently expects.

This comes from the XML format change between the previous InterProScan 4 (https://www.ebi.ac.uk/Tools/pfa/iprscan/, the one recognized by BioPerl) and the current InterProScan 5 (https://www.ebi.ac.uk/interpro/, https://www.ebi.ac.uk/interpro/resources/schemas/interproscan5/). The new format looks more complex and has different types of matches (coils-match, fingerprints-match, hmmer2-match, hmmer3-match, panther-match, patternscan-match, etc.).

BioPerl looks for the structure “/protein/interpro/match” (InterProScan 4).

But the new structure is “/protein/matches/[different_types]” (InterProScan 5):

MSSHSAPTALQDGAALWSALCVQLELVTSPQQFNTWLRPLRGELQGHELRLLAPNPFVRDWVRERMAELVKEQLQRIAPGFELVFALDEEAAAATSAPTASIAPERSSAPGGHRLNPAFNFQSYVEGKSNQLALAAARQVAQHPGKSYNPLYIYGGVGLGKTHLMQAVGNDILQRQPEAKVLYISSEGFIMDMVRSLQHNTINDFKQRYRKLDALLIDDIQFFAGKDRTQEEFFHTFNALFDGGRQIIITCDRYPKEVEGLEERLQSRFGWGLTVAIQPHDLETRMAIVLCKAEDHGIQLPEEVAFFIAEKIRSHVRELEGALRRVIAHVNFTHKPYSVESAKEALRDLIDVQKRMVSLENIQKVVADYYHIRASEMQSKRRNRNVARPRQMAMALTKELTRHSLPEIGEAFGGRDHTTVLHACRQIEKLRRESAQIEEDYRNLIRILGA

I’m attaching the output produced by IPS4 and IPS5 for the same query sequence for comparison; the new output is much bigger. This means that new code will be needed to properly extract the data from the new format. In summary, this case is more enhancement work (supporting the new IPS5 format) than a bug in the now older IPS4 format.

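To illustrate the structural difference described above, here is a minimal sketch in Python (not BioPerl code) that walks the children of “/protein/matches” and groups them by match type. The sample documents, the `group_matches` helper and the `id` attributes are hypothetical and heavily simplified; real InterProScan output carries XML namespaces and many more elements and attributes, so treat this only as an outline of the idea.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified samples; real InterProScan XML is namespaced
# and far richer than this.
IPS4_SAMPLE = """
<protein id="query">
  <interpro id="IPR000001">
    <match id="PF00001"/>
  </interpro>
</protein>
"""

IPS5_SAMPLE = """
<protein>
  <matches>
    <hmmer3-match id="a"/>
    <hmmer3-match id="b"/>
    <coils-match id="c"/>
  </matches>
</protein>
"""


def group_matches(xml_text):
    """Group the children of <matches> by element name (the match type)."""
    protein = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    # The root element is <protein>, so "./matches/*" corresponds to the
    # "/protein/matches/[different_types]" layout.
    for match in protein.findall("./matches/*"):
        grouped[match.tag].append(match)
    return dict(grouped)


if __name__ == "__main__":
    for match_type, elements in group_matches(IPS5_SAMPLE).items():
        print(match_type, len(elements))
```

Run against the old-style sample, the same lookup finds nothing, which mirrors the "no data recovered" symptom: a parser that only knows one path silently returns an empty result on the other layout.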
cjfields commented 8 years ago

Original Redmine Comment Author Name: Jason Stajich Original Date: 2014-02-23T06:16:41Z


Sounds like you need a version that parses the new format but still maintains the old behavior.
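One possible way to keep the old behavior while adding the new one is to detect the layout first and dispatch to a separate routine per version. The sketch below is in Python rather than Perl and is not taken from BioPerl; the function names, the sample documents and the `id` attribute are assumptions made for illustration, and namespaces are ignored.

```python
import xml.etree.ElementTree as ET


def detect_layout(protein):
    """Guess the layout from which child path exists under <protein>."""
    if protein.find("./interpro/match") is not None:
        return "ips4"
    if protein.find("./matches") is not None:
        return "ips5"
    return "unknown"


def parse_ips4(protein):
    # Old layout: /protein/interpro/match (hypothetical "id" attribute).
    return [m.get("id") for m in protein.findall("./interpro/match")]


def parse_ips5(protein):
    # New layout: /protein/matches/<type>-match; report the match types.
    return [m.tag for m in protein.findall("./matches/*")]


PARSERS = {"ips4": parse_ips4, "ips5": parse_ips5}


def parse_protein(xml_text):
    """Return (layout, results), raising on an unrecognized layout."""
    protein = ET.fromstring(xml_text)
    layout = detect_layout(protein)
    if layout not in PARSERS:
        raise ValueError("unrecognized InterProScan XML layout")
    return layout, PARSERS[layout](protein)


if __name__ == "__main__":
    old = '<protein><interpro id="IPR1"><match id="PF1"/></interpro></protein>'
    new = "<protein><matches><coils-match/><panther-match/></matches></protein>"
    print(parse_protein(old))
    print(parse_protein(new))
```

Raising on an unknown layout, instead of returning an empty list, makes a future format change visible immediately rather than showing up as missing data.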

cjfields commented 8 years ago

Original Redmine Comment Author Name: Francisco J. Ossandon Original Date: 2014-03-04T22:23:27Z


The new parsing requires more time than I can afford right now, so I leave this development open to anyone. =)

The XML files attached to this report can be used as a starting point to that end.