bioperl / bioperl-live-redmine

Legacy tickets migrated from the OBF Redmine issue tracker: http://redmine.open-bio.org

Bio::SeqIO::embl->next_seq corrupted with "Segmentation fault" when parsing million-line entries #55

Open cjfields opened 8 years ago

cjfields commented 8 years ago

Author Name: brian li (brian li)
Original Redmine Issue: 2823, https://redmine.open-bio.org/issues/2823
Original Date: 2009-05-04
Original Assignee: Bioperl Guts


Platform: BioPerl 1.6.0, Perl 5.8.8, Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server)

When parsing the EMBL file rel_ann_mus_01_r99.dat, which has big million-line entries, Bio::SeqIO::embl->next_seq gives “Segmentation fault”. This happens when trying to get the first entry with next_seq.

A gzipped version of the data file I tried to parse is available at ftp://bio-mirror.net/biomirror/embl/release/rel_ann_mus_01_r99.dat.gz

    # The code I use
    use Bio::SeqIO;

    my $seqio = Bio::SeqIO->new(-file => 'rel_ann_mus_01_r99.dat', -format => 'EMBL');
    my $index = 1;
    while (my $seq = $seqio->next_seq) {
        print "Dealing with entry: $index\n";

        # Some parse process

        $index++;
    }
    # end of code

cjfields commented 8 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2009-05-04T14:23:05Z


Pretty sure this is Bio::Species related, but I’ll have to delve into it a bit further. Moving to 1.6.x just in case.

cjfields commented 8 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2009-05-04T15:58:23Z


Not a bug, per se. The problem here has to do with the sequences you are trying to load into memory, which represent full-length eukaryotic chromosome builds and relevant features. The first record in the file you are trying to load is:

ID CH466519; SV 1; linear; genomic DNA; ANN; MUS; 112224630 BP.

So, yes, you’ll very likely segfault after attempting to load all annotation, features, and sequence information into memory. As we can’t derive what the memory footprint for any particular Bio::Seq is until it’s loaded there really isn’t much we can do until we create a lazily implemented Bio::SeqI (and the proper iterative interfaces for Features). That’s not high on anyone’s priority list, as most consider the best option is to use a relational database capable of storing the data you need and that can access segments of the sequence you want w/o the memory overhead.

I personally use the Ensembl Perl API, but UCSC and Bio::DB::SeqFeature::Store also come to mind.
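
As a rough illustration of how one might spot oversized records before handing a file to Bio::SeqIO, the sketch below reads only the ID lines of an EMBL flat file and reports the sequence length each one declares. It is plain Perl, not part of BioPerl; the helper name `scan_embl_ids` and the 10 Mbp threshold are assumptions made up for this example. The declared length is only a proxy for sequence size and says nothing about how much memory the annotation and features will need.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): scan an EMBL flat file line
# by line and return one [accession, length_in_bp] pair per ID line, without
# building any sequence, annotation, or feature objects.
sub scan_embl_ids {
    my ($fh) = @_;
    my @entries;
    while (my $line = <$fh>) {
        # ID lines look like:
        # ID   CH466519; SV 1; linear; genomic DNA; ANN; MUS; 112224630 BP.
        next unless $line =~ /^ID\s+([^;\s]+);.*?(\d+)\s+BP\.\s*$/;
        push @entries, [$1, $2];
    }
    return @entries;
}

# Usage: perl scan_embl_ids.pl rel_ann_mus_01_r99.dat
if (@ARGV) {
    my $limit = 10_000_000;    # arbitrary threshold, chosen for illustration only
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    for my $entry (scan_embl_ids($fh)) {
        my ($acc, $bp) = @$entry;
        printf "%s\t%d bp%s\n", $acc, $bp, $bp > $limit ? "\tLARGE" : '';
    }
    close $fh;
}
```

Entries flagged this way could then be routed to a database-backed workflow instead of being parsed into a single in-memory object.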

cjfields commented 8 years ago

Original Redmine Comment Author Name: brian li Original Date: 2009-05-04T22:23:25Z


Thanks for suggesting other APIs. I will try to work with them. I have to load all the flat EMBL files into relational databases so that statistical reports are easy to generate.

I agree with you that it’s not a good idea to load all features and sequences into memory. Then I tried Bio::Seq::SeqBuilder->add_unwanted_slot('features', 'seq', 'annotation'). Segfault popped again. Will unwanted slots still be loaded?

I wonder why there is a “Segmentation fault”. Is it because of a memory shortage? I have tracked the memory use with free -s 1. The free memory size stays at about 20GB (buffers counted in). Could you tell me more about why this error happens?

(In reply to comment #2)

Not a bug, per se. The problem here has to do with the sequences you are trying to load into memory, which represent full-length eukaryotic chromosome builds and relevant features. The first record in the file you are trying to load is:

ID CH466519; SV 1; linear; genomic DNA; ANN; MUS; 112224630 BP.

So, yes, you’ll very likely segfault after attempting to load all annotation, features, and sequence information into memory. As we can’t derive what the memory footprint for any particular Bio::Seq is until it’s loaded there really isn’t much we can do until we create a lazily implemented Bio::SeqI (and the proper iterative interfaces for Features). That’s not high on anyone’s priority list, as most consider the best option is to use a relational database capable of storing the data you need and that can access segments of the sequence you want w/o the memory overhead.

I personally use the Ensembl Perl API, but UCSC and Bio::DB::SeqFeature::Store also come to mind.

cjfields commented 8 years ago

Original Redmine Comment Author Name: brian li Original Date: 2009-05-05T22:52:00Z


I agree with Chris that it’s not a good idea to load all features and sequences into memory. Then I tried Bio::Seq::SeqBuilder->add_unwanted_slot('features', 'seq', 'annotation'). Segfault popped again. Will unwanted slots still be loaded?

cjfields commented 8 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2009-05-06T08:35:53Z


I’ll take a look; it may be incomplete integration of SeqBuilder into EMBL parsing.

cjfields commented 8 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2009-05-06T13:38:31Z


(In reply to comment #5)

I’ll take a look; it may be incomplete integration of SeqBuilder into EMBL parsing.

Appears SeqBuilder is not integrated into Bio::SeqIO::embl at all (nor in many of the other SeqIO parsers).

I’m unsure when this can be tackled. I have started rewriting the GenBank/EMBL/Swiss parsers to centralize data handling better, so it’s probably best to do it there and deprecate the older parsers in favor of the newer ones.

cjfields commented 8 years ago

Original Redmine Comment Author Name: brian li Original Date: 2009-05-06T20:47:56Z


I’m unsure when this can be tackled. I have started rewriting the GenBank/EMBL/Swiss parsers to centralize data handling better, so it’s probably best to do it there and deprecate the older parsers in favor of the newer ones.

Thanks. I will try the new ones when they are completed.