bioperl / bioperl-live-redmine

Legacy tickets migrated from the OBF Redmine issue tracker: http://redmine.open-bio.org
0 stars 0 forks source link

Parsing multiple contig EMBL records (CO lines instead of SQ lines) #67

Open cjfields opened 8 years ago

cjfields commented 8 years ago

Author Name: Peter Cock (Peter Cock) Original Redmine Issue: 2982, https://redmine.open-bio.org/issues/2982 Original Date: 2010-01-11 Original Assignee: Bioperl Guts


Reported on the mailing list: http://lists.open-bio.org/pipermail/bioperl-l/2010-January/031889.html

I’m running bioperl-live from SVN, just updated to revision 16648.

$ perl -MBio::Root::Version -e ‘print $Bio::Root::Version::VERSION,“\n”’ 1.0069

I am trying to get Bio::SeqIO to convert a multiple record EMBL file into GenBank format, piping the data via stdin/stdout using the following trivial Perl script:

!/usr/bin/env perl

use Bio::SeqIO; my $in = Bio::SeqIO->new(-fh => *STDIN, -format => ‘embl’); my $out = Bio::SeqIO->new(-format => ‘genbank’); while (my $seq = $in->next_seq) { $out->write_seq($seq) };

This only seems to find the first EMBL record in my example files. For example, this simple file has just two contig records: http://biopython.open-bio.org/SRC/biopython/Tests/EMBL/Human\_contigs.embl

This is just the first two records taken from a much larger EMBL file rel_con_hum_01_r102.dat downloaded and uncompressed from: ftp://ftp.ebi.ac.uk/pub/databases/embl/release/rel_con_hum_01_r102.dat.gz

Trying both these examples as input, BioPerl just gives a single GenBank record as output (the first EMBL entry in the input).

According to Mark Jensen, this is probably due to the fact that these are EMBL contig records which have CO lines instead of SQ lines.

cjfields commented 8 years ago

Original Redmine Comment Author Name: Mark A. Jensen Original Date: 2010-01-11T10:40:40Z


Thanks Peter—

My prelim investigations point to following region in embl.pm (ll.423-441)

if ( $buffer =~ /^CO/ ) { until ( !defined ($buffer) ) { my $ftunit = $self->_read_FTHelper_EMBL(\$buffer);

process ftunit

push(@features, $ftunit->_generic_seqfeature($self->location_factory(), $name));

if ( $buffer !~ /^CO/ ) { last; } } } if ( $buffer !~ /^SQ/ ) { while ( defined ($_ = $self->_readline) ) { /^SQ/ && last; } }

MAJ

cjfields commented 8 years ago

Original Redmine Comment Author Name: Mark A. Jensen Original Date: 2010-01-16T16:48:31Z


Hi Peter, Please try now with r16704; if it works, pls report here—thx! patch, tests and test data added. MAJ

cjfields commented 8 years ago

Original Redmine Comment Author Name: Peter Cock Original Date: 2010-01-18T04:46:22Z


I updated to revision 16719, its looking much better Mark.

My original example does now return multiple records. However, there is a slight problem with the GenBank output of the CONTIG lines - they are indented, perhaps being treated like a feature?

e.g.

$ ../bioperl_embl2genbank.pl < Human_contigs.embl LOCUS AJ229040 0 bp genomic DNAlinear HUM 22-JAN-2004 … FEATURES Location/Qualifiers source 1..958952 /mol_type=“genomic DNA” /map=“q22” /db_xref=“taxon:9606” /chromosome=“21” /organism=“Homo sapiens” CONTIG join(AJ229041.1:1..323000,AJ229042.1:51..348050, AJ229043.1:51..288002)

// LOCUS AL954800 0 bp genomic DNAlinear HUM 28-JAN-2003 …

[Hopefully bugzilla won’t mangle the intents too much]

Consider this script, bioperl_genbank2genbank.pl instead:

!/usr/bin/env perl

use Bio::SeqIO; my $in = Bio::SeqIO->new(-fh => *STDIN, -format => ‘genbank’); my $out = Bio::SeqIO->new(-format => ‘genbank’); while (my $seq = $in->next_seq) { $out->write_seq($seq) };

Here is an example GenBank file with a contig instead of sequence: http://biopython.open-bio.org/SRC/biopython/Tests/GenBank/NT\_019265.gb

$ bioperl_genbank2genbank.pl < NT_019265.gb LOCUS NT_019265 1250660 bp DNA linear CON 16-OCT-2001 DEFINITION Homo sapiens chromosome 1 working draft sequence segment. ACCESSION NT_019265 … mRNA join(342430..342515,363171..363300,365741..365814, 376398..376499,390169..390297,391257..391379, 392606..392679,398230..398419,399082..399167, 399534..399650,405844..405913,406704..406761, 406868..407010,407962..408091,408508..409092) /db_xref=“LocusID:55735” /gene=“FLJ10737” /product=“hypothetical protein FLJ10737” /transcript_id=“XM_057697.1” CONTIG join(AL391218.9:105173..108462,gap(100), complement(AL512330.12:1..182490), complement(AL590128.4:9034..81287),gap(100), AL591163.7:85799..94832,gap(100),AL591163.7:94933..113245,gap(100), AL591163.7:42173..44897,complement(AL590128.4:1..6208), AL591163.7:51307..52779,gap(100),AL591163.7:52880..85698,gap(100), AL591163.7:113346..126143,complement(AL159177.12:184729..186047), AL031447.4:1..112158,complement(AL159177.12:1..72671), complement(AL591866.12:23507..86371),AL031848.11:1..142965, AL031847.17:1..166418,AL035406.25:1..161651, complement(AL356261.20:94599..98345), complement(AC026968.3:54432..54579),gap(100), complement(AC062024.2:98529..107911),gap(100), AC062024.2:7713..11594,gap(100),complement(AL356261.20:1..94498), complement(AL356693.18:19988..70853),gap(100), AL356693.18:17351..19887,gap(100), complement(AL356693.18:3037..17250),gap(100), complement(AL356693.18:1..2936),gap(100),AC026968.3:675..2393, gap(100),AC026968.3:1..574,gap(100),AL356261.20:179029..182233)

//

Here you can see the CONTIG line is not intended, and the continuations are indented by 12 spaces (good).

i.e. I think that parsing and writing of GenBank files with a CONTIG line works, but that parsing EMBL CO lines isn’t quite compatible with this.

Also output of EMBL CO lines isn’t working, consider a script bioperl_genbank2embl.pl like this:

!/usr/bin/env perl

use Bio::SeqIO; my $in = Bio::SeqIO->new(-fh => *STDIN, -format => ‘embl’); my $out = Bio::SeqIO->new(-format => ‘genbank’); while (my $seq = $in->next_seq) { $out->write_seq($seq) };

And then:

$ bioperl_genbank2embl.pl < NT_019265.gb ID NT_019265; SV 6; linear; unassigned DNA; STD; UNC; 1250660 BP. XX AC NT_019265; … FT /product=“hypothetical protein FLJ10737” FT /transcript_id=“XM_057697.1” XX SQ Sequence 1250660 BP; 0 A; 0 C; 0 G; 0 T; 1250660 other; //

No contig information (no CO lines) but instead an SQ line is given (but of course, no sequence).

cjfields commented 8 years ago

Original Redmine Comment Author Name: Mark A. Jensen Original Date: 2010-01-18T08:46:26Z


I knew it seemed too easy! Thanks Peter, will continue to hack away— MAJ

cjfields commented 8 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2010-01-18T09:14:29Z


Seems like an inconsistency between how the data is stored when coming from EMBL (as Feature) vs GenBank (as Annotation). Is that correct? I set up the storage into the latter as it pertains to the entire described region.

cjfields commented 8 years ago

Original Redmine Comment Author Name: Mark A. Jensen Original Date: 2010-01-18T09:46:56Z


Indeed, in embl.pm the CO lines are parsed as Features-

cjfields commented 8 years ago

Original Redmine Comment Author Name: Peter Cock Original Date: 2010-01-18T10:05:52Z


(In reply to comment #5)

Seems like an inconsistency between how the data is stored when coming from EMBL (as Feature) vs GenBank (as Annotation). Is that correct? I set up the storage into the latter as it pertains to the entire described region.

That fits the behaviour I observed. I would say the contig information applies to the entire sequence record (as done for GenBank support).

(In reply to comment #6)

Indeed, in embl.pm the CO lines are parsed as Features-

So should they be treated as Annotation to match the GenBank parser?

Peter

cjfields commented 8 years ago

Original Redmine Comment Author Name: Mark A. Jensen Original Date: 2010-01-21T08:35:06Z


Seems to DWYM when we create an annotation from the CO lines. I can commit later today, but would we like to make an ‘advanced’ option like -CONTIG_AS_FEATURE or -CONTIG_AS_ANNOT (depending on the default)? As we’ve established, the API is really ‘embl contigs as features’ and ‘genbank contigs as annots’.

(In reply to comment #7)

(In reply to comment #5)

Seems like an inconsistency between how the data is stored when coming from EMBL (as Feature) vs GenBank (as Annotation). Is that correct? I set up the storage into the latter as it pertains to the entire described region.

That fits the behaviour I observed. I would say the contig information applies to the entire sequence record (as done for GenBank support).

(In reply to comment #6)

Indeed, in embl.pm the CO lines are parsed as Features-

So should they be treated as Annotation to match the GenBank parser?

Peter

cjfields commented 8 years ago

Original Redmine Comment Author Name: Mark A. Jensen Original Date: 2010-01-21T08:44:20Z


P.S. — and the CO lines are found in the embl format feature table. So perhaps the right kludge should be in the genbank parser (i.e., contig found as features should be converted to annotations) ? (In reply to comment #8)

Seems to DWYM when we create an annotation from the CO lines. I can commit later today, but would we like to make an ‘advanced’ option like -CONTIG_AS_FEATURE or -CONTIG_AS_ANNOT (depending on the default)? As we’ve established, the API is really ‘embl contigs as features’ and ‘genbank contigs as annots’.

(In reply to comment #7)

(In reply to comment #5)

Seems like an inconsistency between how the data is stored when coming from EMBL (as Feature) vs GenBank (as Annotation). Is that correct? I set up the storage into the latter as it pertains to the entire described region.

That fits the behaviour I observed. I would say the contig information applies to the entire sequence record (as done for GenBank support).

(In reply to comment #6)

Indeed, in embl.pm the CO lines are parsed as Features-

So should they be treated as Annotation to match the GenBank parser?

Peter

cjfields commented 8 years ago

Original Redmine Comment Author Name: Peter Cock Original Date: 2010-01-21T09:55:56Z


(In reply to comment #9)

P.S. — and the CO lines are found in the embl format feature table.

What makes you say that? Looking at the example I referred to earlier, http://biopython.open-bio.org/SRC/biopython/Tests/EMBL/Human\_contigs.embl

You have various header lines, then the features, then the contig:

… FH Key Location/Qualifiers FH FT source 1..87191216 FT /organism=“Homo sapiens” FT /chromosome=“14” FT /mol_type=“genomic DNA” FT /db_xref=“taxon:9606” XX CO join(complement(AL512310.3:2094..185605), CO complement(AL391156.3:116..136599),complement(AL359218.4:329..156319), …

The contig information is on CO lines, not FT lines (feature table). i.e. The contig is stored separately from the features (just as in a GenBank file).

Peter

cjfields commented 8 years ago

Original Redmine Comment Author Name: Mark A. Jensen Original Date: 2010-01-21T10:14:50Z


ok, I see (learning embl as I go here). So in the parser, the CO lines were put in a feature ad hoc. I see two routes: leave embl.pm as-is and have genbank.pm look for ‘contig’ features, converting them to annotations, or change the embl.pm api and put CO lines into annotations.

cjfields commented 8 years ago

Original Redmine Comment Author Name: Peter Cock Original Date: 2010-01-21T10:27:35Z


(In reply to comment #11)

ok, I see (learning embl as I go here). So in the parser, the CO lines were put in a feature ad hoc.

That is my impression as an outsider.

I see two routes: leave embl.pm as-is and have genbank.pm look for ‘contig’ features, converting them to annotations, or change the embl.pm api and put CO lines into annotations.

Again, as an outsider, I would say “fix” the EMBL parser to store the contig in annotation (as done by the GenBank parser). And then fix the EMBL output code to match. But this is not my decision to make.

cjfields commented 8 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2010-01-21T12:25:48Z


(In reply to comment #12)

(In reply to comment #11)

ok, I see (learning embl as I go here). So in the parser, the CO lines were put in a feature ad hoc.

That is my impression as an outsider.

I see two routes: leave embl.pm as-is and have genbank.pm look for ‘contig’ features, converting them to annotations, or change the embl.pm api and put CO lines into annotations.

Again, as an outsider, I would say “fix” the EMBL parser to store the contig in annotation (as done by the GenBank parser). And then fix the EMBL output code to match. But this is not my decision to make.

Go ahead and do just that, for consistency. It fits better as annotation anyway, and will be easier to deal with.

Somewhat related to this, one thing that came up at the GMOD meeting is how we are addressing users needs re: Bio*. Consistency is key for proper file conversion, but the focus shouldn’t be only on conversion but how we’re actually using this data.

For instance, should we be thinking about converting this into some kind of coordinate system (say, Bio::Location via FTLcationFactory) for further use? Do users actually use this information, or just want the full-length file? And so on…

cjfields commented 8 years ago

Original Redmine Comment Author Name: Mark A. Jensen Original Date: 2010-01-21T15:28:44Z


OK— please try r16733 — cheers

cjfields commented 8 years ago

Original Redmine Comment Author Name: Peter Cock Original Date: 2010-01-22T05:46:02Z


(In reply to comment #14)

OK— please try r16733 — cheers

I updated to SVN revision 16736, and generally things look much improved.

I’ve only spotted one thing, which may be a GenBank output issue rather than an EMBL issue. Try this input file (although I think any GenBank file with a mutliline CONTIG entry would do): http://biopython.open-bio.org/SRC/biopython/Tests/GenBank/NT\_019265.gb

This ends like this:

… CONTIG join(AL391218.9:105173..108462,gap(100), complement(AL512330.12:1..182490), complement(AL590128.4:9034..81287),gap(100), AL591163.7:85799..94832,gap(100),AL591163.7:94933..113245,gap(100), AL591163.7:42173..44897,complement(AL590128.4:1..6208), AL591163.7:51307..52779,gap(100),AL591163.7:52880..85698,gap(100), AL591163.7:113346..126143,complement(AL159177.12:184729..186047), AL031447.4:1..112158,complement(AL159177.12:1..72671), complement(AL591866.12:23507..86371),AL031848.11:1..142965, AL031847.17:1..166418,AL035406.25:1..161651, complement(AL356261.20:94599..98345), complement(AC026968.3:54432..54579),gap(100), complement(AC062024.2:98529..107911),gap(100), AC062024.2:7713..11594,gap(100),complement(AL356261.20:1..94498), complement(AL356693.18:19988..70853),gap(100), AL356693.18:17351..19887,gap(100), complement(AL356693.18:3037..17250),gap(100), complement(AL356693.18:1..2936),gap(100),AC026968.3:675..2393, gap(100),AC026968.3:1..574,gap(100),AL356261.20:179029..182233) //

However, if we ask SeqIO to read and write this,

$ ./bioperl_genbank2genbank.pl < NT_019265.gb … CONTIG join(AL391218.9:105173..108462,gap(100),complement(AL512330.12:1..182490),complement(AL590128.4:9034..81287),gap(100),AL591163.7:85799..94832,gap(100),AL591163.7:94933..113245,gap(100),AL591163.7:42173..44897,complement(AL590128.4:1..6208),AL591163.7:51307..52779,gap(100),AL591163.7:52880..85698,gap(100),AL591163.7:113346..126143,complement(AL159177.12:184729..186047),AL031447.4:1..112158,complement(AL159177.12:1..72671),complement(AL591866.12:23507..86371),AL031848.11:1..142965,AL031847.17:1..166418,AL035406.25:1..161651,complement(AL356261.20:94599..98345),complement(AC026968.3:54432..54579),gap(100),complement(AC062024.2:98529..107911),gap(100),AC062024.2:7713..11594,gap(100),complement(AL356261.20:1..94498),complement(AL356693.18:19988..70853),gap(100),AL356693.18:17351..19887,gap(100),complement(AL356693.18:3037..17250),gap(100),complement(AL356693.18:1..2936),gap(100),AC026968.3:675..2393,gap(100),AC026968.3:1..574,gap(100),AL356261.20:179029..182233)

//

i.e. The contig line isn’t being line split on output.

Also, there seems to be a blank line between the contig and terminal // line.

This blank line is also seen in my EMBL to GenBank example, but here the contig lines look good:

./bioperl_embl2genbank.pl < Human_contigs.embl

I haven’t checked to see if the GenBank and EMBL parsers are consistent about storing the contig as a long string without newlines or not. This may help explain the occasional lack of line wrapping on output.

Peter