bioperl / bioperl-live-redmine

Legacy tickets migrated from the OBF Redmine issue tracker: http://redmine.open-bio.org

Bio::SeqIO GenBank files not entirely up to spec. #131

Open cjfields opened 8 years ago

cjfields commented 8 years ago

Author Name: Kai Blin (@kblin)
Original Redmine Issue: 3325, https://redmine.open-bio.org/issues/3325
Original Date: 2012-02-10
Original Assignee: Kai Blin


While playing a bit with GenBank files, I noticed that our generator code doesn’t quite follow the GenBank release notes spec. For example, a GenBank sequence always has to contain a “source” feature, and the last character of the DEFINITION field has to be a period. Our generated files ensure neither of these.

Nothing too serious, so opening a bug to remind me to fix it.
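The two rules mentioned above can be checked on the flat-file text itself. What follows is only a minimal sketch, not BioPerl code: `genbank_spec_problems` is a hypothetical helper, and it assumes the record is passed in as one string, that continuation lines of the definition are indented by 12 spaces, and that feature keys start after 5 spaces.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): inspect one GenBank record,
# passed as a single string, and return a list of human-readable problems.
sub genbank_spec_problems {
    my ($record) = @_;
    my @problems;

    # Collect the DEFINITION line plus its indented continuation lines.
    if ( $record =~ /^DEFINITION[ \t]*(.*(?:\n {12}.*)*)/m ) {
        my $definition = $1;
        $definition =~ s/\s+$//;
        push @problems, 'DEFINITION does not end with a period'
            unless $definition =~ /\.$/;
    }
    else {
        push @problems, 'no DEFINITION line';
    }

    # Assumption: feature keys are indented by exactly 5 spaces.
    push @problems, 'no "source" feature'
        unless $record =~ /^ {5}source\s/m;

    return @problems;
}

# Made-up record that breaks both rules.
my $record = join "\n",
    'LOCUS       example                   10 bp    DNA     linear',
    'DEFINITION  Example record without a trailing period',
    'FEATURES             Location/Qualifiers',
    '     gene            1..10',
    '//', '';

print "$_\n" for genbank_spec_problems($record);
```

For the made-up record this reports both problems. Running it over the output of `write_seq` would be one way to see which of the two rules a generated file misses.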

cjfields commented 8 years ago

Original Redmine Comment Author Name: Rodolfo Aramayo Original Date: 2012-06-26T04:59:16Z


When I run the following script:

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

# Round trip: read each GenBank record and write it back out as GenBank.
my $informat  = 'genbank';
my $outformat = 'genbank';

# Input stream over the original file.
my $in = Bio::SeqIO->new(
    -format => $informat,
    -file   => './ncrassa_wt_LinkageGroup07.gbk',
);

# Output stream; the leading '>' opens the file for writing.
my $out = Bio::SeqIO->new(
    -format => $outformat,
    -file   => '>./ncrassa_wt_LinkageGroup07.gbk.gbk',
);

while ( my $seq = $in->next_seq ) {
    $out->write_seq($seq);
}

I get the following (displayed using: diff -y ncrassa_wt_LinkageGroup07.gbk ncrassa_wt_LinkageGroup07.gbk.gbk | less):

LOCUS       LinkageGroup_7       4255303 bp    DNA     linear | LOCUS       LinkageGroup_7       4255303 bp    DNA     linear
DEFINITION  Neurospora crassa strain OR74A chromosome Linkage   DEFINITION  Neurospora crassa strain OR74A chromosome Linkage
            LinkageGroup_7, whole genome shotgun sequence.                  LinkageGroup_7, whole genome shotgun sequence.
ACCESSION                                                     | ACCESSION   unknown
VERSION                                                       <
KEYWORDS    WGS.                                                KEYWORDS    WGS.
SOURCE      Neurospora crassa                                   SOURCE      Neurospora crassa
  ORGANISM  Neurospora crassa                                     ORGANISM  Neurospora crassa
            Unclassified.                                                   Unclassified.
REFERENCE   1  (bases 1 to 4255303)                             REFERENCE   1  (bases 1 to 4255303)
  AUTHORS   Galagan,J., Sachs,M., Young,S.K., Zeng,Q., Koehrs     AUTHORS   Galagan,J., Sachs,M., Young,S.K., Zeng,Q., Koehrs
            Alvarado,L., Berlin,A., Bochicchio,J., Borenstein |             Alvarado,L., Berlin,A., Bochicchio,J., Borenstein
            Chen,Z., Engels,R., Freedman,E., Gellesch,M., Gol |             Chapman,S.B., Chen,Z., Engels,R., Freedman,E., Ge
            Griggs,A., Gujja,S., Heilman,E., Heiman,D., Hepbu |             Goldberg,J., Griggs,A., Gujja,S., Heilman,E., Hei
            Jen,D., Larson,L., Lewis,B., Mehta,T., Park,D., P |             Hepburn,T., Howarth,C., Jen,D., Larson,L., Lewis,
            Richards,J., Roberts,A., Saif,S., Shea,T., Shenoy |             Park,D., Pearson,M., Richards,J., Roberts,A., Sai
            Stolte,C., Sykes,S., Thomson,T., Walk,T., White,J |             Shenoy,N., Sisk,P., Stolte,C., Sykes,S., Thomson,
            Haas,B., Nusbaum,C. and Birren,B.                 |             White,J., Yandava,C., Haas,B., Nusbaum,C. and Bir
  CONSRTM   The Broad Institute Genome Sequencing Platform        CONSRTM   The Broad Institute Genome Sequencing Platform
  TITLE     The Genome Sequence of Neurospora crassa strain O     TITLE     The Genome Sequence of Neurospora crassa strain O
  JOURNAL   Unpublished                                           JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 4255303)                             REFERENCE   2  (bases 1 to 4255303)
  AUTHORS   Galagan,J., Sachs,M., Young,S.K., Zeng,Q., Koehrs     AUTHORS   Galagan,J., Sachs,M., Young,S.K., Zeng,Q., Koehrs
            Alvarado,L., Berlin,A., Bochicchio,J., Borenstein |             Alvarado,L., Berlin,A., Bochicchio,J., Borenstein
            Chen,Z., Engels,R., Freedman,E., Gellesch,M., Gol |             Chapman,S.B., Chen,Z., Engels,R., Freedman,E., Ge
            Griggs,A., Gujja,S., Heilman,E., Heiman,D., Hepbu |             Goldberg,J., Griggs,A., Gujja,S., Heilman,E., Hei
            Jen,D., Larson,L., Lewis,B., Mehta,T., Park,D., P |             Hepburn,T., Howarth,C., Jen,D., Larson,L., Lewis,
            Richards,J., Roberts,A., Saif,S., Shea,T., Shenoy |             Park,D., Pearson,M., Richards,J., Roberts,A., Sai
            Stolte,C., Sykes,S., Thomson,T., Walk,T., White,J |             Shenoy,N., Sisk,P., Stolte,C., Sykes,S., Thomson,
            Haas,B., Nusbaum,C. and Birren,B.                 |             White,J., Yandava,C., Haas,B., Nusbaum,C. and Bir
  CONSRTM   The Broad Institute Genome Sequencing Platform    |   CONSRTM   The Broad Institute Genome Sequencing Platform Th
                                                              >             Genome Sequencing Platform
  TITLE     Direct Submission                                     TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2011) Broad Institute of MIT an     JOURNAL   Submitted (02-MAR-2011) Broad Institute of MIT an
            Cambridge Center, Cambridge, MA 02142, USA                      Cambridge Center, Cambridge, MA 02142, USA
FEATURES             Location/Qualifiers                        FEATURES             Location/Qualifiers
     source          1..4255303                                      source          1..4255303
                     /organism="Neurospora crassa"            <
                     /mol_type="genomic DNA"                                         /mol_type="genomic DNA"
                     /strain="OR74A"                                                 /strain="OR74A"
                     /chromosome="Linkage Group VII"                                 /chromosome="Linkage Group VII"
                                                              >                      /organism="Neurospora crassa"

and

     CDS             complement(33866..34930)                        CDS             complement(33866..34930)
                     /locus_tag="NCU05900"                                           /locus_tag="NCU05900"
                     /codon_start=1                                                  /codon_start=1
                     /product="hypothetical protein"          <
                     /protein_id="WGS:AABX:NCU05900T0"                               /protein_id="WGS:AABX:NCU05900T0"
                     /translation="MSKSPHVSPDVRSSPPDLLPPPSYTE                        /translation="MSKSPHVSPDVRSSPPDLLPPPSYTE
                     VPIPTRTGESPLTTHLRTIPSRLRSAQHSHSTAQSSRDAF                        VPIPTRTGESPLTTHLRTIPSRLRSAQHSHSTAQSSRDAF
                     MPKTPKVAELVLVPTEGLPGVESSEGATSTGKELARKRAE                        MPKTPKVAELVLVPTEGLPGVESSEGATSTGKELARKRAE
                     EVVKVVCVSAAPDQGRQVTDEKGRTVDRKRAGEDSGSGSA                        EVVKVVCVSAAPDQGRQVTDEKGRTVDRKRAGEDSGSGSA
                     TSPSSSRSQFGPEEAWNWFATPTLARRIASLLRPEPTLAR                        TSPSSSRSQFGPEEAWNWFATPTLARRIASLLRPEPTLAR
                     KKSGFGSFFRRTSKSEPQTPTPTERLLTPVRDGAMLEQDR                        KKSGFGSFFRRTSKSEPQTPTPTERLLTPVRDGAMLEQDR
                     GLWESVGGWGVVVSIKVGRL"                                           GLWESVGGWGVVVSIKVGRL"
                                                              >                      /product="hypothetical protein"
     gene            36560..38546                                    gene            36560..38546
                     /locus_tag="NCU05899"                                           /locus_tag="NCU05899"
     mRNA            join(36560..36666,36734..38546)                 mRNA            join(36560..36666,36734..38546)
                     /locus_tag="NCU05899"                                           /locus_tag="NCU05899"
                     /product="flotillin domain-containing pr                        /product="flotillin domain-containing pr
                     /transcript_id="WGS:AABX:mrna_NCU05899T0                        /transcript_id="WGS:AABX:mrna_NCU05899T0
     CDS             join(36600..36666,36734..38244)                 CDS             join(36600..36666,36734..38244)
                     /locus_tag="NCU05899"                                           /locus_tag="NCU05899"
                     /codon_start=1                                                  /codon_start=1
                     /product="flotillin domain-containing pr <
                     /protein_id="WGS:AABX:NCU05899T0"                               /protein_id="WGS:AABX:NCU05899T0"
                     /translation="MASYKIAAPDEYLAITGMGVKTLKIT                        /translation="MASYKIAAPDEYLAITGMGVKTLKIT
                     HDYAMSLQAMTKEKLQFLLPVVFTVGPDVNQRGANIRMFH                        HDYAMSLQAMTKEKLQFLLPVVFTVGPDVNQRGANIRMFH
                     SAVRREDRGDALMKFAMLLADSGRDKGPNNHDFLEGIVKG                        SAVRREDRGDALMKFAMLLADSGRDKGPNNHDFLEGIVKG
                     FSEREVFKRRIFRNIQSELDQFGLKIYNANVKELKDAPGS                        FSEREVFKRRIFRNIQSELDQFGLKIYNANVKELKDAPGS
                     RIDVAEAQLRGNVGTQKRKGEEAREVAKIQGEQDRELAKI                        RIDVAEAQLRGNVGTQKRKGEEAREVAKIQGEQDRELAKI
                     EALLKTRQVELDRDVQIAGIQAARNTEAEDETLKREVQIK                        EALLKTRQVELDRDVQIAGIQAARNTEAEDETLKREVQIK
                     IAREAKQQAADAKAYEIEKEAQANYEKAKQHTEADVYETK                        IAREAKQQAADAKAYEIEKEAQANYEKAKQHTEADVYETK
                     QQKLRAAEGMSAMAEAYAKMSHAFGGPQGLLQYMMIEKGT                        QQKLRAAEGMSAMAEAYAKMSHAFGGPQGLLQYMMIEKGT
                     ISVWNTGAEAGSSGGAGEQQSSMATMRNIYQMLPPLMTTI                        ISVWNTGAEAGSSGGAGEQQSSMATMRNIYQMLPPLMTTI
                     QMSEVERRGQASNGQKE"                                              QMSEVERRGQASNGQKE"
                                                              >                      /product="flotillin domain-containing pr
     gene            52953..55087                                    gene            52953..55087

That is, the “/product=” qualifier jumps to the end of the feature, after the “/translation=” sequence.

cjfields commented 8 years ago

Original Redmine Comment Author Name: Rodolfo Aramayo Original Date: 2012-06-26T05:05:12Z


APOLOGIES for previous formatting…first report…


cjfields commented 8 years ago

Original Redmine Comment Author Name: Kai Blin Original Date: 2012-07-03T14:24:15Z


I don’t think our SeqIO parsers/writers promise to keep the file unchanged. As long as we don’t lose information, I’m not too concerned. Assuming that the characters missing from the right side of both outputs are an artifact of the diff and not actually missing from the GenBank file, I don’t see any data being lost. The duplicated CONSRTM output seems worth looking into. I’ll open a separate bug report for it if I manage to reproduce it.
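Whether information was lost can be checked independently of qualifier order. The sketch below is a minimal illustration under assumptions: the helper names are hypothetical, and it expects the qualifiers of one feature to be already collected as `[tag, value]` pairs in file order (with BioPerl, one way to gather them would be `get_all_tags` and `get_tag_values` on each feature). The sample data is the NCU05900 CDS from the diff above, with the translation shortened for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a feature's qualifiers into an order-independent
# signature. Each qualifier is a [tag, value] pair, in file order.
sub qualifier_signature {
    my (@pairs) = @_;
    return join "\n", sort map { "$_->[0]=$_->[1]" } @pairs;
}

# True if both features carry the same qualifiers, regardless of order.
sub same_qualifiers {
    my ( $before, $after ) = @_;
    return qualifier_signature(@$before) eq qualifier_signature(@$after);
}

# Qualifiers of the NCU05900 CDS as they appear in the input file
# (translation shortened here for illustration).
my @input = (
    [ locus_tag   => 'NCU05900' ],
    [ codon_start => 1 ],
    [ product     => 'hypothetical protein' ],
    [ protein_id  => 'WGS:AABX:NCU05900T0' ],
    [ translation => 'MSKSPHVSPDVRSSPPDLLPPPSYTE' ],
);

# Same qualifiers after the round trip: /product now comes last.
my @output = ( @input[ 0, 1, 3, 4 ], $input[2] );

print same_qualifiers( \@input, \@output )
    ? "qualifiers preserved\n"
    : "qualifiers differ\n";
```

Sorting the `tag=value` strings before comparing means a reordered qualifier list compares as equal, while a dropped or altered qualifier does not.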

cjfields commented 8 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2012-07-17T19:33:55Z


We do not guarantee reproducibility with I/O, only that the data itself is maintained. FWIW, output has never been a primary focus; the focus is that data is accurately parsed and accessible in a reproducible manner across formats (having decent output is more a side effect). We do try to support it when we can and when it is realistic to do so w/o affecting our (admittedly terrible) performance.