Open cjfields opened 8 years ago
Original Redmine Comment Author Name: Rodolfo Aramayo Original Date: 2012-06-26T04:59:16Z
When I run the following script:
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my $informat  = 'genbank';
my $outformat = 'genbank';

# Read the original GenBank file
my $in = Bio::SeqIO->new(
    -format => $informat,
    -file   => './ncrassa_wt_LinkageGroup07.gbk',
);

# Write every record back out in the same format
my $out = Bio::SeqIO->new(
    -format => $outformat,
    -file   => '>./ncrassa_wt_LinkageGroup07.gbk.gbk',
);

while (my $seq = $in->next_seq) {
    $out->write_seq($seq);
}
I get the following (displayed using: diff -y ncrassa_wt_LinkageGroup07.gbk ncrassa_wt_LinkageGroup07.gbk.gbk | less):
LOCUS LinkageGroup_7 4255303 bp DNA linear | LOCUS LinkageGroup_7 4255303 bp DNA linear
DEFINITION Neurospora crassa strain OR74A chromosome Linkage DEFINITION Neurospora crassa strain OR74A chromosome Linkage
LinkageGroup_7, whole genome shotgun sequence. LinkageGroup_7, whole genome shotgun sequence.
ACCESSION | ACCESSION unknown
VERSION <
KEYWORDS WGS. KEYWORDS WGS.
SOURCE Neurospora crassa SOURCE Neurospora crassa
ORGANISM Neurospora crassa ORGANISM Neurospora crassa
Unclassified. Unclassified.
REFERENCE 1 (bases 1 to 4255303) REFERENCE 1 (bases 1 to 4255303)
AUTHORS Galagan,J., Sachs,M., Young,S.K., Zeng,Q., Koehrs AUTHORS Galagan,J., Sachs,M., Young,S.K., Zeng,Q., Koehrs
Alvarado,L., Berlin,A., Bochicchio,J., Borenstein | Alvarado,L., Berlin,A., Bochicchio,J., Borenstein
Chen,Z., Engels,R., Freedman,E., Gellesch,M., Gol | Chapman,S.B., Chen,Z., Engels,R., Freedman,E., Ge
Griggs,A., Gujja,S., Heilman,E., Heiman,D., Hepbu | Goldberg,J., Griggs,A., Gujja,S., Heilman,E., Hei
Jen,D., Larson,L., Lewis,B., Mehta,T., Park,D., P | Hepburn,T., Howarth,C., Jen,D., Larson,L., Lewis,
Richards,J., Roberts,A., Saif,S., Shea,T., Shenoy | Park,D., Pearson,M., Richards,J., Roberts,A., Sai
Stolte,C., Sykes,S., Thomson,T., Walk,T., White,J | Shenoy,N., Sisk,P., Stolte,C., Sykes,S., Thomson,
Haas,B., Nusbaum,C. and Birren,B. | White,J., Yandava,C., Haas,B., Nusbaum,C. and Bir
CONSRTM The Broad Institute Genome Sequencing Platform CONSRTM The Broad Institute Genome Sequencing Platform
TITLE The Genome Sequence of Neurospora crassa strain O TITLE The Genome Sequence of Neurospora crassa strain O
JOURNAL Unpublished JOURNAL Unpublished
REFERENCE 2 (bases 1 to 4255303) REFERENCE 2 (bases 1 to 4255303)
AUTHORS Galagan,J., Sachs,M., Young,S.K., Zeng,Q., Koehrs AUTHORS Galagan,J., Sachs,M., Young,S.K., Zeng,Q., Koehrs
Alvarado,L., Berlin,A., Bochicchio,J., Borenstein | Alvarado,L., Berlin,A., Bochicchio,J., Borenstein
Chen,Z., Engels,R., Freedman,E., Gellesch,M., Gol | Chapman,S.B., Chen,Z., Engels,R., Freedman,E., Ge
Griggs,A., Gujja,S., Heilman,E., Heiman,D., Hepbu | Goldberg,J., Griggs,A., Gujja,S., Heilman,E., Hei
Jen,D., Larson,L., Lewis,B., Mehta,T., Park,D., P | Hepburn,T., Howarth,C., Jen,D., Larson,L., Lewis,
Richards,J., Roberts,A., Saif,S., Shea,T., Shenoy | Park,D., Pearson,M., Richards,J., Roberts,A., Sai
Stolte,C., Sykes,S., Thomson,T., Walk,T., White,J | Shenoy,N., Sisk,P., Stolte,C., Sykes,S., Thomson,
Haas,B., Nusbaum,C. and Birren,B. | White,J., Yandava,C., Haas,B., Nusbaum,C. and Bir
CONSRTM The Broad Institute Genome Sequencing Platform | CONSRTM The Broad Institute Genome Sequencing Platform Th
> Genome Sequencing Platform
TITLE Direct Submission TITLE Direct Submission
JOURNAL Submitted (02-MAR-2011) Broad Institute of MIT an JOURNAL Submitted (02-MAR-2011) Broad Institute of MIT an
Cambridge Center, Cambridge, MA 02142, USA Cambridge Center, Cambridge, MA 02142, USA
FEATURES Location/Qualifiers FEATURES Location/Qualifiers
source 1..4255303 source 1..4255303
/organism="Neurospora crassa" <
/mol_type="genomic DNA" /mol_type="genomic DNA"
/strain="OR74A" /strain="OR74A"
/chromosome="Linkage Group VII" /chromosome="Linkage Group VII"
> /organism="Neurospora crassa"
and
CDS complement(33866..34930) CDS complement(33866..34930)
/locus_tag="NCU05900" /locus_tag="NCU05900"
/codon_start=1 /codon_start=1
/product="hypothetical protein" <
/protein_id="WGS:AABX:NCU05900T0" /protein_id="WGS:AABX:NCU05900T0"
/translation="MSKSPHVSPDVRSSPPDLLPPPSYTE /translation="MSKSPHVSPDVRSSPPDLLPPPSYTE
VPIPTRTGESPLTTHLRTIPSRLRSAQHSHSTAQSSRDAF VPIPTRTGESPLTTHLRTIPSRLRSAQHSHSTAQSSRDAF
MPKTPKVAELVLVPTEGLPGVESSEGATSTGKELARKRAE MPKTPKVAELVLVPTEGLPGVESSEGATSTGKELARKRAE
EVVKVVCVSAAPDQGRQVTDEKGRTVDRKRAGEDSGSGSA EVVKVVCVSAAPDQGRQVTDEKGRTVDRKRAGEDSGSGSA
TSPSSSRSQFGPEEAWNWFATPTLARRIASLLRPEPTLAR TSPSSSRSQFGPEEAWNWFATPTLARRIASLLRPEPTLAR
KKSGFGSFFRRTSKSEPQTPTPTERLLTPVRDGAMLEQDR KKSGFGSFFRRTSKSEPQTPTPTERLLTPVRDGAMLEQDR
GLWESVGGWGVVVSIKVGRL" GLWESVGGWGVVVSIKVGRL"
> /product="hypothetical protein"
gene 36560..38546 gene 36560..38546
/locus_tag="NCU05899" /locus_tag="NCU05899"
mRNA join(36560..36666,36734..38546) mRNA join(36560..36666,36734..38546)
/locus_tag="NCU05899" /locus_tag="NCU05899"
/product="flotillin domain-containing pr /product="flotillin domain-containing pr
/transcript_id="WGS:AABX:mrna_NCU05899T0 /transcript_id="WGS:AABX:mrna_NCU05899T0
CDS join(36600..36666,36734..38244) CDS join(36600..36666,36734..38244)
/locus_tag="NCU05899" /locus_tag="NCU05899"
/codon_start=1 /codon_start=1
/product="flotillin domain-containing pr <
/protein_id="WGS:AABX:NCU05899T0" /protein_id="WGS:AABX:NCU05899T0"
/translation="MASYKIAAPDEYLAITGMGVKTLKIT /translation="MASYKIAAPDEYLAITGMGVKTLKIT
HDYAMSLQAMTKEKLQFLLPVVFTVGPDVNQRGANIRMFH HDYAMSLQAMTKEKLQFLLPVVFTVGPDVNQRGANIRMFH
SAVRREDRGDALMKFAMLLADSGRDKGPNNHDFLEGIVKG SAVRREDRGDALMKFAMLLADSGRDKGPNNHDFLEGIVKG
FSEREVFKRRIFRNIQSELDQFGLKIYNANVKELKDAPGS FSEREVFKRRIFRNIQSELDQFGLKIYNANVKELKDAPGS
RIDVAEAQLRGNVGTQKRKGEEAREVAKIQGEQDRELAKI RIDVAEAQLRGNVGTQKRKGEEAREVAKIQGEQDRELAKI
EALLKTRQVELDRDVQIAGIQAARNTEAEDETLKREVQIK EALLKTRQVELDRDVQIAGIQAARNTEAEDETLKREVQIK
IAREAKQQAADAKAYEIEKEAQANYEKAKQHTEADVYETK IAREAKQQAADAKAYEIEKEAQANYEKAKQHTEADVYETK
QQKLRAAEGMSAMAEAYAKMSHAFGGPQGLLQYMMIEKGT QQKLRAAEGMSAMAEAYAKMSHAFGGPQGLLQYMMIEKGT
ISVWNTGAEAGSSGGAGEQQSSMATMRNIYQMLPPLMTTI ISVWNTGAEAGSSGGAGEQQSSMATMRNIYQMLPPLMTTI
QMSEVERRGQASNGQKE" QMSEVERRGQASNGQKE"
> /product="flotillin domain-containing pr
gene 52953..55087 gene 52953..55087
That is, the “/product=” qualifier jumps to the end of the feature, after the translated sequence.
Original Redmine Comment Author Name: Rodolfo Aramayo Original Date: 2012-06-26T05:05:12Z
APOLOGIES for previous formatting…first report…
Original Redmine Comment Author Name: Kai Blin Original Date: 2012-07-03T14:24:15Z
I don’t think our SeqIO parsers/writers promise to keep the file unchanged. As long as we don’t lose information, I’m not too concerned. Assuming that the characters missing from the right side of both outputs are an artifact of the diff and not actually missing from the genbank file, I don’t see any data being lost. The duplicated CONSRTM output seems worth looking into. I’ll open a separate bug report for this if I manage to reproduce this.
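One way to check the "no information lost" claim independently of qualifier order is sketched below. It uses plain Perl on the raw text of a feature block, not BioPerl, and the helper names `qualifier_list` and `same_qualifiers` are hypothetical. The sample data is the "source" feature from the diff above, in the original and the rewritten order. Whitespace inside values is ignored, so differences in line wrapping do not count as changes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect the qualifiers of one feature block into a
# sorted list, so two blocks can be compared regardless of qualifier order.
sub qualifier_list {
    my ($block) = @_;
    my @quals;
    for my $line (split /\n/, $block) {
        $line =~ s/^\s+|\s+$//g;      # trim surrounding whitespace
        next unless length $line;
        if ($line =~ m{^/}) {
            push @quals, $line;       # a new qualifier starts with "/"
        }
        elsif (@quals) {
            $quals[-1] .= $line;      # continuation of the previous qualifier
        }
    }
    s/\s+//g for @quals;              # ignore whitespace and line wrapping
    return [ sort @quals ];
}

# True if both blocks carry the same qualifiers, in any order.
sub same_qualifiers {
    my ($first, $second) = @_;
    my $left  = join "\n", @{ qualifier_list($first) };
    my $right = join "\n", @{ qualifier_list($second) };
    return $left eq $right;
}

# Qualifiers of the "source" feature as they appear in the input file ...
my $original = <<'END';
/organism="Neurospora crassa"
/mol_type="genomic DNA"
/strain="OR74A"
/chromosome="Linkage Group VII"
END

# ... and as they appear after the round trip.
my $rewritten = <<'END';
/mol_type="genomic DNA"
/strain="OR74A"
/chromosome="Linkage Group VII"
/organism="Neurospora crassa"
END

print same_qualifiers($original, $rewritten)
    ? "same qualifiers, different order\n"
    : "qualifiers differ\n";
```

For these two blocks the comparison succeeds, which matches the reading that only the order changed.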
Original Redmine Comment Author Name: Chris Fields Original Date: 2012-07-17T19:33:55Z
We do not guarantee reproducibility with I/O, only that the data itself is maintained. FWIW, output has never been a primary focus, only that data is accurately parsed and accessible in a reproducible manner across formats (having decent output is more a side effect). We do try to support it when we can and when it is realistic to do so w/o affecting our (admittedly terrible) performance.
Author Name: Kai Blin (@kblin) Original Redmine Issue: 3325, https://redmine.open-bio.org/issues/3325 Original Date: 2012-02-10 Original Assignee: Kai Blin
While playing a bit with GenBank files, I noticed that our generator code doesn’t quite follow the GenBank release notes spec. For example, a GenBank record always has to contain a “source” feature, and the last character of the DEFINITION field has to be a period; our generated files ensure neither of these.
Nothing too serious, so opening a bug to remind me to fix it.
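The two rules mentioned above could be checked on generated output with something like the following. This is only a sketch in plain Perl that works on the raw text of a single record. The function name `genbank_problems` and the sample record are hypothetical and not part of BioPerl. It assumes the definition line uses the `DEFINITION` keyword and that feature keys are indented by five spaces, as in the flat-file layout.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical checker: returns a list of problems found in one record.
sub genbank_problems {
    my ($record) = @_;
    my ($definition, $in_definition, $in_features, $has_source);

    for my $line (split /\n/, $record) {
        if ($line =~ /^DEFINITION\s+(.*)/) {
            ($definition, $in_definition) = ($1, 1);
        }
        elsif ($in_definition && $line =~ /^\s+(\S.*)/) {
            $definition .= " $1";     # continuation line of DEFINITION
        }
        else {
            $in_definition = 0;
            $in_features = 1 if $line =~ /^FEATURES\b/;
            $in_features = 0 if $line =~ /^ORIGIN\b/;
            # Feature keys start after exactly five spaces (assumption).
            $has_source = 1 if $in_features && $line =~ /^ {5}source\s/;
        }
    }

    my @problems;
    push @problems, 'DEFINITION does not end with a period'
        unless defined $definition && $definition =~ /\.\s*$/;
    push @problems, 'no "source" feature'
        unless $has_source;
    return @problems;
}

# Made-up record that breaks both rules.
my $bad_record = <<'END';
LOCUS       EXAMPLE 10 bp DNA linear
DEFINITION  Hypothetical example record
            without a trailing period
FEATURES             Location/Qualifiers
     gene            1..10
ORIGIN
//
END

print "$_\n" for genbank_problems($bad_record);
```

A writer-side fix would more likely live in the GenBank output code itself; a check like this is mainly useful as a regression test on generated files.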