bioperl / bioperl-live

Core BioPerl 1.x code
http://bioperl.org

The DDBJ/ENA/GenBank accession number change #301

Open heikkil opened 5 years ago

heikkil commented 5 years ago

https://ncbiinsights.ncbi.nlm.nih.gov/2018/12/03/adapting-flatfile-parsers-genbank-new-accession-formats/

"the LOCUS line, includes the “Locus Name” (usually identical to the accession number), which may now grow to as long as 20 characters."

"See section 3.4.4 of the GenBank release notes for examples of how the LOCUS line might change." https://ftp.ncbi.nlm.nih.gov/genbank/gbrel.txt

From our internal testing, it appears BioPython and BioPerl properly handle most of the examples shown in section 3.4.4, and only have issues with the last theoretical examples where the sequence length no longer ends at position 40. We do recommend adjusting code to accommodate those theoretical examples for future-proofing.
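One way to cope with a sequence length that no longer ends at position 40 is to split the LOCUS line on whitespace instead of reading fixed columns. The sketch below is illustrative only: it is not the BioPerl or BioPython implementation, `parse_locus_line` is a hypothetical helper, and the sample line is made up for illustration rather than copied from section 3.4.4. It assumes the locus name contains no spaces and that the length is followed by a `bp` or `aa` unit.

```python
def parse_locus_line(line):
    """Parse a GenBank LOCUS line by whitespace rather than fixed columns.

    Hypothetical helper: assumes the locus name has no embedded spaces and
    that the third field is the sequence length, followed by 'bp' or 'aa'.
    """
    fields = line.split()
    if len(fields) < 4 or fields[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % line)

    name, length, unit = fields[1], fields[2], fields[3]
    if unit not in ("bp", "aa"):
        raise ValueError("unexpected length unit: %r" % unit)

    return {
        "name": name,
        "length": int(length),  # raises ValueError if not an integer
        "unit": unit,
        "rest": fields[4:],     # molecule type, topology, division, date
    }


# Made-up line with a 15-character locus name; the column positions of the
# remaining fields are deliberately not those of the classic layout.
example = "LOCUS       AAAAAA020000001 5028 bp    DNA     linear   PLN 21-JUN-1999"
record = parse_locus_line(example)
```

A whitespace split trades strictness for tolerance: it accepts names of any length, but it cannot detect a line whose name and length have run together without a separating space.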

https://ncbiinsights.ncbi.nlm.nih.gov/2018/09/19/genbank-expanded-accession-formats/

https://ftp.ncbi.nlm.nih.gov/genbank/gbrel.txt

1.4 Upcoming Changes

1.4.1 Changes to nucleotide and protein accession formats

By the end of 2018 the INSDC members plan to expand this format, using a six-letter Project Code prefix, two-digit Assembly-Version number, followed by 7, 8, or 9 digits. An example of such an accession is AAAAAA020000001.
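The expanded format described above (six letters, two digits, then 7 to 9 digits) can be checked with a single regular expression. This is a minimal sketch, not code from BioPerl or NCBI; `split_expanded_accession` and the key names it returns are hypothetical, and the assumption that the prefix letters are uppercase ASCII comes from the example accession, not from the release notes text.

```python
import re

# Six-letter Project Code, two-digit Assembly-Version, then 7, 8, or 9 digits.
# Uppercase-only letters are an assumption based on the example accession.
EXPANDED_ACCESSION = re.compile(r"([A-Z]{6})([0-9]{2})([0-9]{7,9})")


def split_expanded_accession(accession):
    """Return the parts of an expanded accession, or None if it does not match.

    Hypothetical helper; the dictionary keys are illustrative names.
    """
    match = EXPANDED_ACCESSION.fullmatch(accession)
    if match is None:
        return None
    project_code, assembly_version, serial = match.groups()
    return {
        "project_code": project_code,
        "assembly_version": assembly_version,
        "serial": serial,
    }


parts = split_expanded_accession("AAAAAA020000001")
```

Using `fullmatch` rather than `match` rejects strings with trailing characters, so a 10-digit serial or an appended version suffix such as `.1` is not silently accepted.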

cjfields commented 5 years ago

@heikkil just came here to add the same thing 😄

peterjc commented 5 years ago

Cross reference https://github.com/biopython/biopython/issues/1870