JATS4R / JATS4R-Participant-Hub

The hub for all JATS4R meeting notes, examples, draft recommendations, documents, and issues.
http://jats4r.org

NCBI is phasing out sequence GIs - use Accession.Version instead! #122

Melissa37 closed 6 years ago

Melissa37 commented 8 years ago

As of September 2016, the integer sequence identifiers known as "GIs" will no longer be included in the GenBank, GenPept, and FASTA formats supported by NCBI for sequence records. The FASTA header will be further simplified to report only the sequence accession.version and record title for accessions managed by the International Sequence Database Collaboration (INSDC) and NCBI’s Reference Sequence (RefSeq) project. As NCBI makes this transition, we encourage any users who have workflows that depend on GIs to begin planning to use accession.version identifiers instead. After September 2016, any processes solely dependent on GIs will no longer function as expected.

GI numbers have been in use since GenBank release 81.0 (February 1994) as an additional identifier to the accession number to stably refer to a specific version of a sequence record. Version tracking was added to accession numbers in 1997 as an integer suffix that increments with each update to the sequence data within a record. For example, “AC020606.7” indicates that the sequence content of the record has been updated six times since the first release. Thus, sequence versioning information has been provided in a redundant fashion through both the GI and the accession.version. In the past decade, NCBI has continued to receive submissions of new or updated sequences at a rapidly increasing rate. In response to this, we have had to develop new data storage solutions that use accession.version information, rather than GI information, to track updates. Current examples of sequences that lack a GI include unannotated contigs in WGS and TSA projects. This results in a situation where we are conveying version information inconsistently.
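To make the accession.version convention above concrete, here is a minimal Python sketch that splits an identifier into its accession and version and derives how many times the sequence has been updated. The function names and the regular expression are illustrative assumptions, not an official NCBI grammar; real accessions may follow patterns this simple check does not cover.

```python
import re

# Assumed shape: an accession made of letters, digits, and underscores,
# then a dot, then a positive integer version. This is a simplification.
_ACC_VER = re.compile(r"^([A-Za-z0-9_]+)\.([1-9]\d*)$")


def split_accession_version(identifier):
    """Return (accession, version) for a string such as 'AC020606.7'."""
    match = _ACC_VER.match(identifier.strip())
    if match is None:
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return match.group(1), int(match.group(2))


def updates_since_first_release(identifier):
    """Version 1 is the first release, so version N implies N - 1 updates."""
    _, version = split_accession_version(identifier)
    return version - 1


accession, version = split_accession_version("AC020606.7")
print(accession, version, updates_since_first_release("AC020606.7"))
```

For the example from the text, "AC020606.7" yields accession AC020606, version 7, and six updates.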

Given both the continued increase in the volume of data submissions and the growing inconsistency in record presentation, it is time for us to take the next step and remove the older, redundant GI identifiers and retain a single identifier for sequence versions, the more human-readable accession.version. This change will simplify the process of tracking sequences without any loss of functionality. It will also simplify scientific communications by promoting use of accession.version as the preferred sequence identifier. Therefore, over the coming months we will stop assigning GIs to an increasing number of new sequences. Sequence records with existing GIs will retain them in some presentation formats, such as ASN.1 and the 5-column feature table format, but the GI value will no longer be displayed in other presentation formats, including GenBank flat file and FASTA formats. NCBI services that accept GIs as input will continue to be supported, and NCBI will be adding support for accession.version identifiers to all services that currently do not support them.

This transition to stop assigning and reporting GIs was first described in the Release Notes for GenBank 199.0 in December 2013 and also described in a recent GenBank update. Please see Section 1.4.1 of the current GenBank release notes for background information: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt

The FASTA display for all sequence records exchanged by the INSDC and for all NCBI RefSeq records will also be changed to report only the accession.version and the record title. This will improve compatibility with other file types provided by NCBI, including GFF3, Gene, and dbSNP download files. This FASTA format change has already been made on data available from the redesigned genomes FTP site, based on user requests to have a single consistent sequence identifier for both GFF3 and FASTA formats. This change was announced previously (http://www.ncbi.nlm.nih.gov/news/08-26-2014-new-genomes-FTP-live/). At this time, we plan to continue to provide database source information in the FASTA display of sequences from non-INSDC and non-RefSeq sources, including SwissProt, PDB structures, PIR, and patent sequences.
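For workflows that need to reconcile older FASTA files with the simplified header described above, a small Python sketch might look like the following. It assumes the legacy pipe-delimited defline layout `>gi|<number>|<db>|<accession.version>|<optional locus> <title>` and rewrites only INSDC and RefSeq entries (database tags `gb`, `emb`, `dbj`, `ref`), leaving other sources untouched. The function name `simplify_defline` is hypothetical, and the GI number and title in the example are made-up placeholders.

```python
import re

# Legacy defline for INSDC/RefSeq records (assumed layout):
#   >gi|<number>|<gb|emb|dbj|ref>|<accession.version>|<optional locus> <title>
_LEGACY = re.compile(
    r"^>gi\|\d+\|(?:gb|emb|dbj|ref)\|([^|\s]+)\|\S*\s*(.*)$"
)


def simplify_defline(line):
    """Rewrite a legacy GI-style defline as '>accession.version title'.

    Lines that do not match the assumed legacy layout (including
    non-INSDC, non-RefSeq sources) are returned unchanged.
    """
    line = line.rstrip("\n")
    match = _LEGACY.match(line)
    if match is None:
        return line
    accession_version, title = match.groups()
    return f">{accession_version} {title}".rstrip()


# The GI number and title below are placeholders, not real record data.
old = ">gi|123456|gb|AC020606.7| placeholder record title"
print(simplify_defline(old))
```

Keeping the rewrite restricted to the four listed database tags mirrors the statement that other sources will continue to carry database source information in their FASTA display.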

After September 2016, these changes will start to appear on NCBI web views of flat file and FASTA format sequence data, NCBI programming utilities results, and GenBank and RefSeq comprehensive FTP releases.