Support of 'GenBank' format

Background

GenBank format ("GenBank Flat File Format") is a genomic data format that describes nucleotide sequences and their protein translations. A GenBank file (.genbank, .gb, .gbk) is a text file that contains both an annotation section and a sequence section. It would be useful to support this file format in NGB, so that such files can be loaded as reference sequences.

Approach

Format details

Each GenBank file consists of two parts:

annotation section. The start of the annotation section is marked by a line beginning with the word LOCUS
sequence section. The sequence section is placed strictly below the annotation section. Its start is marked by a line beginning with the word ORIGIN, and its end is marked by a line containing only //.

Example of the GenBank file:

LOCUS       AF068625                 200 bp    mRNA    linear   ROD 06-DEC-1999
DEFINITION  Mus musculus DNA cytosine-5 methyltransferase 3A (Dnmt3a) mRNA,
            complete cds.
ACCESSION   AF068625 REGION: 1..200
VERSION     AF068625.2  GI:6449467
KEYWORDS    .
SOURCE      Mus musculus (house mouse)
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muroidea; Muridae; Murinae; Mus.
REFERENCE   1  (bases 1 to 200)
  AUTHORS   Okano,M., Xie,S. and Li,E.
  TITLE     Cloning and characterization of a family of novel mammalian DNA
            (cytosine-5) methyltransferases
  JOURNAL   Nat. Genet. 19 (3), 219-220 (1998)
   PUBMED   9662389
FEATURES             Location/Qualifiers
     source          1..200
                     /organism="Mus musculus"
                     /mol_type="mRNA"
                     /db_xref="taxon:10090"
                     /chromosome="12"
                     /map="4.0 cM"
     gene            1..>200
                     /gene="Dnmt3a"
ORIGIN      
        1 gaattccggc ctgctgccgg gccgcccgac ccgccgggcc acacggcaga gccgcctgaa
       61 gcccagcgct gaggctgcac ttttccgagg gcttgacatc agggtctatg tttaagtctt
      121 agctcttgct tacaaagacc acggcaattc cttctctgaa gccctcgcag ccccacagcg
      181 ccctcgcagc cccagcctgc
//
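
As an illustration of the two-part layout, the following Python sketch splits one record into its annotation lines and its sequence lines using only the LOCUS, ORIGIN and // markers. The function name `split_genbank_record` is hypothetical and not part of NGB, and files with several records are not handled.

```python
def split_genbank_record(text: str) -> tuple[list[str], list[str]]:
    """Split one GenBank record into (annotation lines, sequence lines)."""
    annotation: list[str] = []
    sequence: list[str] = []
    section = None  # None until LOCUS is seen, then "annotation" or "sequence"
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            section = "annotation"
        elif line.startswith("ORIGIN"):
            section = "sequence"
            continue  # the ORIGIN line itself carries no sequence data
        elif line.strip() == "//":
            break  # end of the record
        if section == "annotation":
            annotation.append(line)
        elif section == "sequence":
            sequence.append(line)
    return annotation, sequence
```

The annotation lines still contain the feature table; the sequence lines still carry their position numbers.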

Sequence section

This section contains a nucleotide or amino acid sequence. The sequence data begins on the line immediately below the ORIGIN line. The sequence is split into rows of 60 symbols each. Each row begins with the position, within the whole sequence, of the row's first symbol.
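
A minimal sketch of reading such rows, assuming each row starts with its position number followed by whitespace-separated chunks of the sequence. `parse_origin_rows` is a hypothetical helper: it concatenates the chunks and checks every position number against the length read so far.

```python
def parse_origin_rows(rows: list[str]) -> str:
    """Join the rows of a sequence section into one sequence string."""
    parts: list[str] = []
    length = 0
    for row in rows:
        fields = row.split()
        if not fields:
            continue  # skip blank lines
        position, chunks = int(fields[0]), fields[1:]
        if position != length + 1:
            raise ValueError(f"row starts at {position}, expected {length + 1}")
        residues = "".join(chunks)
        parts.append(residues)
        length += len(residues)
    return "".join(parts)
```

The sketch does not enforce the 60-symbol row width, so a short last row is accepted without special handling.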

To implement:

support registration of a GenBank file as the reference sequence
all NGB CLI reference commands should support GenBank files in the same way as they support FASTA files now; no additional options are required
for now, only nucleotide sequences should be supported. An attempt to register a GenBank file with an amino acid sequence should produce a corresponding error message
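
One possible way to reject amino acid records is to look at the length unit on the LOCUS line. This sketch assumes that nucleotide records use `bp` there (as in the example above) and that protein records use `aa`. The function name and the error texts are illustrative only, not NGB's actual messages.

```python
def check_nucleotide_record(locus_line: str) -> None:
    """Raise ValueError unless the LOCUS line describes a nucleotide sequence."""
    tokens = locus_line.split()
    if len(tokens) < 4 or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    unit = tokens[3]  # assumed layout: LOCUS <name> <length> <unit> ...
    if unit == "aa":
        raise ValueError("GenBank files with amino acid sequences are not supported")
    if unit != "bp":
        raise ValueError(f"unexpected sequence length unit: {unit!r}")
```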

Annotation section and feature table

The annotation section contains a number of different data elements, including locus information, a general description, a unique sequence ID, organism information, reference(s) to publications linked with the current sequence, the feature table, etc. For the full list of possible elements (data fields) in the annotation section, see here.

The most important part of the annotation section is the feature table. This table describes the roles and locations of higher-order sequence domains and elements. The start of the feature table is marked by a line that is also the table header:

FEATURES             Location/Qualifiers

The table has its own specific format.

Short details:

format design is based on a tabular approach and consists of the following items:
- Feature key - a single word or abbreviation indicating functional group of the feature
- Location - instructions for finding the feature (indicates the region of the presented sequence which corresponds to a certain feature)
- Qualifiers - auxiliary information about a feature
each feature must contain a feature key:
- feature keys are placed in the left column of the table
- all allowable feature keys are defined in the format
- each record (GenBank file) must contain at least the source key, which identifies the biological source of the specified span of the sequence:
  - more than one source feature key per sequence is allowed
  - the file shall have, as a minimum, either a single source key spanning the entire sequence or multiple source keys which together span the entire sequence
- other feature key examples:
  - CDS - coding sequence: the sequence of nucleotides that corresponds to the sequence of amino acids in a protein (the location includes the stop codon); the feature includes the conceptual amino acid translation
  - exon - region of the genome that codes for a portion of spliced mRNA, rRNA and tRNA
  - gene - region of biological interest identified as a gene and for which a name has been assigned
  - intron - a segment of DNA that is transcribed, but removed from within the transcript by splicing together the sequences (exons) on either side of it
  - misc_feature - region of biological interest which cannot be described by any other feature key
  - mRNA - messenger RNA
  - ncRNA - a non-protein-coding gene, other than ribosomal RNA and transfer RNA, the functional molecule of which is the RNA transcript
  - prim_transcript - primary (initial, unprocessed) transcript
  - tmRNA - transfer messenger RNA
  - tRNA - mature transfer RNA
  - variation - a related strain contains stable mutations from the same gene which differ from the presented sequence at this location
the location on the sequence must be specified for each feature key:
- the location contains at least one sequence location descriptor (base number(s) or identifier(s), a site between them, etc.) and may contain one or more operators with one or more sequence location descriptors (an operator is a prefix that specifies what must be done to the indicated sequence to find or construct the location corresponding to the feature)
- the location is specified on the same line as its feature key, in the right column
- several examples of locations:
  - 340..565 - points to a continuous range of bases bounded by and including the starting and ending bases
  - <345..500 - indicates that the exact lower boundary point of the feature is unknown
  - 102.110 - indicates that the exact location is unknown but that it is one of the bases between bases 102 and 110, inclusive
  - join(12..78,134..202) - regions 12 to 78 and 134 to 202 should be joined to form one contiguous sequence
  - complement(34..126) - start at the base complementary to 126 and finish at the base complementary to base 34
qualifiers provide a general mechanism for supplying information about features in addition to that conveyed by the key and location:
- qualifiers can be mandatory or optional; not every feature key has to have qualifiers
- each feature key has its own list of allowable qualifiers
- qualifiers take the form of a slash (/) followed by the qualifier name and, if applicable, an equal sign (=) and a value
- each qualifier should have a single value; if multiple values are necessary, they should be represented by repeating the same qualifier
- qualifiers of a feature key are specified in the right column of the table, under the location line
- several examples of qualifiers:
  - /organism="Mus musculus"
  - /strain="CD1"
  - /mol_type="genomic DNA"
  - /gene="ubc42"

For the full description of the feature table format, possible feature keys and qualifiers, location descriptors and operators, see here.
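
The key / location / qualifiers layout can be read with a small line-based parser. The sketch below assumes the column layout seen in the example file: the feature key sits within the first 21 columns and the location or qualifier text starts at column 22. `Feature` and `parse_feature_table` are hypothetical names. Continuation lines of a qualifier value are joined with a single space, which is a simplification (long `/translation` values, for instance, would need joining without spaces).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Feature:
    key: str
    location: str
    # (name, value) pairs in file order; value is None for flags such as /pseudo
    qualifiers: list[tuple[str, Optional[str]]] = field(default_factory=list)


def parse_feature_table(lines: list[str]) -> list[Feature]:
    features: list[Feature] = []
    in_table = False
    for line in lines:
        if line.startswith("FEATURES"):
            in_table = True  # table header line
            continue
        if not in_table or not line.strip():
            continue
        if not line[0].isspace():
            break  # the next top-level keyword (e.g. ORIGIN) ends the table
        key, body = line[:21].strip(), line[21:].strip()
        if key:
            features.append(Feature(key=key, location=body))
        elif not features:
            continue
        elif body.startswith("/"):
            name, has_value, value = body[1:].partition("=")
            features[-1].qualifiers.append((name, value.strip('"') if has_value else None))
        elif features[-1].qualifiers:
            # continuation line of a multi-line qualifier value
            name, value = features[-1].qualifiers[-1]
            tail = body.strip('"')
            features[-1].qualifiers[-1] = (name, f"{value or ''} {tail}")
        else:
            features[-1].location += body  # continuation of a long location
    return features
```

Applied to the example file above, this would yield a `source` feature with location `1..200` and a `gene` feature with location `1..>200` and the qualifier `gene` set to `Dnmt3a`.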

To implement:

support registration of a GenBank file as the gene file for the reference sequence:
- separately, via the add_genes command
- together with the reference, via the reg_ref command (using the --genes option)
- all NGB CLI commands that add gene files should support GenBank files in the same way as they support GTF files now; no additional options are required
support displaying the GenBank file on the GENE track in the "Browser" panel
- the view should be similar to the GTF GENE track:
  - the gene itself (for "Collapsed" and "Expanded" views) shall be defined by the gene feature key:
    - positions on the track should be taken from the location field of the gene record
    - the gene name should be taken from the value of the /gene qualifier of the gene record
    - qualifiers of the gene record and their values should be shown in the tooltip
  - other features (with other feature keys, e.g. CDS, mRNA, ncRNA):
    - shall be identified by the presence of the /gene qualifier
    - shall be shown only in the "Expanded" view
    - positions on the track should be taken from the location field of the feature record
    - qualifiers of the feature record and their values should be shown in the tooltip
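
The grouping behind the two views could look like the sketch below. It assumes features are available as plain dictionaries with `key`, `location` and `qualifiers` entries, which is a hypothetical in-memory shape and not NGB's data model. Gene records become track items, and every other feature carrying a /gene qualifier is attached to the gene of the same name.

```python
from collections import defaultdict


def group_by_gene(features: list[dict]) -> dict[str, dict]:
    """Map gene name -> {"gene": gene record, "children": other features}."""
    genes: dict[str, dict] = {}
    children: dict[str, list[dict]] = defaultdict(list)
    for feature in features:
        name = feature["qualifiers"].get("gene")
        if name is None:
            continue  # features without /gene (e.g. source) are not shown
        if feature["key"] == "gene":
            genes[name] = feature
        else:
            children[name].append(feature)
    return {name: {"gene": gene, "children": children[name]} for name, gene in genes.items()}
```

A "Collapsed" view would then draw only the `gene` entries, while an "Expanded" view would also draw each gene's `children`.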

The server cannot query the GenBank file format to get the sequence or features in a given interval, so we need to preprocess these files into formats that support indexing and queries: FASTA for the sequence and GFF/GTF for features. Could you please describe a mapping from GenBank features to the GFF/GTF file format?

General details of mapping the GenBank format to gff3.

The common workflow should be:

go through the feature table
transform each feature into one or several records in gff format:
- the source record should be omitted
- for each operon, gene, or transcript (any *RNA) feature, a single record in gff format should be created according to the rules in the table below, with the minimal position from the location value as the "Start" position and the maximal position as the "End" position
- for any other feature:
  - if the feature location consists of one part (without the join operator), a single record in gff format should be created according to the rules in the table below, with the minimal position from the location value as the "Start" position and the maximal position as the "End" position
  - if the feature location consists of several parts (using the join operator), several records in gff format should be created according to the rules in the table below (one record for each sub-location, with the minimal position from the sub-location value as the "Start" position and the maximal position from the sub-location value as the "End" position). In this case, all these records should share a single ID
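
The workflow above can be sketched as a function that turns a feature key and its location string into the list of (start, end) pairs, one per gff record. The names `feature_spans` and `gff_strand` are hypothetical. The sketch treats any key ending in `RNA` as a transcript, ignores the `<` and `>` boundary markers, and does not handle locations that refer to other entries by accession.

```python
import re


def sub_locations(location: str) -> list[str]:
    """Strip operators and boundary markers, then split into sub-locations."""
    bare = re.sub(r"complement|join|order|[()<>\s]", "", location)
    return bare.split(",")


def feature_spans(key: str, location: str) -> list[tuple[int, int]]:
    """Return one (start, end) pair per gff record for this feature."""
    spans = []
    for part in sub_locations(location):
        numbers = [int(n) for n in re.findall(r"\d+", part)]
        spans.append((min(numbers), max(numbers)))
    if key in ("operon", "gene") or key.endswith("RNA"):
        # a single record covering the whole location
        return [(min(start for start, _ in spans), max(end for _, end in spans))]
    return spans  # one record per sub-location


def gff_strand(location: str) -> str:
    """'-' when the location uses the complement operator, '+' otherwise."""
    return "-" if "complement" in location else "+"
```

For example, `feature_spans("CDS", "join(1..12,34..45)")` gives two spans, while the same location on an `mRNA` feature collapses to the single span from 1 to 45.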

`gff` field	`GenBank` element
"seqid"	Accession ID value from the `ACCESSION` field. If the accession ID is not specified, the locus name should be used (the first word of the `LOCUS` field)
"source"	"GenBank"
"type"	Feature type should be defined according to the feature key and its qualifiers' values:
- the following keys should remain the same in the "transformed" file: `3'UTR`, `5'UTR`, `assembly_gap`, `C_region`, `CDS`, `centromere`, `D_segment`, `D-loop`, `exon`, `gap`, `gene`, `iDNA`, `J_segment`, `mobile_element`, `mRNA`, `N_region`, `ncRNA`, `operon`, `polyA_site`, `propeptide`, `regulatory`, `repeat_region`, `rRNA`, `S_region`, `stem_loop`, `STS`, `telomere`, `tmRNA`, `transit_peptide`, `tRNA`, `V_region`, `V_segment`
- the following keys should be transformed to:
  - `mat_peptide` -> mature_protein_region
  - `misc_binding` -> binding_site
  - `misc_difference` -> sequence_difference
  - `misc_feature` -> region
  - `misc_recomb` -> recombination_feature
  - `misc_RNA` -> mature_transcript
  - `misc_structure` -> sequence_secondary_structure
  - `modified_base` -> modified_base_site
  - `old_sequence` -> region
  - `oriT` -> origin_of_transfer
  - `precursor_RNA` -> primary_transcript
  - `prim_transcript` -> primary_transcript
  - `primer_bind` -> primer_binding_site
  - `protein_bind` -> protein_binding_site
  - `rep_origin` -> origin_of_replication
  - `sig_peptide` -> signal_peptide
  - `unsure` -> region
  - `variation` -> sequence_variant
- the following keys should be transformed in the described way only if they have the qualifier `/pseudo`:
  - `C_region` -> pseudoC_region
  - `CDS` -> pseudogenic_exon
  - `D_segment` -> pseudoD_segment
  - `exon` -> pseudogenic_exon
  - `gene` -> pseudogene
  - `intron` -> pseudogenic_region
  - `J_segment` -> pseudoJ_segment
  - `mat_peptide` -> pseudomat_peptide
  - `misc_feature` -> pseudogenic_region
  - `misc_RNA` -> pseudogenic_transcript
  - `mRNA` -> pseudogenic_transcript
  - `N_region` -> pseudoN_region
  - `ncRNA` -> pseudoncRNA
  - `operon` -> pseudooperon
  - `propeptide` -> pseudopropeptide
  - `regulatory` -> pseudoregulatory
  - `rRNA` -> pseudorRNA
  - `S_region` -> pseudoS_region
  - `sig_peptide` -> pseudosig_peptide
  - `tmRNA` -> pseudotmRNA
  - `transit_peptide` -> pseudotransit_peptide
  - `tRNA` -> pseudotRNA
  - `V_region` -> pseudoV_region
  - `V_segment` -> pseudoV_segment
- the key `source` should be ignored
- if the key does not fall under any of the rules above, it should be transformed to region
"start"	Start position should be defined as the minimal (smallest) specified position:
- for records from `operon`, `gene`, and any `*RNA` keys - in the feature location
- for all other records - in the feature location or sub-location. A sub-location is one of the constituent parts of the key location. A location can join several sub-locations, e.g. `join(1..12,34..45)` contains 2 sub-locations: `1..12` and `34..45`
- note: if the location is specified with the signs `<` or `>`, these signs should be ignored, e.g. for the location `<10..23` the "start" position should be defined as `10`
"end"	End position should be defined as the maximal (largest) specified position (with the same options as described for "start")
"score"	"." (point)
"strand"	Strand value should be defined as:
- "-" (minus symbol) - if the feature location contains the `complement` operator (e.g. `complement(1..100)` or `complement(join(1..100,200..350))` or `join(complement(1..100),complement(200..350))`)
- "+" (plus symbol) - if the `complement` operator is not used in the feature location definition
"phase"	Phase value should be defined as:
- "." (point) - if the qualifier `/codon_start` is not specified for the feature
- the `/codon_start` value decreased by 1 - if the qualifier `/codon_start` is specified for the feature
"attributes":
- ID	The value should be defined as:
  - for "operon" - the `/operon` qualifier value
  - for "gene" (and "pseudogene") - the `/locus_tag` qualifier value
  - for other features:
    - if the `/locus_tag` qualifier value of that feature is encountered for the first time, the `/locus_tag` qualifier value should be used
    - if the `/locus_tag` qualifier value of that feature was used for any previous record, then `ID` should be `<locus_tag>.<type><index>`, where `<locus_tag>` is the `/locus_tag` qualifier value, `<type>` is the feature type, and `<index>` is the order number of the feature with that locus_tag and type. Example: `IVR12_00001.exon02`
  - if none of the rules above applies (e.g., the `/locus_tag` qualifier is omitted), then `ID` should be `<seqid>.<type><index>`, where `<seqid>` is the "seqid" value of the record, `<type>` is the feature type, and `<index>` is the order number of the feature with that "seqid" and type. Example: `PSALPS670.regulatory03`
- Name	The `/gene` qualifier value. If it is absent, the `/locus_tag` qualifier value should be used. If that is absent too, this attribute is not specified
- Dbxref	The `/db_xref` qualifier value, if it exists (a comma-separated list if there are several values for one feature key)
- Note	The `/note` qualifier value, if it exists
- other	All other feature qualifiers with their values (except those described above) should be specified in the format `tag=value`, where `tag` is the qualifier name and `value` is the qualifier value. Multiple `tag=value` pairs should be separated by semicolons. URL escaping rules are used for tags or values containing the following characters: `,=;`. Spaces are allowed, but tabs must be replaced with the `%09` URL escape. If a qualifier is specified but has no value (e.g. the `/pseudo` qualifier), the value `true` can be specified for such tags
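
The "type", "phase", "ID" and attribute rules of the table can be sketched in Python as follows. All names here (`gff_type`, `gff_phase`, `IdAllocator`, `format_attributes`) are hypothetical. Three points are assumptions rather than part of the mapping: the `/pseudo` rule is checked before the plain renaming rule, the `<index>` suffix is counted from 01 per prefix and type, and `%` is escaped as `%25` (which the GFF3 format requires, although it is not listed above).

```python
from collections import Counter

# Keys copied from the "type" rules; the feature key is spelled J_segment.
UNCHANGED = set("""3'UTR 5'UTR assembly_gap C_region CDS centromere D_segment D-loop exon
gap gene iDNA J_segment mobile_element mRNA N_region ncRNA operon polyA_site propeptide
regulatory repeat_region rRNA S_region stem_loop STS telomere tmRNA transit_peptide tRNA
V_region V_segment""".split())
RENAMED = {
    "mat_peptide": "mature_protein_region", "misc_binding": "binding_site",
    "misc_difference": "sequence_difference", "misc_feature": "region",
    "misc_recomb": "recombination_feature", "misc_RNA": "mature_transcript",
    "misc_structure": "sequence_secondary_structure",
    "modified_base": "modified_base_site", "old_sequence": "region",
    "oriT": "origin_of_transfer", "precursor_RNA": "primary_transcript",
    "prim_transcript": "primary_transcript", "primer_bind": "primer_binding_site",
    "protein_bind": "protein_binding_site", "rep_origin": "origin_of_replication",
    "sig_peptide": "signal_peptide", "unsure": "region", "variation": "sequence_variant",
}
PSEUDO = {"CDS": "pseudogenic_exon", "exon": "pseudogenic_exon", "gene": "pseudogene",
          "intron": "pseudogenic_region", "misc_feature": "pseudogenic_region",
          "misc_RNA": "pseudogenic_transcript", "mRNA": "pseudogenic_transcript"}
PSEUDO.update({key: "pseudo" + key for key in """C_region D_segment J_segment mat_peptide
N_region ncRNA operon propeptide regulatory rRNA S_region sig_peptide tmRNA
transit_peptide tRNA V_region V_segment""".split()})


def gff_type(key, qualifiers):
    """Return the gff "type" for a feature key, or None when the feature is skipped."""
    if key == "source":
        return None
    if "pseudo" in qualifiers and key in PSEUDO:  # assumed to take precedence
        return PSEUDO[key]
    if key in RENAMED:
        return RENAMED[key]
    return key if key in UNCHANGED else "region"


def gff_phase(qualifiers):
    """'.' without /codon_start, otherwise the /codon_start value minus 1."""
    codon_start = qualifiers.get("codon_start")
    return "." if codon_start is None else str(int(codon_start) - 1)


class IdAllocator:
    """Hands out "ID" values; call next_id once per feature, not per record."""

    def __init__(self):
        self.used = set()
        self.counters = Counter()

    def next_id(self, seqid, feature_type, qualifiers):
        locus_tag = qualifiers.get("locus_tag")
        if feature_type == "operon" and qualifiers.get("operon"):
            chosen = qualifiers["operon"]
        elif locus_tag and (feature_type in ("gene", "pseudogene") or locus_tag not in self.used):
            chosen = locus_tag
        else:
            prefix = locus_tag or seqid
            self.counters[prefix, feature_type] += 1
            chosen = f"{prefix}.{feature_type}{self.counters[prefix, feature_type]:02d}"
        self.used.add(chosen)
        return chosen


def escape(text):
    """URL-escape the characters that are reserved in the attributes column."""
    for char, code in (("%", "%25"), (";", "%3B"), ("=", "%3D"), (",", "%2C"), ("\t", "%09")):
        text = text.replace(char, code)
    return text


def format_attributes(pairs):
    """pairs: list of (tag, list of values); an empty list means a flag qualifier."""
    rendered = []
    for tag, values in pairs:
        joined = ",".join(escape(value) for value in values) if values else "true"
        rendered.append(f"{escape(tag)}={joined}")
    return ";".join(rendered)
```

Here `qualifiers` is a dictionary from qualifier name to value, with `None` for flags such as `/pseudo`. Features for which `gff_type` returns `None` are simply not written to the gff file.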
