epam / NGB

New Genome Browser (NGB) - a Web-based NGS data viewer with unique Structural Variations (SVs) visualization capabilities, high performance, scalability, and cloud data support
MIT License

Support of 'GenBank' format #441

Open NShaforostov opened 3 years ago

NShaforostov commented 3 years ago

Background

GenBank format ("GenBank Flat File Format") is a genomic format that describes nucleotide sequences and their protein translations. A GenBank file (.genbank, .gb, .gbk) is a plain-text file that contains both an annotation section and a sequence section. It would be useful to support this file format in NGB by loading such files as reference sequences.

Approach

Format details

Each GenBank file consists of two parts: an annotation section and a sequence section.

Example of the GenBank file:

LOCUS       AF068625                 200 bp    mRNA    linear   ROD 06-DEC-1999
DEFINITION  Mus musculus DNA cytosine-5 methyltransferase 3A (Dnmt3a) mRNA,
            complete cds.
ACCESSION   AF068625 REGION: 1..200
VERSION     AF068625.2  GI:6449467
KEYWORDS    .
SOURCE      Mus musculus (house mouse)
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muroidea; Muridae; Murinae; Mus.
REFERENCE   1  (bases 1 to 200)
  AUTHORS   Okano,M., Xie,S. and Li,E.
  TITLE     Cloning and characterization of a family of novel mammalian DNA
            (cytosine-5) methyltransferases
  JOURNAL   Nat. Genet. 19 (3), 219-220 (1998)
   PUBMED   9662389
FEATURES             Location/Qualifiers
     source          1..200
                     /organism="Mus musculus"
                     /mol_type="mRNA"
                     /db_xref="taxon:10090"
                     /chromosome="12"
                     /map="4.0 cM"
     gene            1..>200
                     /gene="Dnmt3a"
ORIGIN      
        1 gaattccggc ctgctgccgg gccgcccgac ccgccgggcc acacggcaga gccgcctgaa
       61 gcccagcgct gaggctgcac ttttccgagg gcttgacatc agggtctatg tttaagtctt
      121 agctcttgct tacaaagacc acggcaattc cttctctgaa gccctcgcag ccccacagcg
      181 ccctcgcagc cccagcctgc
//

Sequence section

This section contains a nucleotide or amino acid sequence. The sequence data begins on the line immediately below the ORIGIN keyword. The sequence is split into rows of 60 symbols each. Each row starts with the position, within the whole sequence, of the row's first symbol.
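As an illustration of the layout described above, here is a minimal Python sketch that collects the sequence rows and rewrites them as FASTA. The function names `read_origin_sequence` and `to_fasta` are hypothetical, and the sketch assumes a single record that ends with the `//` line, as in the example file.

```python
import re


def read_origin_sequence(lines):
    """Collect the sequence found between the ORIGIN line and the '//' terminator."""
    chunks = []
    in_origin = False
    for line in lines:
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break
        if in_origin:
            # Drop the leading position number and the spaces between symbol groups.
            chunks.append(re.sub(r"[\d\s]", "", line))
    return "".join(chunks)


def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record with rows of `width` symbols."""
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(rows) + "\n"


example = [
    "ORIGIN      ",
    "        1 gaattccggc ctgctgccgg gccgcccgac ccgccgggcc acacggcaga gccgcctgaa",
    "       61 gcccagcgct gaggctgcac ttttccgagg gcttgacatc agggtctatg tttaagtctt",
    "      121 agctcttgct tacaaagacc acggcaattc cttctctgaa gccctcgcag ccccacagcg",
    "      181 ccctcgcagc cccagcctgc",
    "//",
]
sequence = read_origin_sequence(example)
fasta = to_fasta("AF068625", sequence)
```

Here the record name passed to `to_fasta` is the accession ID from the example; which identifier NGB would actually use is not decided by this sketch.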

To implement:

Annotation section and feature table

The annotation section contains a number of different data elements, including locus information, a general description, a unique sequence ID, organism information, reference(s) to the publications linked with the current sequence, the feature table, etc. For the full list of possible elements (data fields) in the annotation section, see here.

The most important part of the annotation section is the feature table. This table describes the roles and locations of higher-order sequence domains and elements. The start of the feature table is marked by a line that also serves as the table header:

FEATURES             Location/Qualifiers

The table has its own specific format.

Short details:

For the full description of the feature table format, possible feature keys and qualifiers, location descriptions and operators, see here.
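To make the table layout more concrete, the following Python sketch reads the feature table into a list of features. It assumes the column layout seen in the example file (feature key within the first 21 columns, location and qualifiers starting at column 22) and does not try to cover every case of the full format, for instance free-text continuation lines that start with `/`. The names `parse_feature_table` and `row` are hypothetical.

```python
def parse_feature_table(lines):
    """Return a list of {'key', 'location', 'qualifiers'} dicts from GenBank lines."""
    features = []
    in_table = False
    for line in lines:
        if line.startswith("FEATURES"):
            in_table = True  # header line: FEATURES  Location/Qualifiers
            continue
        if not in_table or not line.strip():
            continue
        if not line.startswith(" "):
            break  # next top-level keyword, e.g. ORIGIN
        key, body = line[:21].strip(), line[21:].strip()
        if key:
            # A new feature: key plus the beginning of its location.
            features.append({"key": key, "location": body, "qualifiers": []})
        elif not features:
            continue
        elif body.startswith("/"):
            # A qualifier such as /gene="Dnmt3a" or a value-less one such as /pseudo.
            name, _, value = body[1:].partition("=")
            features[-1]["qualifiers"].append([name, value.strip('"')])
        elif features[-1]["qualifiers"]:
            # Continuation of a multi-line qualifier value.
            features[-1]["qualifiers"][-1][1] += " " + body.strip('"')
        else:
            # Continuation of a multi-line location.
            features[-1]["location"] += body
    return features


def row(key, text):
    """Build a feature-table line with the key in columns 6-21 (illustration only)."""
    return " " * 5 + key.ljust(16) + text


example = [
    "FEATURES             Location/Qualifiers",
    row("source", "1..200"),
    row("", '/organism="Mus musculus"'),
    row("gene", "1..>200"),
    row("", '/gene="Dnmt3a"'),
    "ORIGIN",
]
features = parse_feature_table(example)
```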

To implement:

mzueva commented 3 years ago

@NShaforostov Some points to consider:

mzueva commented 3 years ago

Server Implementation

NShaforostov commented 3 years ago

  • Server cannot query the GenBank file format to get a sequence or features in some interval, so we need to preprocess these files into formats that support indexing & queries: fasta for the sequence and gff/gtf for features. Could you please describe a mapping from GenBank features to the gff/gtf file format?

General details of mapping the GenBank format to gff3.

Common workflow should be:

gff field GenBank element
"seqid" Accession ID value from ACCESSION field.
If the accession ID is not specified, the locus name should be used - the first word from the LOCUS field
"source" "GenBank"
"type" The type should be defined according to the feature key and its qualifiers' values:
  • the following keys should remain the same in the "transformed" file: 3'UTR, 5'UTR, assembly_gap, C_region, CDS, centromere, D_segment, D-loop, exon, gap, gene, iDNA, J_segment, mobile_element, mRNA, N_region, ncRNA, operon, polyA_site, propeptide, regulatory, repeat_region, rRNA, S_region, stem_loop, STS, telomere, tmRNA, transit_peptide, tRNA, V_region, V_segment

  • the following keys should be transformed to:
    mat_peptide -> mature_protein_region
    misc_binding -> binding_site
    misc_difference -> sequence_difference
    misc_feature -> region
    misc_recomb -> recombination_feature
    misc_RNA -> mature_transcript
    misc_structure -> sequence_secondary_structure
    modified_base -> modified_base_site
    old_sequence -> region
    oriT -> origin_of_transfer
    precursor_RNA -> primary_transcript
    prim_transcript -> primary_transcript
    primer_bind -> primer_binding_site
    protein_bind -> protein_binding_site
    rep_origin -> origin_of_replication
    sig_peptide -> signal_peptide
    unsure -> region
    variation -> sequence_variant

  • the following keys should be transformed in the described way only if they have the /pseudo qualifier:
    C_region -> pseudoC_region
    CDS -> pseudogenic_exon
    D_segment -> pseudoD_segment
    exon -> pseudogenic_exon
    gene -> pseudogene
    intron -> pseudogenic_region
    J_segment -> pseudoJ_segment
    mat_peptide -> pseudomat_peptide
    misc_feature -> pseudogenic_region
    misc_RNA -> pseudogenic_transcript
    mRNA -> pseudogenic_transcript
    N_region -> pseudoN_region
    ncRNA -> pseudoncRNA
    operon -> pseudooperon
    propeptide -> pseudopropeptide
    regulatory -> pseudoregulatory
    rRNA -> pseudorRNA
    S_region -> pseudoS_region
    sig_peptide -> pseudosig_peptide
    tmRNA -> pseudotmRNA
    transit_peptide -> pseudotransit_peptide
    tRNA -> pseudotRNA
    V_region -> pseudoV_region
    V_segment -> pseudoV_segment

  • the key source should be ignored

  • if the key does not fall under any of the rules above, it should be transformed to region
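The "type" rules above could be sketched in Python roughly as follows. The function name `gff_type` is hypothetical, `qualifiers` is assumed to be any container of qualifier names, and giving the /pseudo rule priority over the plain renaming rule is an assumption (it matters for keys such as mat_peptide that appear in both lists). The J segment key is spelled `J_segment`, as in the /pseudo list.

```python
# Keys that keep their name in the gff file.
UNCHANGED = set("""
3'UTR 5'UTR assembly_gap C_region CDS centromere D_segment D-loop exon gap gene
iDNA J_segment mobile_element mRNA N_region ncRNA operon polyA_site propeptide
regulatory repeat_region rRNA S_region stem_loop STS telomere tmRNA
transit_peptide tRNA V_region V_segment
""".split())

# Keys that are always renamed.
RENAMED = {
    "mat_peptide": "mature_protein_region", "misc_binding": "binding_site",
    "misc_difference": "sequence_difference", "misc_feature": "region",
    "misc_recomb": "recombination_feature", "misc_RNA": "mature_transcript",
    "misc_structure": "sequence_secondary_structure",
    "modified_base": "modified_base_site", "old_sequence": "region",
    "oriT": "origin_of_transfer", "precursor_RNA": "primary_transcript",
    "prim_transcript": "primary_transcript", "primer_bind": "primer_binding_site",
    "protein_bind": "protein_binding_site", "rep_origin": "origin_of_replication",
    "sig_peptide": "signal_peptide", "unsure": "region",
    "variation": "sequence_variant",
}

# Keys that are renamed only when the feature carries the /pseudo qualifier.
PSEUDO = {key: "pseudo" + key for key in """
C_region D_segment J_segment mat_peptide N_region ncRNA operon propeptide
regulatory rRNA S_region sig_peptide tmRNA transit_peptide tRNA V_region V_segment
""".split()}
PSEUDO.update({
    "CDS": "pseudogenic_exon", "exon": "pseudogenic_exon", "gene": "pseudogene",
    "intron": "pseudogenic_region", "misc_feature": "pseudogenic_region",
    "misc_RNA": "pseudogenic_transcript", "mRNA": "pseudogenic_transcript",
})


def gff_type(key, qualifiers):
    """Map a GenBank feature key to a gff type; None means 'skip this feature'."""
    if key == "source":
        return None  # the source key is ignored
    if "pseudo" in qualifiers and key in PSEUDO:
        return PSEUDO[key]  # assumption: /pseudo takes priority over RENAMED
    if key in RENAMED:
        return RENAMED[key]
    if key in UNCHANGED:
        return key
    return "region"  # fallback for keys not covered by any rule
```

For example, `gff_type("gene", {"pseudo"})` gives `pseudogene`, while an intron without /pseudo falls through to `region` because it is not listed in the first two groups.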
"start" Start position should be defined as the minimal (smallest) specified position:
  • for records from operon, gene, any *RNA keys - in the feature location
  • for all other records - in the feature location or sub-location. A sub-location is one of the constituent parts of the key location. A summary location can join different sub-locations, e.g.: join(1..12,34..45) - in such a case, this location contains 2 sub-locations - 1..12 and 34..45
  Note: if the location is specified with the signs < or > - these signs should not be considered, e.g. for the location <10..23 the "start" position should be defined as 10
"end" End position should be defined as the maximal (largest) specified position (with the same options as described for "start")
"score" "." (point)
"strand" Strand value should be defined as:
  • "-" (minus symbol) - if the feature location contains the complement operator (e.g. complement(1..100) or complement(join(1..100,200..350)) or join(complement(1..100),complement(200..350)))
  • "+" (plus symbol) - if the complement operator is not used in the feature location definition
"phase" Phase value should be defined as:
  • "." (point) - if the qualifier /codon_start is not specified for the feature
  • "<codon_start value decreased by 1>" - if the qualifier /codon_start is specified for the feature
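The "start", "end", "strand" and "phase" rules can be illustrated with a short Python sketch. The names `gff_coordinates` and `sub_locations` are hypothetical. The sketch assumes local locations only (no references to other accessions inside the location string) and does not handle rarer location forms such as 123^124.

```python
import re


def sub_locations(location):
    """Split a location such as join(1..12,34..45) into its sub-locations."""
    return re.findall(r"[<>]?\d+(?:\.\.[<>]?\d+)?", location)


def gff_coordinates(location, codon_start=None):
    """Return (start, end, strand, phase) for a whole feature location."""
    # Only digits are collected, so the < and > signs are ignored automatically.
    positions = [int(p) for p in re.findall(r"\d+", location)]
    strand = "-" if "complement(" in location else "+"
    # /codon_start is 1-based, the gff phase is 0-based.
    phase = "." if codon_start is None else str(int(codon_start) - 1)
    return min(positions), max(positions), strand, phase


whole = gff_coordinates("complement(join(1..100,200..350))", codon_start="2")
parts = [gff_coordinates(part) for part in sub_locations("join(1..12,34..45)")]
```

Calling `gff_coordinates` on the whole location corresponds to the operon, gene and *RNA case, while calling it on each item of `sub_locations` corresponds to the per-sub-location option for the other records. Note that the strand of a sub-location has to be taken from the whole location, since the complement operator is lost when the string is split.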
"attributes"
  • ID
    - For "operon" - the /operon qualifier value
    - For "gene" (and "pseudogene") - the /locus_tag qualifier value
    - For other features:
      - if the /locus_tag qualifier value of that feature is encountered for the first time - the /locus_tag qualifier value should be used
      - if the /locus_tag qualifier value of that feature was used for any previous record, then ID should be <locus_tag>.<type><index>, where <locus_tag> is the /locus_tag qualifier value, <type> - feature type, <index> - order number of the feature with certain locus_tag and type. Example: IVR12_00001.exon02
    - If the feature does not fall under any of the rules above (e.g., the /locus_tag qualifier is omitted), then ID should be <seqid>.<type><index>, where <seqid> is the "seqid" value of the record, <type> - feature type, <index> - order number of the feature with certain "seqid" and type. Example: PSALPS670.regulatory03
  • Name The /gene qualifier value. If it's absent, the /locus_tag qualifier value should be used. If that is absent too, this attribute is not specified
  • Dbxref The /db_xref qualifier value if it exists (a comma-separated list if there are several values for one feature key)
  • Note The /note qualifier value if it exists
  • other All feature qualifiers with their values (except those described above) should be specified in the format tag=value, where tag is the qualifier name and value is the qualifier value. Multiple tag=value pairs should be separated by semicolons. URL escaping rules are used for tags or values containing the following characters: ,=;. Spaces are allowed, but tabs must be replaced with the %09 URL escape. If any qualifier is specified but has no value (e.g. the /pseudo qualifier) - for such tags the value true can be specified
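A possible Python sketch of the "attributes" rules is shown below. All names (`IdGenerator`, `escape`, `build_attributes`) are hypothetical. Several details are assumptions rather than part of the description above: the index is zero-padded to two digits (inferred from the examples exon02 and regulatory03), only features that actually receive an indexed ID are counted, /gene and /locus_tag are left out of the "other" group, and repeated "other" qualifiers are joined with commas. Qualifiers are assumed to be a dict that maps a qualifier name to a list of its values, with an empty list for value-less qualifiers.

```python
from collections import defaultdict

# Characters that must be escaped according to the rules above.
ESCAPES = str.maketrans({",": "%2C", "=": "%3D", ";": "%3B", "\t": "%09"})


def escape(text):
    return text.translate(ESCAPES)


class IdGenerator:
    """Produces IDs following the operon / gene / locus_tag / seqid rules."""

    def __init__(self):
        self.used = set()
        self.counters = defaultdict(int)

    def next_id(self, seqid, gff_type, locus_tag=None, operon=None):
        is_gene = gff_type in ("gene", "pseudogene")
        if gff_type == "operon" and operon:
            new_id = operon
        elif locus_tag and (is_gene or locus_tag not in self.used):
            new_id = locus_tag
        else:
            prefix = locus_tag or seqid
            self.counters[(prefix, gff_type)] += 1
            # Assumption: two-digit, zero-padded index as in "exon02".
            new_id = "%s.%s%02d" % (prefix, gff_type, self.counters[(prefix, gff_type)])
        self.used.add(new_id)
        return new_id


def build_attributes(feature_id, qualifiers):
    """Build the gff attributes column from a dict of qualifier name -> list of values."""
    def first(name):
        values = qualifiers.get(name)
        return values[0] if values else None

    attrs = [("ID", escape(feature_id))]
    name = first("gene") or first("locus_tag")
    if name:
        attrs.append(("Name", escape(name)))
    if qualifiers.get("db_xref"):
        attrs.append(("Dbxref", ",".join(escape(v) for v in qualifiers["db_xref"])))
    if first("note"):
        attrs.append(("Note", escape(first("note"))))
    for tag, values in qualifiers.items():
        if tag in ("gene", "locus_tag", "db_xref", "note"):
            continue  # assumption: these are already covered by the attributes above
        value = ",".join(escape(v) for v in values) if values else "true"
        attrs.append((escape(tag), value))
    return ";".join(tag + "=" + value for tag, value in attrs)


ids = IdGenerator()
gene_id = ids.next_id("PSALPS670", "gene", locus_tag="IVR12_00001")
exon_id = ids.next_id("PSALPS670", "exon", locus_tag="IVR12_00001")
```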
NShaforostov commented 2 years ago

Docs were added via #566 and are located here.