Open NShaforostov opened 3 years ago
@NShaforostov Some points to consider:
- … `sequence` section
- Server cannot query GenBank file format to get sequence or features in some interval, so we need to preprocess these files to formats that support indexing & queries: `fasta` for sequence and `gff/gtf` for features. Could you please describe a mapping from GenBank features to `gff/gtf` file format?
- … `fasta` and `gtf` files and register these files as reference and gene files.
- … `features` data into `gtf` file and register this file as gene annotation file.
- … `source` to all feature files and save initial file path (GenBank file) in this field. For regular feature files `source` shall be equal to `path`.
> - Server cannot query GenBank file format to get sequence or features in some interval, so we need to preprocess these files to formats that support indexing & queries: `fasta` for sequence and `gff/gtf` for features. Could you please describe a mapping from GenBank features to `gff/gtf` file format?
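
For the sequence half of this preprocessing, a minimal Python sketch could look like the one below. It assumes a single-record file whose sequence rows follow the `ORIGIN` line and end at `//`. The function name `genbank_to_fasta` and the choice of FASTA header (accession, falling back to the locus name) are illustrative assumptions, not NGB code.

```python
import re


def genbank_to_fasta(genbank_text, width=60):
    """Return the sequence section of one GenBank record as FASTA text."""
    locus = accession = None
    in_sequence = False
    chunks = []
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            locus = fields[1] if len(fields) > 1 else None
        elif line.startswith("ACCESSION"):
            fields = line.split()
            accession = fields[1] if len(fields) > 1 else None
        elif line.startswith("ORIGIN"):
            in_sequence = True
        elif line.startswith("//"):
            break  # end of the record
        elif in_sequence:
            # Drop the leading position number and the spaces between blocks.
            chunks.append(re.sub(r"[\d\s]", "", line))
    sequence = "".join(chunks)
    if not sequence:
        raise ValueError("no sequence data found after ORIGIN")
    header = accession or locus or "unknown"  # assumption: header choice
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(rows) + "\n"


# Hypothetical record, reduced to the lines this sketch reads.
EXAMPLE = """\
LOCUS       DEMO0001                  20 bp    DNA     linear   UNA 01-JAN-2000
ACCESSION   DEMO0001
ORIGIN
        1 gatcctccat atacaacggt
//
"""

print(genbank_to_fasta(EXAMPLE))
```

The resulting FASTA can then be indexed and registered like any other reference file.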
General details of the mapping from `GenBank` format to `gff3`.
Common workflow should be:
- Features of the GenBank file are converted to `gff` format:
  - the `source` record should be omitted
  - for each `operon`, `gene`, or transcript (any `*RNA`) feature, a single record in `gff` format should be created according to the rules in the table below, with the minimal position from the location value as the "Start" position and the maximal position as the "End" position
  - for any other feature with a simple location (no `join` operator), a single record in `gff` format should be created according to the rules in the table below, with the minimal position from the location value as the "Start" position and the maximal position as the "End" position
  - for any other feature with a compound location (`join` operator), several records in `gff` format should be created according to the rules in the table below (one record for each sub-location, with the minimal position from the sub-location value as the "Start" position and the maximal position from the sub-location value as the "End" position). In this case, all these records should have a single ID (the same for each record)

gff field | GenBank element |
---|---|
"seqid" | Accession ID value from the ACCESSION field. If the accession ID is not specified, the locus name (the first word of the LOCUS field) should be used |
"source" | "GenBank" |
"type" | Source type should be defined according to the feature key and its qualifiers' values:<ul><li>used as is: `3'UTR`, `5'UTR`, `assembly_gap`, `C_region`, `CDS`, `centromere`, `D_segment`, `D-loop`, `exon`, `gap`, `gene`, `iDNA`, `J_segment`, `mobile_element`, `mRNA`, `N_region`, `ncRNA`, `operon`, `polyA_site`, `propeptide`, `regulatory`, `repeat_region`, `rRNA`, `S_region`, `stem_loop`, `STS`, `telomere`, `tmRNA`, `transit_peptide`, `tRNA`, `V_region`, `V_segment`</li><li>renamed:<br/>`mat_peptide` -> `mature_protein_region`<br/>`misc_binding` -> `binding_site`<br/>`misc_difference` -> `sequence_difference`<br/>`misc_feature` -> `region`<br/>`misc_recomb` -> `recombination_feature`<br/>`misc_RNA` -> `mature_transcript`<br/>`misc_structure` -> `sequence_secondary_structure`<br/>`modified_base` -> `modified_base_site`<br/>`old_sequence` -> `region`<br/>`oriT` -> `origin_of_transfer`<br/>`precursor_RNA` -> `primary_transcript`<br/>`prim_transcript` -> `primary_transcript`<br/>`primer_bind` -> `primer_binding_site`<br/>`protein_bind` -> `protein_binding_site`<br/>`rep_origin` -> `origin_of_replication`<br/>`sig_peptide` -> `signal_peptide`<br/>`unsure` -> `region`<br/>`variation` -> `sequence_variant`</li><li>for features with the qualifier `/pseudo`:<br/>`C_region` -> `pseudoC_region`<br/>`CDS` -> `pseudogenic_exon`<br/>`D_segment` -> `pseudoD_segment`<br/>`exon` -> `pseudogenic_exon`<br/>`gene` -> `pseudogene`<br/>`intron` -> `pseudogenic_region`<br/>`J_segment` -> `pseudoJ_segment`<br/>`mat_peptide` -> `pseudomat_peptide`<br/>`misc_feature` -> `pseudogenic_region`<br/>`misc_RNA` -> `pseudogenic_transcript`<br/>`mRNA` -> `pseudogenic_transcript`<br/>`N_region` -> `pseudoN_region`<br/>`ncRNA` -> `pseudoncRNA`<br/>`operon` -> `pseudooperon`<br/>`propeptide` -> `pseudopropeptide`<br/>`regulatory` -> `pseudoregulatory`<br/>`rRNA` -> `pseudorRNA`<br/>`S_region` -> `pseudoS_region`<br/>`sig_peptide` -> `pseudosig_peptide`<br/>`tmRNA` -> `pseudotmRNA`<br/>`transit_peptide` -> `pseudotransit_peptide`<br/>`tRNA` -> `pseudotRNA`<br/>`V_region` -> `pseudoV_region`<br/>`V_segment` -> `pseudoV_segment`</li><li>`source` should be ignored</li></ul> |
"start" | Start position should be defined as the minimal (smallest) specified position:<ul><li>for `operon`, `gene`, any `*RNA` keys - in the feature location</li><li>for all other records - in the feature location or sub-location. A sub-location is one of the constituent parts of the key location. A summary location can join different sub-locations, e.g. `join(1..12,34..45)`: this location contains 2 sub-locations, `1..12` and `34..45`<br/><br/>**_Note_**: if the location is specified with the signs `<` or `>`, these signs should not be considered, e.g. for the location `<10..23` the "**start**" position should be defined as `10`</li></ul> |
"end" | End position should be defined as maximal (largest) specified position (with the same options as described for "start") |
"score" | "." (point) |
"strand" | Strand value should be defined as:<ul><li>"-" if the `complement` operator is used in the feature location definition (e.g. `complement(1..100)` or `complement(join(1..100,200..350))` or `join(complement(1..100),complement(200..350))`)</li><li>"+" if the `complement` operator is not used in the feature location definition</li></ul> |
"phase" | Phase value should be defined as:<ul><li>… if for the feature, the qualifier `/codon_start` is not specified</li><li>the `/codon_start` value decreased by 1, if for the feature, the qualifier `/codon_start` is specified</li></ul> |
"attributes": | |
- ID | <ul><li>… `/operon` qualifier value</li><li>… `/locus_tag` qualifier value:<br/>- if the `/locus_tag` qualifier value of that feature is encountered for the first time, the `/locus_tag` qualifier value should be used<br/>- if the `/locus_tag` qualifier value of that feature was used for any previous record, then ID should be `<locus_tag>.<type><index>`, where `<locus_tag>` is the `/locus_tag` qualifier value, `<type>` is the feature type, and `<index>` is the order number of the feature with that locus_tag and type. Example: `IVR12_00001.exon02`</li></ul>If none of the rules above applies (e.g., the `/locus_tag` qualifier is omitted), then ID should be `<seqid>.<type><index>`, where `<seqid>` is the "seqid" value of the record, `<type>` is the feature type, and `<index>` is the order number of the feature with that "seqid" and type. Example: `PSALPS670.regulatory03` |
- Name | The /gene qualifier value. If it's absent, the /locus_tag qualifier value should be used. If it's absent, this attribute is not being specified |
- Dbxref | The /db_xref qualifier value if exists (comma-separated list if there are several values for one feature key) |
- Note | The /note qualifier value if exists |
- other | All feature qualifiers with their values (except those described above) should be specified in the format `tag=value`, where `tag` is the qualifier name and `value` is the qualifier value. Multiple `tag=value` pairs should be separated by semicolons. URL escaping rules are used for tags or values containing the following characters: `,=;`. Spaces are allowed, but tabs must be replaced with the `%09` URL escape. If a qualifier is specified but has no value (e.g. the `/pseudo` qualifier), the value `true` can be specified for such tags |
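
The position, strand, phase and attribute rules from the table can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the NGB implementation: the feature key is used directly as the `gff` type (the type renaming is left out), the ID is supplied by the caller, `.` is assumed as the phase when `/codon_start` is absent, `%` is escaped in addition to `,=;` and tab, and locations with mixed strands or remote references are not handled. All function names are hypothetical.

```python
import re

# "," "=" ";" and tab come from the table; escaping "%" is an added assumption.
ESCAPES = {"%": "%25", ",": "%2C", "=": "%3D", ";": "%3B", "\t": "%09"}
# Qualifiers consumed by dedicated columns/attributes in this sketch.
CONSUMED = {"gene", "locus_tag", "db_xref", "note", "codon_start"}


def escape(text):
    return "".join(ESCAPES.get(ch, ch) for ch in text)


def parse_location(location):
    """Return (strand, [(start, end), ...]), one pair per sub-location."""
    strand = "-" if "complement" in location else "+"
    spans = []
    for part in location.split(","):
        # "<" and ">" are never matched, so partial bounds are ignored.
        positions = [int(p) for p in re.findall(r"\d+", part)]
        spans.append((min(positions), max(positions)))
    return strand, spans


def gff_attributes(feature_id, qualifiers):
    """qualifiers: name -> list of values ([] for a valueless qualifier)."""
    attrs = [("ID", [feature_id])]
    name = qualifiers.get("gene") or qualifiers.get("locus_tag")
    if name:
        attrs.append(("Name", name[:1]))
    for attr, tag in (("Dbxref", "db_xref"), ("Note", "note")):
        if qualifiers.get(tag):
            attrs.append((attr, qualifiers[tag]))
    for tag, values in qualifiers.items():
        if tag not in CONSUMED:
            attrs.append((tag, values or ["true"]))
    return ";".join(
        escape(tag) + "=" + ",".join(escape(v) for v in values)
        for tag, values in attrs
    )


def gff_records(seqid, key, location, qualifiers, feature_id):
    """Return the gff lines for one GenBank feature."""
    strand, spans = parse_location(location)
    if key in ("operon", "gene") or key.endswith("RNA"):
        # One record covering the whole location.
        spans = [(min(s for s, _ in spans), max(e for _, e in spans))]
    codon_start = qualifiers.get("codon_start")
    phase = str(int(codon_start[0]) - 1) if codon_start else "."
    attributes = gff_attributes(feature_id, qualifiers)
    return [
        "\t".join([seqid, "GenBank", key, str(start), str(end),
                   ".", strand, phase, attributes])
        for start, end in spans
    ]


for record in gff_records(
    "DEMO0001", "CDS", "complement(join(<1..12,34..45))",
    {"gene": ["abc"], "codon_start": ["1"], "note": ["a;b"]}, "TAG_0001",
):
    print(record)
```

A `CDS` with a `join` location yields one line per sub-location with the same ID, while a `gene` with the same location yields a single line spanning it.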
Docs were added via #566 and are located here.
Background
`GenBank` format ("GenBank Flat File Format") is a genomic format that describes nucleotide sequences and their protein translations.
A `GenBank` file (`.genbank`, `.gb`, `.gbk`) is a text file that contains both an annotation section and a sequence section. It would be useful to support this file format in NGB, loading such files as reference sequences.

Approach
Format details
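Each `GenBank` file consists of two parts:
- the annotation section, which starts with the `LOCUS` line
- the sequence section, which starts with the `ORIGIN` line; the end of the section is marked by a line containing only `//`

Example of the `GenBank` file:

Sequence section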
This section contains a nucleotide sequence or an amino acid sequence. The sequence data begins on the line immediately below the `ORIGIN` word. The sequence is split into rows of 60 symbols each. Each row starts with the position, within the whole sequence, of the row's first symbol.

To implement:
- … `GenBank` file as the reference sequence
- … `GenBank` files in the same way as `FASTA` files now, no additional options are required
- … `GenBank` file with amino acid sequence, the corresponding error message should appear

Annotation section and feature table
The annotation section contains a number of data elements, including locus information, a general description, a unique sequence ID, organism information, references to the publications linked to the current sequence, the feature table, etc. For the full list of possible elements (data fields) in the annotation section, see here.
The most important part of the annotation section is the feature table. This table provides data describing the roles and locations of higher order sequence domains and elements. The start of the feature table is marked by a line that is also the table header:
Table has its own specific format.
Short details:
- … (`GenBank` file) must contain at least one `source` key - this key identifies the biological source of the specified span of the sequence:
  - more than one `source` feature key per sequence is allowed
  - every entry/record has, as a minimum, either a single `source` key spanning the entire sequence or multiple `source` keys which together span the entire sequence
- Feature keys, e.g.:
  - `CDS` - coding sequence; sequence of nucleotides that corresponds with the sequence of amino acids in a protein (location includes stop codon); feature includes amino acid conceptual translation
  - `exon` - region of genome that codes for portion of spliced mRNA, rRNA and tRNA
  - `gene` - region of biological interest identified as a gene and for which a name has been assigned
  - `intron` - a segment of DNA that is transcribed, but removed from within the transcript by splicing together the sequences (exons) on either side of it
  - `misc_feature` - region of biological interest which cannot be described by any other feature key
  - `mRNA` - messenger RNA
  - `ncRNA` - a non-protein-coding gene, other than ribosomal RNA and transfer RNA, the functional molecule of which is the RNA transcript
  - `prim_transcript` - primary (initial, unprocessed) transcript
  - `tmRNA` - transfer messenger RNA
  - `tRNA` - mature transfer RNA
  - `variation` - a related strain contains stable mutations from the same gene which differ from the presented sequence at this location
- Location descriptions, e.g.:
  - `340..565` - points to a continuous range of bases bounded by and including the starting and ending bases
  - `<345..500` - indicates that the exact lower boundary point of a feature is unknown
  - `102.110` - indicates that the exact location is unknown but that it is one of the bases between bases `102` and `110`, inclusive
  - `join(12..78,134..202)` - regions `12` to `78` and `134` to `202` should be joined to form one contiguous sequence
  - `complement(34..126)` - start at the base complementary to `126` and finish at the base complementary to base `34`
- Qualifiers take the form of a slash (`/`) followed by the qualifier name and, if applicable, an equal sign (`=`) and a value, e.g.:
  - `/organism="Mus musculus"`
  - `/strain="CD1"`
  - `/mol_type="genomic DNA"`
  - `/gene="ubc42"`
For a full description of the feature table format, possible feature keys and qualifiers, location descriptions and operators, see here.
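
To make the layout concrete, here is a minimal Python sketch that reads feature table lines into keys, locations and qualifiers. It assumes the input starts at the `FEATURES` header, that feature keys are indented less than 21 columns while qualifier and continuation lines start in column 22, and that no quoted value has a continuation line beginning with `/`. The function name `parse_feature_table` and the sample lines are hypothetical.

```python
def parse_feature_table(lines):
    """Return a list of {"key", "location", "qualifiers"} dicts."""
    features = []
    current = None
    last_qualifier = None
    for line in lines:
        if not line.strip() or line.startswith("FEATURES"):
            continue  # blank line or table header
        if not line.startswith(" "):
            break  # next section, e.g. ORIGIN
        indent = len(line) - len(line.lstrip())
        body = line.strip()
        if indent < 21:
            # New feature: key, then the (start of the) location.
            key, _, location = body.partition(" ")
            current = {"key": key, "location": location.strip(), "qualifiers": {}}
            features.append(current)
            last_qualifier = None
        elif body.startswith("/"):
            name, has_value, value = body[1:].partition("=")
            values = current["qualifiers"].setdefault(name, [])
            if has_value:
                values.append(value.strip('"'))
            last_qualifier = name
        elif last_qualifier is None:
            current["location"] += body  # location continued on the next line
        elif current["qualifiers"][last_qualifier]:
            values = current["qualifiers"][last_qualifier]
            values[-1] = (values[-1] + " " + body).strip('"')
    return features


Q = " " * 21  # qualifier and continuation lines start in column 22
FEATURE_LINES = [
    "FEATURES             Location/Qualifiers",
    "     source          1..202",
    Q + '/organism="Mus musculus"',
    Q + '/mol_type="genomic DNA"',
    "     gene            complement(join(12..78,",
    Q + "134..202))",
    Q + '/gene="ubc42"',
    Q + "/pseudo",
    "ORIGIN",
]

for feature in parse_feature_table(FEATURE_LINES):
    print(feature["key"], feature["location"], feature["qualifiers"])
```

Qualifiers are kept as lists because a qualifier such as `/db_xref` may repeat; a valueless qualifier such as `/pseudo` maps to an empty list.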
To implement:
- … `GenBank` file as the gene file to the reference sequence:
  - … `add_genes` command performing
  - … `reg_ref` command performing (using `--genes` option)
- … `GenBank` files in the same way as `GTF` files now, no additional options are required
- … `GenBank` file at the GENE track in the "Browser" panel
- … `GTF` GENE track:
  - … `gene` feature key:
    - … `gene` record
    - … `/gene` qualifier of the `gene` record
    - … `gene` record should be shown in tooltip menu
  - … `CDS`, `mRNA`, `ncRNA`):
    - … `/gene` qualifier