Closed saliksyed closed 6 years ago
Thanks for writing this @saliksyed, that's exactly right
If it's trivial would be useful to be able to filter pseudo genes on/off (but we can always do that on the frontend if not)
Here's JSON format that the frontend is using now but it's ok if there needs to be tweaks to make life easier
[
// Gene
{"name":"MIR6859-1","startIndex":17368,"length":68,"soClass":"ncRNA_gene","type":0,"class":2,"strand":3,"transcriptCount":1},
// Transcript
{"name":"MIR6859-1-201","startIndex":17368,"length":68,"soClass":"miRNA","type":1,"class":2},
// Transcript Component (an exon in this case)
{"name":"ENSE00003746039","startIndex":17368,"length":68,"soClass":"exon","type":2,"class":0},
// Gene
{"name":"MIR1302-2HG","startIndex":29553,"length":1556,"soClass":"ncRNA_gene","type":0,"class":2,"strand":2,"transcriptCount":2},
...
]
Because GFF3 doesn't explicitly distinguish between genes, transcripts and components I've added a "type"
field, where the enum is
enum GenomeFeatureType {
Gene = 0,
Transcript = 1,
TranscriptComponent = 2,
}
And "class"
field which subdivides the type, however for now we can skip those two fields and I'll do the classification on the frontend, but it's something that I think we should move to the backend in the future
Original GFF3 was:
1 mirbase ncRNA_gene 17369 17436 . - . ID=gene:ENSG00000278267;Name=MIR6859-1;biotype=miRNA;description=microRNA 6859-1 [Source:HGNC Symbol%3BAcc:HGNC:50039];gene_id=ENSG00000278267;logic_name=ncrna;version=1
1 mirbase miRNA 17369 17436 . - . ID=transcript:ENST00000619216;Parent=gene:ENSG00000278267;Name=MIR6859-1-201;biotype=miRNA;tag=basic;transcript_id=ENST00000619216;transcript_support_level=NA;version=1
1 mirbase exon 17369 17436 . - . Parent=transcript:ENST00000619216;Name=ENSE00003746039;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003746039;rank=1;version=1
1 havana ncRNA_gene 29554 31109 . + . ID=gene:ENSG00000243485;Name=MIR1302-2HG;biotype=lincRNA;description=MIR1302-2 host gene [Source:HGNC Symbol%3BAcc:HGNC:52482];gene_id=ENSG00000243485;logic_name=havana;version=5
@saliksyed @haxiomic Thanks a lot for the explanation.
I think we can use the type
information in GFF3, which is the 3rd field.
For gene
, the ENSEMBL dataset distinguish between gene
, pseudogene
and ncRNA_gene
.
For transcript
, we have transcript
, pseudogenic_transcript
, miRNA
, lnc_RNA
, mRNA
.
For transcript_components
, we have exon
, five_prime_UTR
Here are the ones I found but there might be more.
The flattened list is a simple representation, but I guess you might need to parse the type
field with the numbers to build the hirachical structure in the front end. Would it be more clear and easier to parse if we add a little bit structure here:
Original flattened list:
[gene1, transcript1, exon1, exon2, exon3, transcript2, exon1, exon2, exon3, transcriptN, exon1, exon2, exon3, gene2, transcript1, exon1, exon2, exon3, transcript2, exon1, exon2, exon3]
New structure:
[gene1, [transcript1, exon1, exon2, exon3], [transcript2, exon1, exon2, exon3], [transcriptN, exon1, exon2, exon3], gene2, [transcript1, exon1, exon2, exon3], [transcript2, exon1, exon2, exon3]]
With include_transcripts = false
these two will become the same.
When looping over this new structure, when we hit an object (dict), we know it's a gene; When we hit an array (list), we know it has the first element as the transcript and following ones as components of that transcript.
Two benifits we will have here: First the three-level data structure is embeded in this return list, so we don't need the GenomeFeatureType
, and (I guess) it might be easier to parse on the frontend. The transcript_count
may also be removed. Second, we can provide the original type
information from ENSEMBL GFF file. They're put into the soClass
field now but I think it might be more consistent to provide them as type
, because we are also checking the type
everywhere else in the frontend. It will also make the code more portable to other regular annotation tracks I guess.
For other fields in the exact JSON format, I think I will rename them a little bit to match what we have in the MongoDB. This will make later filters or renderrings easier by removing the translations back and forth.
Hey @yudong
Regarding types in the GFF3 files, the problem is that the full set of possible types is massive, you can see them all here: http://www.sequenceontology.org/browser/current_svn/term/SO:0000234
Often the frontend is interested in broader categories, if we used a big lookup table we'd need to update the frontend everytime a new type is encountered in a source dataset, otherwise we couldn't determine how to render a component. For example, after a new dataset is added it might introduce a new type scRNA
, the renderer has to decide to display this as protein coding feature or not but it won't be able to determine
Ensemble's Chromosome 1 has quite a few:
gene
ncRNA_gene
pseudogene
lnc_RNA
mRNA
pseudogenic_transcript
transcript
miRNA
ncRNA
rRNA
scRNA
snoRNA
snRNA
CDS
exon
five_prime_UTR
three_prime_UTR
Ideally when importing GFF files the backend would compare the soType against the Sequence Ontology database to determine the broad categories
The full ontological type hierarchy is downloadable here: https://github.com/The-Sequence-Ontology/SO-Ontologies
For the strand field let's use the GFF3 representation directly, so like "strand": "+"
, "strand": "-"
or "strand": "?"
--
Regarding structure, nested is OK too, I went with flat because I thought it'd save some effort on your end. I think if we use nested it should be full nesting, so
[
{
name: gene1,
transcripts: [{
name: transcript1,
components: [
{ name: component1 }
]
}, {
name: transcript2
components: [
{ name: component2 }
]
}]
}
...
]
(Something half-way between will make things more complicated)
But full nesting would make parsing very easy for end-users (JSON.parse() you get the full hierarchy)
You're right at this stage the "type"
field doesn't contribute new information however it's there because I'm expecting that in the future we're going to have different sorts of genomic features that are at the same level as genes. So we might have something like [promotor, gene, speically-marked-region, gene, ...]
@haxiomic Thanks a lot for the information! Sorry that the beginning of my last post was confusing. I think we are on the same track here.
Maybe we can determine things step by step?
First question is that if the /reference
end point should return a hirachical structure of data. Based on our previous discussions, I think we agreed that this endpoint will only be used to display ENSEMBL dataset, which has the hirachical data structure itself. However, if we ever want to start displaying other datasets, it will become tricky how to define the hirachy. For example, from ExAC we know a variant has three affected genes, but they're not in the same region. How to display the variant at bp 1M when we are looking at a gene at bp 2M is a question, and where should that variant be added into the hirachy is another question. My feeling is that if we later need to display more and more related data from different datasets in the reference track, we might want to switch to a query
based interaction, where the frontend launch queries to pull data and decide how to display them. This is not necessary for our first version as we are focusing on making it work with ENSEMBL GFF now.
Second, based on the first point, assuming we are working with the hirachy structure, the full nesting JSON format should be preferred. The general logic here is that the frontend should get the data structure it wants and the backend (which has more computing power and memory) should be responsible for building that data structure. This holds for all other endpoints.
Third, based on the second point that we have a full nested structure, I think we don't need to use GenomeFeatureType
and transcript_count
any more, because I guess they were used previously to build this nested structure.
@yudongqiu -- I think we just need this for the very specific use case of displaying the reference (so only Ensembl data of transcripts, exons and genes) -- we may want hierarchy for other datasets at some point (though we don't have any plans for this), but that is a future project outside of the scope.
The reference track shows data regarding genes, transcripts and exons for the reference track in the viewer. The endpoint will have the following API:
/reference/<start_bp>/<end_bp>?include_transcripts=<true/false>
Transcripts & Exons are only included if the
include_transcripts
parameter is set to true.It is important to note that the start_bp and end_bp range used to fetch transcripts & exons should be extended to fit the entire interval occupied by genes which intersect the original inputted range, otherwise we may lose transcripts for genes whose starting point is very close to the end_bp parameter or whose starting point is prior to start_bp, but whose span still intersects [start_bp, end_bp].
The reference track on the frontend needs data in a hierarchical order (genes, transcript, exons). As an example consider the following data:
This should be returned flattened as follows
[gene1, transcript1, exon1, exon2, exon3, transcript2, exon1, exon2, exon3, transcriptN, exon1, exon2, exon3, gene2, transcript1, exon1, exon2, exon3, transcript2, exon1, exon2, exon3]
Additionally each gene should also contain a
num_transcripts
field, which indicates the number of transcripts within the gene.Tasks: