VALIS-software / SIRIUS-backend

A graph analytics engine for genomic data
MIT License
0 stars 1 forks source link

Implement /reference endpoint #52

Closed saliksyed closed 6 years ago

saliksyed commented 6 years ago

The reference track shows data regarding genes, transcripts and exons for the reference track in the viewer. The endpoint will have the following API:

/reference/<start_bp>/<end_bp>?include_transcripts=<true/false>

Transcripts & Exons are only included if the include_transcripts parameter is set to true.

It is important to note that the start_bp and end_bp range used to fetch transcripts & exons should be extended to fit the entire interval occupied by genes which intersect the original inputted range, otherwise we may lose transcripts for genes whose starting point is very close to the end_bp parameter or whose starting point is prior to start_bp, but whose span still intersects [start_bp, end_bp].

The reference track on the frontend needs data in a hierarchical order (genes, transcript, exons). As an example consider the following data:

gene1
   transcript1
       exon1
       exon2
       exon3
   transcript2
       exon1
       exon2
       exon3
...
   transcriptN
       exon1
       exon2
       exon3
gene2
   transcript1
       exon1
       exon2
       exon3
   transcript2
       exon1
       exon2
       exon3

This should be returned flattened as follows [gene1, transcript1, exon1, exon2, exon3, transcript2, exon1, exon2, exon3, transcriptN, exon1, exon2, exon3, gene2, transcript1, exon1, exon2, exon3, transcript2, exon1, exon2, exon3]

Additionally each gene should also contain a num_transcripts field, which indicates the number of transcripts within the gene.

Tasks:

haxiomic commented 6 years ago

Thanks for writing this @saliksyed, that's exactly right

If it's trivial would be useful to be able to filter pseudo genes on/off (but we can always do that on the frontend if not)

Here's JSON format that the frontend is using now but it's ok if there needs to be tweaks to make life easier

[
    // Gene
    {"name":"MIR6859-1","startIndex":17368,"length":68,"soClass":"ncRNA_gene","type":0,"class":2,"strand":3,"transcriptCount":1},
    // Transcript
    {"name":"MIR6859-1-201","startIndex":17368,"length":68,"soClass":"miRNA","type":1,"class":2},
    // Transcript Component (an exon in this case)
    {"name":"ENSE00003746039","startIndex":17368,"length":68,"soClass":"exon","type":2,"class":0},

    // Gene
    {"name":"MIR1302-2HG","startIndex":29553,"length":1556,"soClass":"ncRNA_gene","type":0,"class":2,"strand":2,"transcriptCount":2},

    ...
]

Because GFF3 doesn't explicitly distinguish between genes, transcripts and components I've added a "type" field, where the enum is

enum GenomeFeatureType {
    Gene = 0,
    Transcript = 1,
    TranscriptComponent = 2,
}

And "class" field which subdivides the type, however for now we can skip those two fields and I'll do the classification on the frontend, but it's something that I think we should move to the backend in the future

Original GFF3 was:

1   mirbase ncRNA_gene  17369   17436   .   -   .   ID=gene:ENSG00000278267;Name=MIR6859-1;biotype=miRNA;description=microRNA 6859-1 [Source:HGNC Symbol%3BAcc:HGNC:50039];gene_id=ENSG00000278267;logic_name=ncrna;version=1
1   mirbase miRNA   17369   17436   .   -   .   ID=transcript:ENST00000619216;Parent=gene:ENSG00000278267;Name=MIR6859-1-201;biotype=miRNA;tag=basic;transcript_id=ENST00000619216;transcript_support_level=NA;version=1
1   mirbase exon    17369   17436   .   -   .   Parent=transcript:ENST00000619216;Name=ENSE00003746039;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003746039;rank=1;version=1
1   havana  ncRNA_gene  29554   31109   .   +   .   ID=gene:ENSG00000243485;Name=MIR1302-2HG;biotype=lincRNA;description=MIR1302-2 host gene [Source:HGNC Symbol%3BAcc:HGNC:52482];gene_id=ENSG00000243485;logic_name=havana;version=5
yudongqiu commented 6 years ago

@saliksyed @haxiomic Thanks a lot for the explanation.

I think we can use the type information in GFF3, which is the 3rd field.

For gene, the ENSEMBL dataset distinguish between gene, pseudogene and ncRNA_gene.

For transcript, we have transcript, pseudogenic_transcript, miRNA, lnc_RNA, mRNA.

For transcript_components, we have exon, five_prime_UTR

Here are the ones I found but there might be more.

The flattened list is a simple representation, but I guess you might need to parse the type field with the numbers to build the hirachical structure in the front end. Would it be more clear and easier to parse if we add a little bit structure here:

Original flattened list: [gene1, transcript1, exon1, exon2, exon3, transcript2, exon1, exon2, exon3, transcriptN, exon1, exon2, exon3, gene2, transcript1, exon1, exon2, exon3, transcript2, exon1, exon2, exon3]

New structure: [gene1, [transcript1, exon1, exon2, exon3], [transcript2, exon1, exon2, exon3], [transcriptN, exon1, exon2, exon3], gene2, [transcript1, exon1, exon2, exon3], [transcript2, exon1, exon2, exon3]]

With include_transcripts = false these two will become the same.

When looping over this new structure, when we hit an object (dict), we know it's a gene; When we hit an array (list), we know it has the first element as the transcript and following ones as components of that transcript.

Two benifits we will have here: First the three-level data structure is embeded in this return list, so we don't need the GenomeFeatureType, and (I guess) it might be easier to parse on the frontend. The transcript_count may also be removed. Second, we can provide the original type information from ENSEMBL GFF file. They're put into the soClass field now but I think it might be more consistent to provide them as type, because we are also checking the type everywhere else in the frontend. It will also make the code more portable to other regular annotation tracks I guess.

For other fields in the exact JSON format, I think I will rename them a little bit to match what we have in the MongoDB. This will make later filters or renderrings easier by removing the translations back and forth.

haxiomic commented 6 years ago

Hey @yudong

Regarding types in the GFF3 files, the problem is that the full set of possible types is massive, you can see them all here: http://www.sequenceontology.org/browser/current_svn/term/SO:0000234

Often the frontend is interested in broader categories, if we used a big lookup table we'd need to update the frontend everytime a new type is encountered in a source dataset, otherwise we couldn't determine how to render a component. For example, after a new dataset is added it might introduce a new type scRNA, the renderer has to decide to display this as protein coding feature or not but it won't be able to determine

Ensemble's Chromosome 1 has quite a few:

gene ncRNA_gene pseudogene lnc_RNA mRNA pseudogenic_transcript transcript miRNA ncRNA rRNA scRNA snoRNA snRNA CDS exon five_prime_UTR three_prime_UTR

Ideally when importing GFF files the backend would compare the soType against the Sequence Ontology database to determine the broad categories

The full ontological type hierarchy is downloadable here: https://github.com/The-Sequence-Ontology/SO-Ontologies

For the strand field let's use the GFF3 representation directly, so like "strand": "+", "strand": "-" or "strand": "?"

--

Regarding structure, nested is OK too, I went with flat because I thought it'd save some effort on your end. I think if we use nested it should be full nesting, so

[
    {
        name: gene1,
        transcripts: [{
            name: transcript1,
            components: [
                { name: component1 }
            ]
        }, {
            name: transcript2
            components: [
                { name: component2 }
            ]
        }]
    }
    ...
]

(Something half-way between will make things more complicated)

But full nesting would make parsing very easy for end-users (JSON.parse() you get the full hierarchy)

You're right at this stage the "type" field doesn't contribute new information however it's there because I'm expecting that in the future we're going to have different sorts of genomic features that are at the same level as genes. So we might have something like [promotor, gene, speically-marked-region, gene, ...]

yudongqiu commented 6 years ago

@haxiomic Thanks a lot for the information! Sorry that the beginning of my last post was confusing. I think we are on the same track here.

Maybe we can determine things step by step?

First question is that if the /reference end point should return a hirachical structure of data. Based on our previous discussions, I think we agreed that this endpoint will only be used to display ENSEMBL dataset, which has the hirachical data structure itself. However, if we ever want to start displaying other datasets, it will become tricky how to define the hirachy. For example, from ExAC we know a variant has three affected genes, but they're not in the same region. How to display the variant at bp 1M when we are looking at a gene at bp 2M is a question, and where should that variant be added into the hirachy is another question. My feeling is that if we later need to display more and more related data from different datasets in the reference track, we might want to switch to a query based interaction, where the frontend launch queries to pull data and decide how to display them. This is not necessary for our first version as we are focusing on making it work with ENSEMBL GFF now.

Second, based on the first point, assuming we are working with the hirachy structure, the full nesting JSON format should be preferred. The general logic here is that the frontend should get the data structure it wants and the backend (which has more computing power and memory) should be responsible for building that data structure. This holds for all other endpoints.

Third, based on the second point that we have a full nested structure, I think we don't need to use GenomeFeatureType and transcript_count any more, because I guess they were used previously to build this nested structure.

saliksyed commented 6 years ago

@yudongqiu -- I think we just need this for the very specific use case of displaying the reference (so only Ensembl data of transcripts, exons and genes) -- we may want hierarchy for other datasets at some point (though we don't have any plans for this), but that is a future project outside of the scope.