cognoma / cancer-data

TCGA data acquisition and processing for Project Cognoma

[WIP] Create JSON files for frontend consumption #29

Open dhimmel opened 7 years ago

dhimmel commented 7 years ago

Work in progress (WIP).

dhimmel commented 7 years ago

This pull request creates a mapping from gene (as Entrez GeneIDs) to the list of mutated samples (as TCGA sample IDs). This dictionary/JSON object is called gene_to_mutated_samples. As a JSON text file, it is 20.68 MB uncompressed and 2.97 MB when gzip compressed.

This pull request also creates disease_to_samples, a dictionary mapping each disease acronym to its sample IDs. This file is small (0.17 MB) and thus not a concern.
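A minimal sketch of how such a mapping could be built and its size measured. The sample IDs, the `mutation_calls` list, and the helper name are made up for illustration; this is not the code in the pull request.

```python
import gzip
import json
from collections import defaultdict

# Hypothetical (sample_id, entrez_gene_id) pairs standing in for the real mutation calls.
mutation_calls = [
    ("TCGA-02-0047-01", 2641),
    ("TCGA-02-0055-01", 2641),
    ("TCGA-02-0047-01", 340024),
]


def build_gene_to_mutated_samples(calls):
    """Group sample IDs by gene; sorted lists keep the JSON output deterministic."""
    grouped = defaultdict(set)
    for sample_id, gene_id in calls:
        grouped[gene_id].add(sample_id)
    # JSON object keys must be strings, so the GeneIDs are converted here.
    return {str(gene): sorted(samples) for gene, samples in sorted(grouped.items())}


gene_to_mutated_samples = build_gene_to_mutated_samples(mutation_calls)

# Compare the serialized size with and without gzip compression.
raw = json.dumps(gene_to_mutated_samples).encode("utf-8")
compressed = gzip.compress(raw)
print(f"{len(raw)} bytes raw, {len(compressed)} bytes gzipped")
```

On a toy input like this the gzipped size is not representative; the compression ratio only becomes meaningful on the full dataset.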

The goal of disease_to_samples and gene_to_mutated_samples was to allow the frontend to load these entire objects and then perform efficient set operations to get sample/positive/negative counts. For example, the user may have selected diseases = {'GBM', 'COAD', 'LUNG'} and mutations = {2641, 340024}.

The frontend would do the equivalent of this Python in JavaScript:

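# union of samples mutated in any selected gene
mutated_samples = set()
for mutation in mutations:
    # update() accepts any iterable, so it works whether the values are lists (as loaded from JSON) or sets
    mutated_samples.update(gene_to_mutated_samples[mutation])

# union of samples belonging to any selected disease
selected_samples = set()
for disease in diseases:
    selected_samples.update(disease_to_samples[disease])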

# counts
n_samples = len(selected_samples)
n_positives = len(selected_samples & mutated_samples)
n_negatives = n_samples - n_positives

Alerting @bdolly, @awm33, @cgreene for discussion on how to proceed.

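My questions are:

1. Is 20.68 MB too big to pass to a browser?
2. Will the payload be compressed in transit?
3. Will the payload be cached?
4. Will this consume too much browser memory (RAM)?
5. Should we switch to an int ID for samples to cut down this size?
6. Or should we just have the frontend query the backend for these stats?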

awm33 commented 7 years ago

Is 20.68 MB too big to pass to a browser?

It depends. I assume most people will be using this from a desktop with Wi-Fi or a wired connection, so from a pure byte-transfer standpoint, no.

Will the payload be compressed in transit?

We can and should set up gzip compression on the server.
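If the backend is a Django project (the thread mentions Django plugins but does not show the settings), one option is Django's built-in `GZipMiddleware`. The fragment below is a hedged sketch, and the second entry is only a placeholder for whatever middleware the project already uses:

```python
# settings.py (fragment). Assumption: a standard Django MIDDLEWARE list.
MIDDLEWARE = [
    # Django's documentation advises listing this before any middleware
    # that needs to read or write the response body.
    "django.middleware.gzip.GZipMiddleware",
    # Placeholder for the project's existing middleware entries.
    "django.middleware.common.CommonMiddleware",
]
```

Compression could equally be handled by a reverse proxy in front of the application; which layer does it is a deployment choice the thread leaves open.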

Will the payload be cached?

If the correct headers are set by the server, yes. Other methods could be used as well, beyond HTTP caching, like localStorage.
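As a rough illustration of what "the correct headers" could look like, the sketch below builds a `Cache-Control` and a content-derived `ETag` header for a static JSON payload. The function name, the one-day lifetime, and the example payload are assumptions, not settings taken from the project.

```python
import hashlib


def cache_headers(payload: bytes, max_age_seconds: int = 86400) -> dict:
    """Build response headers that let browsers cache and revalidate a payload."""
    # The ETag is derived from the content, so it changes whenever the payload does.
    etag = '"' + hashlib.sha256(payload).hexdigest() + '"'
    return {
        "Cache-Control": f"public, max-age={max_age_seconds}",
        "ETag": etag,
    }


# Hypothetical payload standing in for the serialized disease_to_samples file.
headers = cache_headers(b'{"GBM": ["TCGA-02-0047-01"]}')
```

With headers like these, a browser can reuse its cached copy until `max-age` expires and then revalidate using the `ETag` instead of downloading the file again.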

Will this consume too much browser memory (RAM)?

Maybe. I'd be more worried about the access time. JavaScript is single-threaded, so if we were to calculate something like this client-side, I would use a web worker.

Should we switch to an int ID for samples to cut down this size?

What matters is access performance, and these should be hash maps in JS, so I don't think that would buy you much, if anything.
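For the size side of this question, a quick experiment along the following lines could show how much integer IDs would actually save. The data and variable names are invented for illustration, and the integer-encoded payload includes the index-to-sample lookup table, since the frontend would still need it.

```python
import json

# Hypothetical data: two genes that share a mutated sample.
gene_to_mutated_samples = {
    "2641": ["TCGA-02-0047-01", "TCGA-02-0055-01"],
    "340024": ["TCGA-02-0047-01"],
}

# Assign each distinct sample ID an integer index, in sorted order.
all_samples = sorted({s for samples in gene_to_mutated_samples.values() for s in samples})
sample_to_index = {sample: i for i, sample in enumerate(all_samples)}

# Re-encode every gene's sample list as a list of integer indexes.
gene_to_sample_indexes = {
    gene: [sample_to_index[s] for s in samples]
    for gene, samples in gene_to_mutated_samples.items()
}

# The lookup table must ship too, so it counts toward the int-encoded size.
string_size = len(json.dumps(gene_to_mutated_samples))
int_size = len(json.dumps({"samples": all_samples, "genes": gene_to_sample_indexes}))
print(string_size, int_size)
```

Any saving depends on how often each sample ID repeats across genes, and gzip already removes much of that repetition, so the comparison is only meaningful on the real file after compression.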

Or should we just have the frontend query the backend for these stats?

I would lean towards this for performance and API reasons. If we are also thinking of others using our API, this would make it easier for them. We're already using the django filter plugin which allows for querying on related model fields. This would be added to the /samples endpoint. We may want to use the field selection plugin to limit how much data is returned, assuming you just need the ids.
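To make the idea concrete, here is a sketch of how a client might compose such a request. The host and every query parameter name (`disease`, `mutated_gene`, `fields`) are hypothetical; the real names would depend on how the filter and field selection plugins are configured on the /samples endpoint.

```python
from urllib.parse import urlencode


def build_samples_url(base_url, diseases, mutated_genes, fields=("sample_id",)):
    """Compose a query URL for a hypothetical /samples endpoint."""
    # All parameter names here are assumptions made for illustration.
    params = {
        "disease": ",".join(sorted(diseases)),
        "mutated_gene": ",".join(str(g) for g in sorted(mutated_genes)),
        "fields": ",".join(fields),
    }
    return f"{base_url}/samples?{urlencode(params)}"


url = build_samples_url("https://api.example.org", {"GBM", "COAD"}, {2641, 340024})
print(url)
```

Restricting `fields` to the sample ID keeps each response small, which is the point of using field selection here.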

bdolly commented 7 years ago

@awm33 I like the idea of using the field selection plugin for this rather than loading a large JSON file on app load. I think firing off small requests as the user types a search will be efficient, since the plugin will return smaller, faster responses.

awm33 commented 7 years ago

@bdolly Cool

I created an issue/task for this https://github.com/cognoma/core-service/issues/33