cognoma / machine-learning

Machine learning for Project Cognoma

Using Vega-Lite and Altair API to visualize the data #74

Closed superkostya closed 7 years ago

superkostya commented 7 years ago

This is a preliminary result of using Vega-Lite together with Altair to visualize some of the obtained results, e.g. heatmaps. The main objective is to take advantage of Vega-Lite's lean and sufficiently flexible JSON format for graphs, which should allow us to generate the figures (at least some of them) on the front end, thereby reducing network traffic and improving performance.

The changes are as follows:

  1. Added a new file explore/Visualization_with_Vega-Lite_and_Altair.ipynb
  2. Modified the file environment.yml

dhimmel commented 7 years ago

Cool, really exciting. Can't wait to take a deeper look.

Can you export the notebook to a script for easier code review? From explore, run:

jupyter nbconvert --to=script Visualization_with_Vega-Lite_and_Altair.ipynb

Also, I suggest moving this analysis to a new directory inside of explore. In this directory, you can also export a vega-lite-heatmap.json specification as its own file.

superkostya commented 7 years ago

Done. The created JSON file has a few changes already applied to it to improve the appearance. As I pointed out in the notebook, more formatting options need to be explored.

dhimmel commented 7 years ago

I think the next step is to touch up the vega-lite specification separately from altair. You've started to do this in your final notebook cell. What I think would be ideal is to separate the JSON for the dataset from the JSON of the vega-lite spec.

See this function for exporting a pandas.DataFrame to the vega-lite JSON specification. Once you upload the data to GitHub, you can modify the vega-lite spec to load the data from a URL (example).

Then we'll be able to give the JSON spec directly to the frontend and they'll generate the data.
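
A minimal sketch of that separation, not the project's actual export function: the tidy records go into one JSON file, and the vega-lite spec refers to that file by URL instead of embedding the values. The file names, the helper names build_spec and write_heatmap_files, and the use of the "rect" mark (available in Vega-Lite 2 and later) are assumptions made for illustration. Plain json is used here in place of pandas so the example has no third-party dependencies.

```python
import json
import os
import tempfile

# Tidy records: one (disease, gene_symbol, count) entry per heatmap cell.
records = [
    {"disease": "adrenocortical cancer", "gene_symbol": "AJUBA",
     "count": 0.01282051282051282},
    {"disease": "adrenocortical cancer", "gene_symbol": "AMOT", "count": 0},
]


def build_spec(data_url):
    """Return a vega-lite heatmap spec that loads its data from data_url.

    The spec holds no "values" key, so the dataset stays in its own file.
    """
    return {
        "data": {"url": data_url},
        "mark": "rect",  # assumption: Vega-Lite 2+; older versions lack "rect"
        "encoding": {
            "x": {"field": "gene_symbol", "type": "nominal"},
            "y": {"field": "disease", "type": "nominal"},
            "color": {"field": "count", "type": "quantitative"},
        },
    }


def write_heatmap_files(records, out_dir,
                        data_name="heatmap-data.json",   # hypothetical name
                        spec_name="heatmap-spec.json"):  # hypothetical name
    """Write the dataset and the spec as two separate JSON files."""
    data_path = os.path.join(out_dir, data_name)
    spec_path = os.path.join(out_dir, spec_name)
    with open(data_path, "w") as data_file:
        json.dump(records, data_file, indent=2)
    with open(spec_path, "w") as spec_file:
        # A relative URL keeps the spec valid wherever both files are hosted
        # side by side.
        json.dump(build_spec("./" + data_name), spec_file, indent=2)
    return data_path, spec_path


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    print(write_heatmap_files(records, out_dir))
```

With this layout the spec can be edited by hand without regenerating the data, and the data file can be replaced without touching the spec.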

dhimmel commented 7 years ago

@bdolly currently @superkostya is generating the heatmap from the following data structure:

"data": {
  "values": [
    {
      "disease": "adrenocortical cancer",
      "gene_symbol": "AJUBA",
      "count": 0.01282051282051282
    },
    {
      "disease": "adrenocortical cancer",
      "gene_symbol": "AMOT",
      "count": 0
    },
    {
      "disease": "adrenocortical cancer",
      "gene_symbol": "AMOTL1",
      "count": 0
    },
    {
      "disease": "adrenocortical cancer",
      "gene_symbol": "AMOTL2",
      "count": 0.01282051282051282
    },
    {
      "disease": "adrenocortical cancer",
      "gene_symbol": "LATS1",
      "count": 0
    }
  ]
}

Each value encodes a single cell in the heatmap and is a (disease, gene, frequency-of-mutation) combination. The idea is for the heatmap to show all of the diseases and genes the user has selected. We can of course change what types of IDs we're using for genes and diseases.

@bdolly can the frontend generate the above data structure? Or should we accommodate a different input data structure?
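
For reference, a small sketch of how a disease-by-gene matrix maps onto the one-record-per-cell structure above, and back. The nested-dict input format and the function names wide_to_tidy and tidy_to_wide are assumptions for illustration only; they are not taken from the notebook.

```python
def wide_to_tidy(matrix):
    """Flatten {disease: {gene_symbol: frequency}} into one record per cell."""
    records = []
    for disease, gene_freqs in matrix.items():
        for gene_symbol, freq in gene_freqs.items():
            records.append(
                {"disease": disease, "gene_symbol": gene_symbol, "count": freq}
            )
    return records


def tidy_to_wide(records):
    """Rebuild the nested {disease: {gene_symbol: frequency}} mapping."""
    matrix = {}
    for record in records:
        row = matrix.setdefault(record["disease"], {})
        row[record["gene_symbol"]] = record["count"]
    return matrix


if __name__ == "__main__":
    # Hypothetical wide input using values from the example above.
    wide = {
        "adrenocortical cancer": {
            "AJUBA": 0.01282051282051282,
            "AMOT": 0,
        }
    }
    tidy = wide_to_tidy(wide)
    print({"data": {"values": tidy}})
```

Either side could run this conversion, so the choice of input structure mostly depends on which form the frontend already has on hand.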

George-Zipperlen commented 7 years ago

Hi Kostyra, and Daniel,

Happy New Year.

Sorry to be late replying, just getting back into the swing of things.

The Chart object can take a file/url argument instead of a dataframe. This is how I’ve been doing it:

from altair import Chart

# heatmap cell size in pixels, matches default text size
# in jupyter notebook.
hm_cell_pixel_size = (8, 8)

hm_data_url = '3-tcga-hmdata.csv'

# hm_df is the tidied/normalized dataframe previously computed,
# or passed in once this is made into a function
hm_df.to_csv(hm_data_url)

hm_chart_url = '3-tcga-hmchart.json'
hm_chart = Chart(hm_data_url).mark_text(
    # other parameters, ...
)
# write the chart spec; the with-block closes the file automatically
with open(hm_chart_url, 'w') as hm_chart_file:
    print(hm_chart.to_json(indent=2), file=hm_chart_file)

# altair chart display must be on the last line of jupyter cell
# this is a gotcha I found buried in the altair documentation
hm_chart

Minor nit: The TOTAL column should be moved to the right. This should be an easy slice and dice. Better yet, make it a parallel, single column heatmap, as it is not a gene_symbol. Compute it as part of the heatmap display process, rather than in the disease/gene_symbol dataframe as is currently done in "3.TCGA-MLexample_Pathway"

Management of the file name space, and deletion of .csv and .json files when no longer needed will need to be coordinated.

superkostya commented 7 years ago

Daniel, several changes have been made per your suggestions:

  1. Dataframes have been renamed to more meaningful names, see the notebook "Visualization_with_Vega-Lite_and_Altair.ipynb"
  2. The data for the heatmap has been stored in a separate file (heatmap_data_Altair_compatible.json). Note that this is a so-called tidy (aka long) format, which is one of the requirements of the Altair API.
  3. The Vega-Lite-compatible JSON file for the heatmap has been created (heatmap_vega-lite.json). It does not contain the dataset; instead, the data to be visualized is read from the file "heatmap_data_Altair_compatible.json". See the line "url": "./heatmap_data_Altair_compatible.json"

dhimmel commented 7 years ago

Nice, looks almost ready to merge.

Can we rename explore/visualization_vega_lite_altair/ to explore/heatmap-vega-lite/?

Would be nice if we could change the underscores to dashes in paths. So Visualization_with_Vega-Lite_and_Altair.ipynb becomes Visualization-with-Vega-Lite-and-Altair.ipynb. Or even simplify to heatmap.ipynb.

superkostya commented 7 years ago

Done. Files and the main directory are renamed per your suggestion.