Visualizing pre-classifier data

gwaybio commented 7 years ago

At the meetup last night (9/6) we discussed already existing services that provide a way of visualizing the input data to cognoma. @mike19106 had the idea that we may be able to query into the API of these services and use their visualization of the same data.

The main point is that we should not try to reinvent the wheel visualizing these data since other services already do this very well. I mentioned similar services in passing in cognoma/cancer-data#10 but I will list them here as well:

CBioPortal
COSMIC
NCI GDC
Broad Firehose

Someone from the frontend team could look into these service APIs (I know at least CBioPortal has one) to see if we can repurpose some of their visualizations.

gwaybio commented 7 years ago

also tagging @laszlokv for interest in other visualization services

BobMiller commented 7 years ago

Greg (@gwaygenomics),

I looked at each of the sites above to see if there was any kind of API for using their data viewers. Here's a summary of what I found:

Summary

The sites listed all have APIs for viewing their data, but no APIs we could use to view our data. However, two of the sites, cBioPortal and GDC Data Portal, are open source, so if there is a graph we like, we could lift the JavaScript code to create it in our website.

Detailed Notes

cBioPortal

The portal API is for accessing their data, not their data viewers, so it is not much help for us. Of course, seeing how they visualize things could always point us in the right direction. They allow users to visualize their own data with their tools, but we would need to fill out some forms and then load our data on their site, which is not a useful option for us. They do have two specific tools that anyone can use to view data without any forms or uploads. These tools use a web page as an interface, not an API. It's possible that we could automate some kind of interface to their web page if we really liked the visualization. The tools are:

OncoPrinter: http://www.cbioportal.org/oncoprinter.jsp
Mutation Mapper: http://www.cbioportal.org/mutation_mapper.jsp

On the plus side, their site is open source and on GitHub, so maybe we could repurpose some code if we really liked a visualization. Their license is at https://github.com/cBioPortal/cbioportal/blob/master/LICENSE. From a quick read, it looks like we can reuse their code if we want. As an aside, part of the team that maintains cBioPortal is right on Penn's campus at CHOP, so we might even have some local help (or at least somebody to ask a quick question).

COSMIC

From the license, it looks like nothing is open source, and there is no API for data viewers. So not much help here.

They do have a nice wheel-shaped viewer for the Genomic Landscape of Cancer (http://cancer.sanger.ac.uk/cosmic). Maybe we could use this idea somewhere?

GDC Data Portal

This website is run by NIH. It has an API, but only for actual data, not for their Data Viewers. So it is similar to CBioPortal.

However, the source code is available on GitHub (https://github.com/NCI-GDC), and it is open source. There is no overall license posted for the repository; instead, licenses are posted piecemeal in each section. From reviewing the section licenses, it appears we are free to modify and redistribute their source code. So if there is a visualization we want to use, we could modify their source to suit our needs.

FireHose Browser

Like the others, the API is only for viewing their data, so it is no help for us. The site uses PlotViz to create many of its graphs. As of today, PlotViz is not open source.

BobMiller commented 7 years ago

Oh yeah, as an alternative, I am looking into some open source JavaScript graphing packages to see if there are any we could use. If anyone has any suggestions, let me know.

gwaybio commented 7 years ago

Thanks for the thorough research @BobMiller - In your opinion, do you think it would be easier to adopt/modify these tools or build our own visualizations?

> viewing their data, but no apis we could use to view our data.

For several of these databases, all of the data should be the same. However, without direct versioning and release control, I think you're right to think of our data as "separate".

dhimmel commented 7 years ago

@BobMiller nice investigative research.

cBioPortal/cbioportal is licensed as GNU AGPLv3. AGPLv3 is a very pesky open source license. We'll need a legal expert to make sure we can use their code.

For the GDC Data Portal, US Government work is public domain. Oftentimes they will not state this (and may even put licenses on the work), but we should be able to use anything that is originally created by a government institute. I can open an issue about it if we find any code of theirs we want to use.

> do you think it would be easier to adopt/modify these tools or build our own visualizations?

Let's approach this on a case-by-case basis. Basically our cancer-data and frontend teams will have a visualization in mind. Then we'll look to see if there's any code we can take from them. My guess is that we may be able to find closer open source visualizations outside of the TCGA sites. Our visualizations will probably be using common types of plots such as heatmaps and bar charts.

bdolly commented 7 years ago

@dhimmel @BobMiller my initial thought on the data visualization implementation is to use D3.js, which is open source, really powerful, customizable, and used across a ton of projects, with lots of extensions and plugins.

This will allow us to roll our own visualizations with our own data. However, it will take some time to set up and likely won't make it into the MVP, but it is a good candidate for later iterations.

awm33 commented 7 years ago

Our dashboarding software at work uses [Highcharts](http://www.highcharts.com/). It's a lot easier to use than D3, but can't do super custom visualizations. I would suggest using Highcharts or another charting library, then "graduating" to D3 when there is a need. I find a lot of people try to use D3 because it's very trendy right now, but they don't really need it, and it is more complex to use.

bdolly commented 7 years ago

@awm33 I will agree with you on that. D3 is a beast of its own and could be a bit overkill, especially for a minimum viable product.

It's good that we have options on the table right now, but I feel like we're still getting ahead of ourselves, as we have very little groundwork laid for the front-end Angular stuff and no servers kicking us JSON to speak of.

dhimmel commented 7 years ago

Good point @awm33. I've also heard good things about vega. I'm a little worried that highcharts.js isn't open source since it forbids commercial use. This could restrict who can build on top of cognoma. Let's cross the visualization bridge when we get there!

BobMiller commented 7 years ago

I took a quick look at both Highcharts and D3.js. As both Andrew @awm33 and Ben @bdolly have pointed out, Highcharts is simpler to use, but D3 is more comprehensive. One big difference, though: D3 is open source and Highcharts is not, since it requires purchasing a license. This might be a deal breaker for this project. I also noticed that D3 seems to have a lot of explainers, examples, and tutorials. Even though it is more difficult, we may be able to copy an example and tweak it. I'm going to keep looking into this, as well as some other open source options.

The one thing that would help is coming up with a minimal set of graphs/charts we should produce. This is where input from Greg @gwaygenomics and Dan @dhimmel is invaluable.

Here's what I know:

The first algorithm is a "logistic regression with an elastic net using SGD". What this means from a UI perspective is that we are doing a logistic regression with a few extra parameters.
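To make the "few extra parameters" concrete, here is a minimal pure-Python sketch of logistic regression trained by SGD with an elastic net penalty. It is only an illustration, not the machine-learning team's actual implementation: the function names, the toy data, and the default values of `alpha`, `l1_ratio`, `lr`, and `epochs` are all assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, row):
    # Probability of the positive class for one sample.
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)


def fit_sgd_elastic_net(X, y, alpha=0.01, l1_ratio=0.15, lr=0.1, epochs=200, seed=0):
    """Fit logistic regression by SGD with an elastic net penalty.

    alpha scales the whole penalty; l1_ratio mixes L1 (1.0) and L2 (0.0).
    These are the "extra parameters" a UI would need to expose.
    """
    rng = random.Random(seed)
    weights = [0.0] * len(X[0])
    bias = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            # Gradient of the log loss with respect to the linear score.
            err = predict_proba(weights, bias, X[i]) - y[i]
            for j in range(len(weights)):
                sign = (weights[j] > 0) - (weights[j] < 0)
                penalty = alpha * (l1_ratio * sign + (1 - l1_ratio) * weights[j])
                weights[j] -= lr * (err * X[i][j] + penalty)
            bias -= lr * err
    return weights, bias


# Toy data, made up for illustration: feature 0 separates the classes.
X = [[2.0, 0.5], [1.5, -0.5], [1.0, 0.5], [-1.0, -0.5], [-1.5, 0.5], [-2.0, -0.5]]
y = [1, 1, 1, 0, 0, 0]
weights, bias = fit_sgd_elastic_net(X, y)
print("coefficients:", weights, "intercept:", bias)
```

The returned `weights` list is the table of regression coefficients discussed next.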

The main return for a logistic regression is a table of regression coefficients. These coefficients are the model that is produced. From this table of coefficients, we would want to show the largest magnitude coefficients in some form of chart (maybe a bar chart). The biggest coefficients have the largest effects on whether a gene mutation is a predictor of a specific cancer.
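Picking the largest magnitude coefficients could look roughly like the sketch below. The feature names and coefficient values are made up for illustration, and `top_coefficients` is a hypothetical helper, not part of any existing cognoma code.

```python
def top_coefficients(coefficients, k=10):
    """Return the k (feature, coefficient) pairs with the largest absolute value."""
    ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


# Made-up coefficients for illustration only.
example = {"GENE_A": -1.8, "GENE_B": 2.4, "GENE_C": 0.3, "GENE_D": -0.9, "GENE_E": 0.05}

# Crude text bar chart: bar length tracks magnitude, the sign is shown as a number.
for feature, coef in top_coefficients(example, k=3):
    bar = "#" * round(abs(coef) * 10)
    print(f"{feature:>7} {coef:+.2f} {bar}")
```

Sorting by absolute value keeps strongly negative coefficients in the chart, which matters because they are just as informative as strongly positive ones.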

The algorithm should also return a table of results showing how well the model did in predicting cancer with the test cases (tissue samples). This return will usually consist of a list of probabilities, and f values (or z scores) for each test case. From this data one can create ROC charts, area charts showing the statistical distribution of positives/negatives and their overlap, box and whisker charts, etc.
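As a rough sketch of how a ROC chart falls out of the list of probabilities, the code below sweeps a threshold over the predicted scores and computes the area under the curve. The labels and scores are invented for the example, and it assumes both classes are present in the test cases.

```python
def roc_points(y_true, y_score):
    """Return ROC points as (false positive rate, true positive rate) pairs."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Sweep the decision threshold from the highest score down.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample in a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Invented test cases: 1 = positive sample, scores are predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, y_score)
print(points)
print("AUC:", auc(points))
```

The `points` list is exactly what a line chart would plot, so any charting library that can draw x/y pairs can render the ROC curve.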

The question is what kind of charts are most critical/useful to the researcher?

dhimmel commented 7 years ago

@BobMiller let's start a new discussion for "post-classification visualizations".

See https://github.com/cognoma/machine-learning/pull/51 and hippo-output-schema.json for more information on the results coming out of machine learning.

dhimmel commented 7 years ago

Wanted to take note of two additional technologies: vega-lite and altair. Here's the deal:

altair is built atop vega-lite
vega-lite is built atop vega
vega is built atop D3

Will try to test out some of these frameworks to provide my opinion on their ease of use and functionality.

dhimmel commented 7 years ago

At a recent symposium I talked with the authors of altair and vega-lite. I'm now pretty convinced that this is the best way forward. Basically, the machine-learning team can use altair to create visualizations in our Python notebooks. These visualizations can be exported to a vega-lite JSON specification. The frontend can use the vega-lite specification and change the underlying data to produce in-browser visualizations.
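A minimal sketch of the "change the underlying data" step is below, using only the standard library. The spec here is written by hand rather than exported from altair, and the field names, the schema version in the URL, and the `with_data` helper are assumptions for illustration. It relies on Vega-Lite accepting inline data as `{"data": {"values": [...]}}`.

```python
import copy
import json

# A hand-written Vega-Lite bar chart specification (hypothetical example).
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "mark": "bar",
    "encoding": {
        "x": {"field": "feature", "type": "nominal"},
        "y": {"field": "coefficient", "type": "quantitative"},
    },
    "data": {"values": [{"feature": "placeholder", "coefficient": 0.0}]},
}


def with_data(spec, rows):
    """Return a copy of the spec with its inline data replaced by rows."""
    new_spec = copy.deepcopy(spec)
    new_spec["data"] = {"values": rows}
    return new_spec


# Made-up rows standing in for real classifier output.
rows = [
    {"feature": "GENE_A", "coefficient": -1.8},
    {"feature": "GENE_B", "coefficient": 2.4},
]
updated = with_data(spec, rows)

# The frontend would receive this JSON and hand it to a Vega-Lite renderer.
print(json.dumps(updated, indent=2))
```

Because the encoding only refers to field names, the same specification can be reused for every classifier as long as the rows keep the same keys.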

cognoma / frontend

Visualizing pre-classifier data #13