cognoma / machine-learning

Machine learning for Project Cognoma

Add first version of Vega ROC plots #77

Closed patrick-miller closed 7 years ago

patrick-miller commented 7 years ago

This introduces a Vega based ROC plot. Without the interactivity it looks like the following:

image

Currently, it takes a CSV file/data stream, but we can use a JSON one instead depending on how the backend team wants to serve it. The inputs to it are the false positive rate, the true positive rate, the curve type (train, test, CV). I plan on adding in the ability to specify the data set used (or model) so that we can split out the full feature model and the covariates only model.
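A minimal sketch of what such a long-format CSV could look like, with one row per ROC point. The column names `fpr`, `tpr`, and `curve_type` and the helper `roc_rows_to_csv` are assumptions for illustration; the thread only names the three inputs, not the actual file layout.

```python
import csv
import io

# Hypothetical column names: the thread only says the inputs are the false
# positive rate, the true positive rate, and the curve type.
FIELDNAMES = ["fpr", "tpr", "curve_type"]


def roc_rows_to_csv(rows):
    """Serialize (fpr, tpr, curve_type) tuples to long-format CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(FIELDNAMES)
    for fpr, tpr, curve_type in rows:
        writer.writerow([fpr, tpr, curve_type])
    return buffer.getvalue()


# Made-up points, one group of rows per curve type
rows = [
    (0.0, 0.0, "train"),
    (0.1, 0.8, "train"),
    (1.0, 1.0, "train"),
    (0.0, 0.0, "test"),
    (0.2, 0.7, "test"),
    (1.0, 1.0, "test"),
]
csv_text = roc_rows_to_csv(rows)
print(csv_text)
```

A long format like this would let a further column (for example the feature set or model) be added later without changing the existing ones.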

Let me know if you have any questions/comments.

dhimmel commented 7 years ago

@patrick-miller where do you think we should note the AUROC for each curve? Either as additional text in the legend or on hover?

dhimmel commented 7 years ago

I'm starting to think that being able to compute the TPRs and FPRs to create an ROC curve in JavaScript would be killer. There are up to 33 different cancers that can be selected. Users may be interested in selecting certain cancers, which will filter to a subset of samples (observations), so the ROC curve would change.

We could always have the backend recalculate, if doing this on the frontend is too burdensome. Not that this decision or implementation should be part of this PR. Just wanted to jot down my thoughts and get your opinion.
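To make the recalculation concrete, here is a rough sketch of recomputing ROC points after filtering to selected cancers. It is written in Python for illustration only (the same logic could be ported to the frontend), and the table layout, the cancer codes used as sample values, and the function names `roc_points` and `roc_for_cancers` are all assumptions, not part of this PR.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists with one point per distinct score threshold.

    labels: list of 0/1 true classes.
    scores: list of predicted scores, higher meaning more likely positive.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC needs at least one positive and one negative")
    # Rank samples from highest to lowest score
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a run of tied scores
        is_last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if is_last_of_tie:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def roc_for_cancers(table, selected):
    """Filter the prediction table to the selected cancers, then compute ROC."""
    subset = [(y, s) for cancer, y, s in table if cancer in selected]
    labels = [y for y, _ in subset]
    scores = [s for _, s in subset]
    return roc_points(labels, scores)


# Hypothetical prediction table: (cancer type, true label, predicted score)
predictions = [
    ("BRCA", 1, 0.9),
    ("BRCA", 0, 0.4),
    ("LUAD", 1, 0.8),
    ("LUAD", 0, 0.7),
    ("LUAD", 1, 0.3),
    ("BRCA", 0, 0.2),
]

fpr_all, tpr_all = roc_for_cancers(predictions, {"BRCA", "LUAD"})
fpr_brca, tpr_brca = roc_for_cancers(predictions, {"BRCA"})
print(fpr_brca, tpr_brca)
```

The curve for the filtered subset differs from the curve for all samples, which is why a filter change would require either a new backend call or a computation like this on the client.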

patrick-miller commented 7 years ago

I'll give it some thought, though I doubt I will have a strong opinion between the versions. I think Vega 3 is still in development. As for Vega vs. Vega-Lite, you can definitely do more with Vega. I'm not sure whether you can do any interactive stuff with Vega-Lite (I have only used Vega in the past).

There are a few different places we could put the AUROC. We can put it in the legend like you have been doing in Python. We can put it on hover (would switch to keeping hover on permanently). We can put it to the right of the lines. I'll play around with adding it in some different places in a separate pull request.

In terms of the way the data is going to be served...anytime a user filters to a subset of cancers we would need to make a server side call to the data set, correct? Or are you imagining storing all of the prediction data in the frontend? We can certainly move a step to the frontend, I'm just not sure if this will really speed things up that much if you have the data cached in Redis on the backend anyway. Correct me if I'm wrong, but isn't the difference just IO?

dhimmel commented 7 years ago

> Correct me if I'm wrong, but isn't the difference just IO?

IO and programming language. The JavaScript method could be done entirely client side. Otherwise, we can use Python via the backend to compute the ROC curve.

> In terms of the way the data is going to be served...anytime a user filters to a subset of cancers we would need to make a server side call to the data set, correct?

Unless we load the entire prediction table into the browser. This table is at most 8,000 rows, so it's a possibility.

Let's defer any decisions here until we have a better idea of the results viewer.

patrick-miller commented 7 years ago

Here is how the visualization looks now. We can play with how the interactivity works once I start putting together the AUROC for each curve.

image

patrick-miller commented 7 years ago

Made the small tweaks and switched to dashed lines for the covariates. It wasn't exactly straightforward, so there may be an easier way to do it that I couldn't find. Latest update:

image

dhimmel commented 7 years ago

@patrick-miller nice. I'm thinking we want to remove the dots (and keep just the lines), since there can be thousands of actual points in some of our ROC curves.

For the "feature set" legend, is it possible to use a line rather than a point to show the difference between solid and dashed? No big deal if this is too difficult.

Also, how hard is it to add some transparency/alpha to the lines? I'm thinking we may have overlapping ROCs.

dhimmel commented 7 years ago

Would love to get you some real data to plug in.

patrick-miller commented 7 years ago

Agreed on removing the dots. They are placeholders for now for the interactive portion. I'm still considering how best to display it (thoughts are very welcome!).

I'll switch the legend to a line. I'm pretty sure it should be possible.

Transparency should be easy. I'll play around with some values. I'll do a data dump from one of the notebooks so that I can work out which values will be better.

patrick-miller commented 7 years ago

I added 'real' data for the ROC plot (it comes from the 2.TCGA-MLexample notebook). For the covariates-only model, I fabricated the data. I took out the dots to make the rendering faster, but we will probably want to sample from the full ROC data that sklearn outputs, since it has too many FPR and TPR breaks.

Things left to decide on: interactivity and where to put the AUROC for each feature set/partition split.

image

dhimmel commented 7 years ago

@patrick-miller, looks great and thanks for creating the more realistic data.

> I took out the dots to make the rendering faster, but we will probably want to sample from the full ROC data that sklearn outputs

Since most points in our ROC curve lie on a straight segment between their neighbors and are not points where the curve actually changes direction, we can prune many of the points without any change to the curve! Here is an R implementation of this method. It shouldn't be hard for us to implement this in Python.
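A minimal Python sketch of this kind of pruning, assuming the FPR and TPR values arrive as two equal-length lists ordered along the curve. This is not a port of the R implementation mentioned above; the function name `prune_collinear` and the `tol` parameter are hypothetical.

```python
def prune_collinear(fpr, tpr, tol=1e-12):
    """Drop interior points that lie on the line through their neighbors.

    A point is kept only if it forms a corner with the last kept point
    and the next original point, so the drawn curve is unchanged.
    """
    if len(fpr) != len(tpr):
        raise ValueError("fpr and tpr must have the same length")
    if len(fpr) < 3:
        return list(fpr), list(tpr)
    keep_x, keep_y = [fpr[0]], [tpr[0]]
    for i in range(1, len(fpr) - 1):
        x0, y0 = keep_x[-1], keep_y[-1]
        x1, y1 = fpr[i], tpr[i]
        x2, y2 = fpr[i + 1], tpr[i + 1]
        # The cross product is zero when the three points are collinear
        cross = (x1 - x0) * (y2 - y0) - (y1 - y0) * (x2 - x0)
        if abs(cross) > tol:
            keep_x.append(x1)
            keep_y.append(y1)
    # Always keep the final point
    keep_x.append(fpr[-1])
    keep_y.append(tpr[-1])
    return keep_x, keep_y


# Five points, but only the corner at (0, 1) changes direction
pruned_fpr, pruned_tpr = prune_collinear(
    [0.0, 0.0, 0.0, 0.5, 1.0],
    [0.0, 0.5, 1.0, 1.0, 1.0],
)
print(pruned_fpr, pruned_tpr)
```

Because only collinear points are removed, the pruned curve and its area are identical to the original, unlike random sampling of the points.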

> Things left to decide on: interactivity and where to put the AUROC for each feature set/partition split.

For the AUROC, I think the two options are in the tooltip that appears on hover or in an additional legend. The additional legend could just contain the linetypes and the AUROC%.
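Whichever place is chosen, each curve needs one AUROC value. As a rough sketch, the area can be computed from the plotted points with the trapezoidal rule. The function name `auroc` and the sample curves below are made up for illustration.

```python
def auroc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule.

    Assumes the points are ordered by increasing false positive rate.
    """
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2
        area += width * mean_height
    return area


# Hypothetical curves keyed by partition: a perfect one and a chance-level one
curves = {
    "train": ([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]),
    "test": ([0.0, 0.5, 1.0], [0.0, 0.5, 1.0]),
}
auroc_by_curve = {name: auroc(fpr, tpr) for name, (fpr, tpr) in curves.items()}
print(auroc_by_curve)
```

A mapping like `auroc_by_curve` could feed either a tooltip or an extra legend entry per line type.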

patrick-miller commented 7 years ago

I got some interactivity working. It isn't perfect, but it is definitely a start.

image

dhimmel commented 7 years ago

> I got some interactivity working. It isn't perfect, but it is definitely a start.

Looks great. My only suggestion would be to make AUC a percentage, and to give the TPR, FPR, and AUC percentages 1 decimal place of precision, like TPR 88.1%.

patrick-miller commented 7 years ago

Ok, I formatted the interactive legend to have 1 decimal point and all three figures are %s.
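For reference, the requested format (percentages with one decimal place) looks like this in Python. This only illustrates the target output; it is not the Vega code from this PR, and `format_tooltip` is a hypothetical name.

```python
def format_tooltip(tpr, fpr, auc):
    """Format rates given as fractions in [0, 1] as one-decimal percentages."""
    return f"TPR {tpr:.1%}, FPR {fpr:.1%}, AUC {auc:.1%}"


tooltip = format_tooltip(0.881, 0.125, 0.9)
print(tooltip)  # TPR 88.1%, FPR 12.5%, AUC 90.0%
```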

dhimmel commented 7 years ago

Great. I got the visualization up and running locally. See

vega-roc

I noticed the box overlaps with the AUC percentage sign. Is there an easy fix? If not, I'm happy to merge as is! Thanks for seeing this PR through. Can't wait till we deploy it.

patrick-miller commented 7 years ago

Yep, it is very easy. Right now, a lot of those parameters are hard coded, so I'm going to look at changing that in the future.