cognoma / machine-learning

Machine learning for Project Cognoma

TOTAL column is skewing the heatmap in 3.TCGA-MLexample_Pathway #90

Closed George-Zipperlen closed 7 years ago

George-Zipperlen commented 7 years ago

Hi, @dhimmel and @gwaygenomics

I hope that this is the right place to ask a question about the code in 3.TCGA-MLexample_Pathway.ipynb

I'm working on converting the first heatmap, "percentage of different mutations across different cancer types", from seaborn to Altair/vega-lite, continuing from the fine work of @superkostya.

I have figured out how to use different color maps in vega-lite, e.g. the viridis color map.
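For readers who want to see what selecting a color map looks like, here is a minimal sketch of a Vega-Lite heatmap spec built as a plain Python dict. The gene names, disease names, and frequencies are made-up placeholders, not values from the notebook, and the v5 schema URL is an assumption (the thread does not say which Vega-Lite version was used).

```python
import json

# Hypothetical long-form records: one row per (disease, gene) cell.
# The numbers are illustrative only and do not come from the notebook.
records = [
    {"disease": "disease_1", "gene": "GENE_A", "frequency": 0.30},
    {"disease": "disease_1", "gene": "GENE_B", "frequency": 0.05},
    {"disease": "disease_2", "gene": "GENE_A", "frequency": 0.28},
    {"disease": "disease_2", "gene": "GENE_B", "frequency": 0.31},
]

spec = {
    # Assumption: Vega-Lite v5 schema.
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": records},
    "mark": "rect",  # one rectangle per heatmap cell
    "encoding": {
        "x": {"field": "gene", "type": "nominal"},
        "y": {"field": "disease", "type": "nominal"},
        "color": {
            "field": "frequency",
            "type": "quantitative",
            # The named color scheme is chosen on the color scale.
            "scale": {"scheme": "viridis"},
        },
    },
}

spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

The resulting JSON can be pasted into any Vega-Lite renderer. Altair generates a spec of this general shape from a DataFrame.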

[image attached in the original issue]

The 'TOTAL' column is not really a gene, and because it is a sum, its values are much larger than the per-gene values, which makes the differences among the other values less apparent in the display.

I can move this column to the right of the chart with some slicing and dicing, but I'm not sure it really belongs.
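As a rough illustration of the two options mentioned here (moving the column versus leaving it out), the following pandas sketch uses a small made-up frame in place of the notebook's heatmap_df. The gene and disease names and all values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for heatmap_df: rows are diseases, columns are
# genes plus the TOTAL summary column. Values are illustrative only.
heatmap_df = pd.DataFrame(
    {"TOTAL": [0.5, 0.75], "GENE_A": [0.25, 0.5], "GENE_B": [0.5, 0.25]},
    index=["disease_1", "disease_2"],
)

# Every column except the summary column.
gene_cols = [c for c in heatmap_df.columns if c != "TOTAL"]

# Option 1: keep TOTAL but place it after all gene columns.
total_last = heatmap_df[gene_cols + ["TOTAL"]]

# Option 2: drop TOTAL so the color scale spans only per-gene values.
genes_only = heatmap_df.drop(columns="TOTAL")

print(total_last)
print(genes_only)
```

Note that a Vega-Lite chart orders a nominal axis itself, so reordering DataFrame columns may not be enough; the list gene_cols + ["TOTAL"] could be passed as an explicit sort order on the x encoding.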

Here are the relevant lines from cognoma/machine-learning/3.TCGA-MLexample_Pathway.ipynb that create the 'TOTAL' column:

unique_pos = y.groupby('disease').apply(lambda x: x['indicator'].sum())
heatmap_df0 = y_full.groupby('disease').sum().assign(TOTAL = unique_pos)
heatmap_df = heatmap_df0.divide(y_full.disease.value_counts(sort=False).sort_index(), axis=0)

It is not clear to me what the TOTAL column means after the third line performs the divide operation. Is it now some kind of average?

Thanks for any clarification.

dhimmel commented 7 years ago

For reference, see the related PR https://github.com/cognoma/machine-learning/pull/91.

> It is not clear to me what the TOTAL column means after the 3rd line does the divide operation, is it now some kind of average?

The TOTAL column indicates what percentage of the samples for a given disease have a mutation in any of the displayed genes. In other words, the TOTAL value is guaranteed to be the maximum frequency in each disease's row: it is always at least as large as any single gene's frequency.
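The explanation above can be checked on a toy dataset. This is a sketch under stated assumptions: the six samples, the gene names, and the way the indicator column is derived (1 when any listed gene is mutated) are all hypothetical stand-ins for the notebook's y and y_full.

```python
import pandas as pd

# Hypothetical mutation calls: one row per sample, 1 = mutated.
y_full = pd.DataFrame({
    "disease": ["A", "A", "A", "A", "B", "B"],
    "GENE_1":  [1, 0, 0, 0, 1, 0],
    "GENE_2":  [0, 1, 0, 0, 1, 0],
})
genes = ["GENE_1", "GENE_2"]

# Assumption: indicator is 1 when a sample has a mutation in any gene.
y = y_full.assign(indicator=y_full[genes].max(axis=1))

# Samples per disease with at least one mutation.
unique_pos = y.groupby("disease")["indicator"].sum()

# Number of samples per disease, used as the denominator.
counts = y_full["disease"].value_counts().sort_index()

# Per-gene mutation counts plus TOTAL, divided row-wise by sample count.
heatmap_df = (
    y_full.groupby("disease")[genes].sum()
    .assign(TOTAL=unique_pos)
    .divide(counts, axis=0)
)
print(heatmap_df)

# TOTAL is never smaller than any single gene's frequency in its row.
assert (heatmap_df[genes].max(axis=1) <= heatmap_df["TOTAL"]).all()
```

In disease A the two mutations fall in different samples, so TOTAL exceeds each gene's frequency; in disease B they fall in the same sample, so TOTAL equals them. After the division, TOTAL is therefore a fraction of samples, not an average of the gene columns.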

George-Zipperlen commented 7 years ago

Thanks @dhimmel