MaayanLab / clustergrammer-widget

The Clustergrammer interactive Jupyter notebook widget
http://nbviewer.jupyter.org/github/MaayanLab/clustergrammer-widget/blob/master/Running_clustergrammer_widget.ipynb
MIT License

Members of a K-means cluster #3

Open Jonathan-Abrahams opened 6 years ago

Jonathan-Abrahams commented 6 years ago

Hi,

I am struggling to find the right documentation that details which rows or columns have been clustered into a specific K-means cluster.

Is this feature available? If not, what would you suggest as the best way to go about this?

I have been looking in the notebooks of the examples but cannot find it. The cytof notebook does detail a similar process, but it is more complicated.

cornhundred commented 6 years ago

Hi,

This feature is available, but the documentation does not currently discuss it. This information is returned as a NumPy array by the downsample method. See below for an example:

ds_data = net.downsample(axis='row', ds_type='kmeans', num_samples=5)

This array (referred to as ds_data above) is the same length as the original rows/columns, and the integer in each element refers to the cluster that row/column has been assigned to. Please see this quick example notebook that goes into more detail. We will also update the documentation to address this, and thank you for bringing this to our attention. Let us know if this works or if you have any other questions.
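As a minimal sketch of how such an array of cluster labels can be turned into a list of members per cluster: the row names and label values below are made-up stand-ins (in practice the labels would be the array described above and the names would come from your own matrix), and `members_by_cluster` is a hypothetical helper, not part of Clustergrammer.

```python
import numpy as np

# Hypothetical stand-ins: replace with your own row names and with the
# label array described above (one integer cluster label per row).
row_names = ["geneA", "geneB", "geneC", "geneD", "geneE", "geneF"]
cluster_ids = np.array([0, 2, 1, 0, 2, 2])


def members_by_cluster(names, labels):
    """Map each cluster label to the list of names assigned to it."""
    members = {}
    for name, label in zip(names, labels):
        members.setdefault(int(label), []).append(name)
    return members


clusters = members_by_cluster(row_names, cluster_ids)

# Print the members of each cluster, in order of cluster label.
for label in sorted(clusters):
    print(label, clusters[label])
```

This relies only on the label array being in the same order as the original rows/columns, as stated above.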

Jonathan-Abrahams commented 6 years ago

Thank you for your fast response!

That's very helpful and does solve my original query. I am making good progress on applying this to my own data.

Now I am wondering what the best way to display this on the heatmap would be?

cornhundred commented 6 years ago

Great, I'm glad that helped. I updated the example notebook to show how the K-means cluster IDs can be overlaid on the original data by adding an additional row category (see below).

[Screenshot: screen shot 2017-11-28 at 10 33 23 am]
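A rough sketch of the idea, for readers without access to the notebook: it assumes the convention of encoding a row category as a tuple of the row name followed by a "Title: value" string, which should be checked against the Clustergrammer documentation. The names, labels, and the helper `add_cluster_category` are hypothetical.

```python
# Hypothetical stand-ins for the original row names and their K-means labels.
row_names = ["geneA", "geneB", "geneC", "geneD"]
cluster_ids = [0, 2, 1, 0]


def add_cluster_category(names, labels, title="K-means cluster"):
    """Pair each row name with a 'Title: value' category string."""
    return [(name, f"{title}: {int(label)}") for name, label in zip(names, labels)]


labeled_rows = add_cluster_category(row_names, cluster_ids)

# Assumption: these tuples would then replace the row index of the data
# before it is loaded for visualization, e.g. df.index = labeled_rows
# for a pandas DataFrame df.
for row in labeled_rows:
    print(row)
```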

Let us know if that answered your question and if you have any other questions.

Jonathan-Abrahams commented 6 years ago

My dataset consists of 1000 bacterial strains and data relating to their ~3000 genes. My primary motivation for downsampling is to simplify the heatmap to a manageable size.

The solution you have proposed does not address this, as the resulting heatmap is the same size as the original dataset. I can see useful tips in your update on modifying labels and adding columns. I am sure I will be able to incorporate these at a later date.

My ideal solution would be almost the reverse of what you proposed: the K-means-clustered heatmap, with details of which genes are represented by each K-means cluster. I can see many problems with what I am proposing, and I am trying it out myself. It may also go against my aim of simplifying the data. What do you think?

I have been able to plot heatmaps of individual K-means clusters, but this is not nearly as elegant as it could be, I'm sure!

It seems that listing gene names in a column beside the K-means clusters would be a messy (and probably impossible) way to show such information.

cornhundred commented 6 years ago

I see, it sounds like your matrix is ~1,000 columns/strains by ~3,000 rows/genes and you are looking to reduce the size of your dataset to something more manageable. It will probably be difficult to show the gene list of the downsampled clusters (and this is not currently supported by Clustergrammer).

I would recommend a couple of things based on our experience with similar datasets.

We used Clustergrammer to visualize the Cancer Cell Line Encyclopedia, which is ~1,000 columns/cancer-cell-lines by ~20,000 rows/genes (see CCLE Notebook). We first filtered for the top 1,000 most variable genes and then downsampled our cell lines to obtain 100 cell-line clusters (downsampling also keeps track of the most common category in each cluster). So if you can filter your genes down (based on variance or sum), then something like this might be useful. The MNIST notebook also does something similar. If you can add some category to your genes, then this would be tracked with the downsampling, but it is not exactly what you are asking for.
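The filter-then-downsample idea can be sketched in plain NumPy as follows. This is only an illustration under stated assumptions: the tiny matrix, the names, and the column cluster labels are made up, both helper functions are hypothetical, and averaging the columns within each cluster is one simple way to collapse a cluster, not necessarily what Clustergrammer does internally.

```python
import numpy as np


def top_variable_rows(matrix, row_names, n_keep):
    """Keep the n_keep rows with the highest variance, in original order."""
    variances = matrix.var(axis=1)
    keep = np.sort(np.argsort(variances)[::-1][:n_keep])
    return matrix[keep], [row_names[i] for i in keep]


def average_columns_by_cluster(matrix, labels):
    """Collapse columns that share a cluster label into their mean."""
    labels = np.asarray(labels)
    cluster_labels = np.unique(labels)
    means = [matrix[:, labels == c].mean(axis=1) for c in cluster_labels]
    return np.column_stack(means), cluster_labels


# Made-up example: 4 genes (rows) by 4 strains (columns).
data = np.array([
    [1.0, 1.0, 1.0, 1.0],    # no variance
    [0.0, 10.0, 0.0, 10.0],  # high variance
    [2.0, 3.0, 2.0, 3.0],    # low variance
    [5.0, -5.0, 5.0, -5.0],  # high variance
])
genes = ["geneA", "geneB", "geneC", "geneD"]
column_clusters = [0, 1, 0, 1]  # hypothetical K-means labels for the columns

filtered, kept_genes = top_variable_rows(data, genes, n_keep=2)
reduced, cluster_labels = average_columns_by_cluster(filtered, column_clusters)

print(kept_genes)
print(reduced)
```

Filtering by sum instead of variance would only require swapping `matrix.var(axis=1)` for `matrix.sum(axis=1)`.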

Finally, the next version of Clustergrammer will be built in WebGL, which can handle much larger datasets like these. Here's a very simple visualization of a random matrix of 1,000 rows by 1,000 columns built in WebGL to demonstrate how much data can be handled. We can keep you up to date on this progress.

Jonathan-Abrahams commented 6 years ago

The link to the notebook you made is now dead unfortunately!

cornhundred commented 6 years ago

Which notebook? Can you provide the link? The links I checked on this thread still appear to work.

Jonathan-Abrahams commented 6 years ago

You are right!

I should have checked today.

I am certain they were down yesterday.

cornhundred commented 6 years ago

No problem. Did the approaches we recommended work out?