handle categorical data legend

DingWB / PyComplexHeatmap

PyComplexHeatmap: A Python package to plot complex heatmap (clustermap)

https://dingwb.github.io/PyComplexHeatmap/

MIT License


handle categorical data legend #26

Closed faridrashidi closed 1 year ago

faridrashidi commented 1 year ago

Following issue #25. As a test example:

import pandas as pd
from PyComplexHeatmap import *

data = pd.DataFrame(
    [
        [0, 1],
        [1, 0],
        [0, 1],
        [1, 0],
        [0, 1],
    ]
)
ClusterMapPlotter(
    data=data,
    linewidth=1,
    row_cluster=False,
    col_cluster=False,
    cmap=["blue", "red"],
    legend_width=20,
)

faridrashidi commented 1 year ago

I agree with the first suggestion, but I don't understand why you're trying to allow separate colors for different columns. By "columns", do you mean the unique "values" of the data?

DingWB commented 1 year ago

For example, if you have the following data frame:

[
[A, Tumor, Stage2],
[B, Normal, Stage1],
[C, Tumor, Stage3],
[D, Normal, Stage2]
]

We have 3 columns here, but we do not want to use the same colormap for those 3 columns. In this case, the unique values should not be [A, B, C, D, Tumor, Normal, Stage1, Stage2, Stage3]. We should compute the unique values separately for each column and use a different colormap for each (e.g. Set1, Dark2).
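
A minimal sketch of the per-column idea described above, using plain pandas rather than PyComplexHeatmap. The column names (`Sample`, `Type`, `Stage`), the helper `per_column_color_maps`, and the `PALETTES` lists are hypothetical; the color lists are placeholders standing in for named colormaps such as Set1 and Dark2.

```python
import pandas as pd

# Placeholder palettes (assumption): stand-ins for colormaps like Set1 or Dark2.
PALETTES = [
    ["red", "blue", "green", "purple"],
    ["teal", "orange", "slateblue", "magenta"],
    ["gold", "brown", "pink", "gray"],
]


def per_column_color_maps(df, palettes=PALETTES):
    """Return {column: {value: color}}, computing unique values per column."""
    color_maps = {}
    for i, col in enumerate(df.columns):
        # Unique values of this column only, in order of first appearance.
        values = df[col].unique().tolist()
        palette = palettes[i % len(palettes)]
        if len(values) > len(palette):
            raise ValueError(f"palette too small for column {col!r}")
        color_maps[col] = dict(zip(values, palette))
    return color_maps


df = pd.DataFrame(
    [
        ["A", "Tumor", "Stage2"],
        ["B", "Normal", "Stage1"],
        ["C", "Tumor", "Stage3"],
        ["D", "Normal", "Stage2"],
    ],
    columns=["Sample", "Type", "Stage"],  # hypothetical column names
)
color_maps = per_column_color_maps(df)
print(color_maps["Type"])
print(color_maps["Stage"])
```

Each column ends up with its own value-to-color mapping, which is why this approach implies one legend per column.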

faridrashidi commented 1 year ago

I see. I think that's hard to handle, or at least there's no clean way to do it, because then you have to add multiple legends.

DingWB commented 1 year ago

Oh, actually, it has already been implemented in the current version. Please see the following example:

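import matplotlib.pyplot as plt
import pandas as pd
from PyComplexHeatmap import *

data = pd.DataFrame(
    [
        [0, 1],
        [1, 0],
        [0, 1],
        [1, 0],
        [0, 1],
    ]
)
data.columns = ['Col1', 'Col2']
# Convert the values to strings so they act as categorical labels
data.Col1 = data.Col1.astype(str)
data.Col2 = data.Col2.astype(str)
print(data)

plt.figure(figsize=(8, 4))
# One anno_simple per column, each with its own legend
col_ha = HeatmapAnnotation(
    Col1=anno_simple(data.Col1, add_text=True),
    Col2=anno_simple(data.Col2, add_text=True),
    plot=True, legend=True, axis=1,
    legend_gap=5, orientation='up', hspace=0.1,
)
plt.show()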

Please refer to this documentation for more examples.

faridrashidi commented 1 year ago

Ok. Regarding your suggestion (3): to clarify, I assumed that if cmap is a list, the data is categorical, and if it is a string, the data is continuous. However, I'm not entirely certain how to distinguish between continuous and categorical data when cmap is given as a string.

DingWB commented 1 year ago

How about we recognize the dtypes automatically using df.dtypes? If str is included in df.dtypes, we treat the whole dataframe as categorical. Otherwise, the dataframe is treated as continuous.
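
A rough sketch of that dtype check, assuming plain pandas. The helper name `is_categorical_frame` is hypothetical. Because pandas usually reports string columns as `object` dtype rather than `str`, the sketch tests for any non-numeric column instead of looking for `str` literally.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def is_categorical_frame(df):
    """Treat the whole frame as categorical if any column is non-numeric."""
    return any(not is_numeric_dtype(dtype) for dtype in df.dtypes)


numeric = pd.DataFrame([[0, 1], [1, 0]], columns=["Col1", "Col2"])
strings = numeric.astype(str)  # same values, stored as strings

print(is_categorical_frame(numeric))
print(is_categorical_frame(strings))
```

Under this rule, 0/1 integer data counts as continuous unless it is first cast to strings, which is what the `astype(str)` calls in the earlier example do.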

RamRS commented 1 year ago

For example, if you have the following data frame:
[
[A, Tumor, Stage2],
[B, Normal, Stage1],
[C, Tumor, Stage3],
[D, Normal, Stage2]
]
We have 3 columns here, but we do not want to use the same colormap for those 3 columns. In this case, the unique values should not be [A, B, C, D, Tumor, Normal, Stage1, Stage2, Stage3]. We should compute the unique values separately for each column and use a different colormap for each (e.g. Set1, Dark2).

I disagree with this. I use categorical data points in heatmaps, and do not wish to have per-variable value legends. For example, in breast cancer, I need Positive/Negative for ER/PR/HER2 status. All I need is one color for positive and one for negative across the board. Someone who needs Tumor/Stage2 to be different from Normal/Stage2 needs to combine those 2 variables into the same data point or represent each of the variables differently, like ggplot would do: shape for one and color for the other, for example.

DingWB commented 1 year ago

I see. Thanks for your feedback. Do you have any idea how to implement it?

RamRS commented 1 year ago

I think one way would be to ensure the df is all categorical or all continuous, then if all categorical, flatten to a 1D array and get unique elements, then ensure all unique elements are accounted for in the legend/color mapping. Once that's done, plotting should be straightforward IMO.

I do not know Python well (I don't use pandas or matplotlib), so I could well be missing something here.
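
The steps described above (check that the frame is uniformly categorical, flatten it, take the unique values, and give each one a color) could be sketched in pandas roughly as follows. The helper `shared_color_map`, the column names, and the color list are hypothetical, and this is not how PyComplexHeatmap implements it.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def shared_color_map(df, colors):
    """Map every unique value in an all-categorical frame to one color.

    Returns None for an all-numeric (continuous) frame and raises on a mix.
    """
    numeric_flags = [is_numeric_dtype(dtype) for dtype in df.dtypes]
    if any(numeric_flags) and not all(numeric_flags):
        raise ValueError("data frame mixes categorical and continuous columns")
    if all(numeric_flags):
        return None  # continuous data: leave it to a continuous colormap
    # Flatten to 1D and keep unique values in order of first appearance.
    values = pd.unique(df.to_numpy().ravel()).tolist()
    if len(values) > len(colors):
        raise ValueError("not enough colors for all unique values")
    return dict(zip(values, colors))


status = pd.DataFrame(
    {
        "ER": ["Positive", "Negative", "Positive"],
        "PR": ["Negative", "Negative", "Positive"],
        "HER2": ["Negative", "Positive", "Negative"],
    }
)
legend = shared_color_map(status, ["red", "blue"])
print(legend)
```

A single mapping like this yields one legend shared by all columns, which matches the Positive/Negative use case.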

DingWB commented 1 year ago

I closed this pull request as I added a new module oncoPrint to the latest version (1.3.9). You can use oncoPrint to plot categorical data.