XuegongLab / HGC

fast hierarchical clustering for large-scale single-cell data
8 stars 5 forks source link

Invalid hclust merge when applying HCG on iNMF reduction matrix #4

Closed Fan-iX closed 1 year ago

Fan-iX commented 1 year ago

I meet with a problem when applying HGC on iNMF reduction matrix produced from liger:

inmf_reduction (csv) is a 44707×30 matrix

inmf_reduction<-read.csv("inmf_reduction.csv",row.names=1)
head(inmf_reduction)
                           iNMF_1      iNMF_2       iNMF_3       iNMF_4      iNMF_5       iNMF_6       iNMF_7       iNMF_8       iNMF_9      iNMF_10
Sp1_AAACCCAAGGGTATAT 8.164789e-05 0.034269749 1.785032e-06 0.0108770791 0.001278416 0.0001371894 1.379626e-05 0.0151301770 0.000000e+00 0.0006967699
Sp1_AAACCCAAGGTTCATC 1.128963e-05 0.005227889 0.000000e+00 0.0000000000 0.001564796 0.0001070065 6.425552e-03 0.0263891805 4.376199e-03 0.0000000000
Sp1_AAACCCACATACAGGG 9.636863e-04 0.006381059 5.044273e-06 0.0000000000 0.018281656 0.0025857656 0.000000e+00 0.0163463670 5.876425e-05 0.0028776468
Sp1_AAACCCACATCTTAGG 1.934265e-05 0.024496288 1.801177e-07 0.0003104039 0.000000000 0.0023395675 5.914556e-05 0.0003487214 0.000000e+00 0.0006140812
Sp1_AAACCCAGTATGGAGC 1.128963e-05 0.000000000 0.000000e+00 0.0000000000 0.012151908 0.0001070065 2.078746e-05 0.0185663557 1.757779e-02 0.0055571685
Sp1_AAACCCAGTGGCAGAT 1.128963e-05 0.006707858 0.000000e+00 0.0021757850 0.006537973 0.0016691463 1.850984e-03 0.0132309002 0.000000e+00 0.0013520569
                          iNMF_11     iNMF_12      iNMF_13      iNMF_14      iNMF_15      iNMF_16      iNMF_17     iNMF_18      iNMF_19      iNMF_20
Sp1_AAACCCAAGGGTATAT 0.0006521757 0.016658976 1.918434e-02 0.0354629859 0.0040416400 0.0042927028 5.218977e-07 0.002251353 3.831437e-04 1.489017e-06
Sp1_AAACCCAAGGTTCATC 0.0165076125 0.043522245 6.919273e-05 0.0144072126 0.0000000000 0.0108909556 2.584514e-06 0.001210561 2.434510e-04 1.122687e-03
Sp1_AAACCCACATACAGGG 0.0003425262 0.055064168 2.539523e-05 0.0004233692 0.0005903107 0.0084124759 0.000000e+00 0.035585981 3.445103e-04 0.000000e+00
Sp1_AAACCCACATCTTAGG 0.0103358743 0.000000000 3.503810e-03 0.0010784095 0.0009764907 0.0003771787 2.515179e-05 0.060048242 8.780525e-05 1.940847e-03
Sp1_AAACCCAGTATGGAGC 0.0203467424 0.008803649 8.236975e-03 0.0154844767 0.0000000000 0.0008482288 2.584514e-06 0.003812347 2.434510e-04 9.823728e-05
Sp1_AAACCCAGTGGCAGAT 0.0031628305 0.045095806 4.715061e-03 0.0082732912 0.0020627070 0.0008482288 2.584514e-06 0.046753821 4.413664e-04 0.000000e+00
                          iNMF_21      iNMF_22      iNMF_23      iNMF_24     iNMF_25      iNMF_26     iNMF_27     iNMF_28     iNMF_29      iNMF_30
Sp1_AAACCCAAGGGTATAT 0.0001568022 3.024277e-05 0.000000e+00 0.000000e+00 0.022783181 0.0000000000 0.000000000 0.000000000 0.001587582 1.690898e-06
Sp1_AAACCCAAGGTTCATC 0.0040136850 1.673711e-02 0.000000e+00 3.281007e-08 0.000250313 0.0000000000 0.033617251 0.000000000 0.000000000 0.000000e+00
Sp1_AAACCCACATACAGGG 0.0245308743 2.730180e-03 0.000000e+00 3.365313e-03 0.002721621 0.0002730902 0.004930925 0.000000000 0.000000000 1.966945e-06
Sp1_AAACCCACATCTTAGG 0.0000000000 0.000000e+00 0.000000e+00 4.151699e-06 0.000000000 0.0000000000 0.009290096 0.002972926 0.000000000 5.180654e-07
Sp1_AAACCCAGTATGGAGC 0.0036710963 1.888955e-05 6.221366e-05 1.020160e-03 0.000250313 0.0000000000 0.029002092 0.001931593 0.002331859 0.000000e+00
Sp1_AAACCCAGTGGCAGAT 0.0161980582 1.804841e-03 0.000000e+00 3.281007e-08 0.002741882 0.0000000000 0.027027089 0.000000000 0.000000000 0.000000e+00

I create a hierarchical clustering tree from the reduction matrix as following:

G <- HGC::SNN.Construction(inmf_reduction, k = 25, threshold = 1/15)
rownames(G) <- colnames(G) <- rownames(inmf_reduction)
hc <- HGC::HGC.dendrogram(G)

Then there seems to be something wrong with the clustering tree:

df <- as.data.frame(hc$merge)
df$n <- 1:nrow(df)
df[df$V1 > df$n | df$V2 > df$n, ]
          V1    V2     n
13500 -28367 13504 13500
13501 -42048 13505 13501

The merge component of a hclust object shows at each step which two items are combined, so normally the value in each row (V1 and V2) should be smaller than the row number (n). Here we get some n < V2 in the tree, which means that in step 13500, we sould combine leaf node 28367 with the node produced in (a future) step 13504. This leads to a "non-monotonic" tree, and causes problem when plotting:

plot(hc)
Error in cl[[2L]]: subscript out of bounds
Traceback:

1. plot(hc)
2. plot.hclust(hc)
3. deparse(cl[[2L]])

Is this an expected behavior? Can I apply HGC to an iNMF reduction matrix?

zouzh14 commented 1 year ago

Thanks for the reply! I will repeat the experiment with the data and check the output soon.

zouzh14 commented 1 year ago

Dear Fan-iX, the problem happens when re-ordering the dendrogram. The current version is based on an intrusive design that ordering the merging steps with increasing distances between leaves. I am adding a step to check the dendrogram before output.

So the problem is about the re-ordering function and not directed to the iNMF data.

zouzh14 commented 1 year ago

Hi @Fan-iX, here is an update of the packages to fix the ordering bug I mentioned above, and it past the test of your iNMF reduction matrix in our environment. Could you try it when you are avilable?

Here is the link of the repaired package

https://drive.google.com/file/d/1ViWTnF0dKNs5L5KfD0QaSINflWmO5JrG/view?usp=sharing. 

Also the bug will be fixed in the next update of HGC package, thanks for providing the valuable instance.

Fan-iX commented 1 year ago

This version works for other iNMF matrixes in my environment as well. Thanks all!

Fan-iX commented 1 year ago

Hello, @zouzh14 , There is a small problem with your fixed version of HGC: the height component of the hclust is not monotonic.

# the same `hc` object created from inmf_reduction.csv, as above
head(rank(hc$height),n=50)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 47 48 49 50 51 52 # missing 27, 46

The "missing" ranks are at

data.frame(height=hc$height,rank=rank(hc$height))[14385:14395,]
            height  rank
14385 3.126114e-03 42015
14386 3.162925e-03 42051
14387 3.186890e-03 42075
14388 1.615852e-05    27
14389 2.233029e-05    46
14390 2.411827e-05    59
14391 2.559428e-05    67
14392 2.569810e-05    68
14393 2.668064e-05    74
14394 2.930629e-05    85
14395 3.023655e-05    91

, showing its non-monotonicity.

Usually, a "native" hclust object (created by the clust function) should have a monotonically increasing height. Now, I can use

hc <- as.hclust(as.dendrogram(hc))

to make it monotonic, but it takes a long time and a lot of unnecessary computation.

I expect HGC could produce monotonic hclust object natively as before.

Thank you!

Fan-iX commented 1 year ago

I realized that the tree is merged according to the branch size in the fixed version. This provides some valuable information for my downstream analysis, so it is acceptable for me.

But it would be nice if you could add an argument to specify the output order for the function, since both ordering makes sense:

zouzh14 commented 1 year ago

Thx for sharing the experience! Sure the tree structure itself does not change that we expect the clustering results in each layer keep same. And the prime change is the output ordering of leaves and middle nodes. I will find a way to deal with it in next version.