GeoDaCenter / geoda

GeoDa: An introduction to spatial data analysis
http://geodacenter.github.io
GNU General Public License v3.0
680 stars 155 forks source link

The question about the redcap algorithm #2467

Open JagnDC opened 7 months ago

JagnDC commented 7 months ago

I would like to ask a question about the REDCAP algorithm, I wrote the REDCAP algorithm according to the paper, but the calculation results seem to be different from the results of geoda, and some problems are found through comparison.

For FullOrder-AverageLinkage, my data property values are as follows:

[[6 5 4]
[8 9 3]
[5 9 0]
[8 7 8]
[4 7 1]]

The initial adjacency relation is as follows.

[[0.1.1.1.1.]
[1.0.1.1 1.0.]
[1.1.0.0.1.]
[1.1.0.0.0.]
[1.0.1.0.0.]]

I used this to cluster five nodes and set the number of clusters to two. My code calculates the following: {0: [0, 4, 1, 2], 1: [3]} geoda calculates the following: {0: [0, 1, 3], 1: [2, 4]}

I didn't know which of these calculations was correct, so I looked at the spanning tree, and I found that my algorithm built the following tree adjacency:

[[0.0.0.0.1.]
[0. 0. 1.
[0.1.0.0.1.]
[0.1. 0. 0.]
[1.0.1.0.0.]]

geoda's tree is:

0 5 "clusters" POLY_ID
3 5 6
5 3 6
1 2 21
21 21
1 5 17
5 1 17
1 4 24
4 1 24

After manual calculation, I found that according to my understanding, the results of the paper should be consistent with the results of my algorithm. May I ask if there is a problem with my calculation results? Or maybe there's something wrong in some other place that I don't understand. Thanks to the Author

lixun910 commented 7 months ago

Thanks for your question! @JagnDC Can you post your step-by-step manual calculation so we can check what might be the problem? You can see a step-by-step manual calculation using FullOrder-CompleteLinkage in our GeoDa workbook: https://geodacenter.github.io/workbook/9c_spatial3/lab9c.html#appendix

JagnDC commented 7 months ago

image Thank you for your reply. This is my example of the spanning tree construction process diagram. I don’t know if I calculated it correctly, but I checked it many times. Here, I use the FullOrder-AverageLinkage

lixun910 commented 7 months ago

Can you also post how did you update the centroid (with average linkage) of the merged cluster after making a connection? I didn't see it in your attached image. You can also post the dendrogram of the hierarchical clustering of average linkage, which helps to explain the calculation (like in the GeoDa documentation I posted). Thanks!

JagnDC commented 7 months ago

Thank you for your reply.

First we need to update the result above. In the last step I made a mistake, but the conclusion remains the same.

image

In addition, here are the distance matrix of each step and the result of merging. I don't know if you can understand my meaning

image

JagnDC commented 7 months ago

By the way, when I set the number of clusters to 1, will the spanning tree I get be a complete spanning tree? image

lixun910 commented 7 months ago

The calculation doesn't look correct to me: after merging 2 and 4, why all the sudden 3 has connection to 2?

Yes, It's better to post a screenshot of the dendrogram of SCHC, see in GeoDa documentation: Thanks!

A careful consideration of the various REDCAP algorithms reveals that they essentially consist of three steps. First, a dendrogram for contiguity constrained hierarchical clustering is constructed, using the given linkage function. This yields the exact same dendrogram as produced by SCHC. Next, this dendrogram is turned into a spanning tree, using standard graph manipulation principles. Finally, the optimal cuts in the spanning tree are obtained using the same logic (and computations) as in SKATER, up to the desired level of k.

JagnDC commented 7 months ago

image

Here's the dendrogram I drew and the SCHC dendrogram, and I've also labeled the order of the edges.

To be clear, I'm just building a spanning tree here.

For your question, you can see my clustering on the right side of the distance matrix diagram. Since 2 and 4 are merged into one cluster, cluster 2 in state 2 represents the set of cluster 2 and cluster 4 in state 1. When updating the distance between cluster 2 and other clusters in the calculation process, all distances are calculated. I devised a separate step to determine whether clusters are adjacent.

To avoid misunderstandings, I have updated the process plot of the distance matrix here.

image

lixun910 commented 7 months ago

Thanks for updating the screenshot! Yes, the SCHC represents the results in Redcap in GeoDa. In your state 2, how do you get 8.41861? since 3 is not connected to neither 2 or 4 from state 1.

On Nov 23, 2023, at 8:04 PM, dingchen @.***> wrote:



[image]https://urldefense.com/v3/__https://user-images.githubusercontent.com/20790537/285346690-6540311c-70b0-48bf-8e6b-45287f09d2b0.png__;!!BpyFHLRN4TMTrA!6b1peUbbLTzH4DmFKqM6021St7nGtLShyKTFD5deGFwYmmLKrRBjgP9BIswTVCOv_Sd1FNYjzJv1PmuIkfM2JrhiLA$

Here's the dendrogram I drew and the SCHC dendrogram, and I've also labeled the order of the edges.

To be clear, I'm just building a spanning tree here.

For your question, you can see my clustering on the right side of the distance matrix diagram. Since 2 and 4 are merged into one cluster, cluster 2 in state 2 represents the set of cluster 2 and cluster 4 in state 1. When updating the distance between cluster 2 and other clusters in the calculation process, all distances are calculated. I devised a separate step to determine whether clusters are adjacent. [image]https://urldefense.com/v3/__https://user-images.githubusercontent.com/20790537/285347433-c9fd998f-e7ee-4a2c-8833-c240179fdeba.png__;!!BpyFHLRN4TMTrA!6b1peUbbLTzH4DmFKqM6021St7nGtLShyKTFD5deGFwYmmLKrRBjgP9BIswTVCOv_Sd1FNYjzJv1PmuIkfNSKOOmzg$

— Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/GeoDaCenter/geoda/issues/2467*issuecomment-1825093881__;Iw!!BpyFHLRN4TMTrA!6b1peUbbLTzH4DmFKqM6021St7nGtLShyKTFD5deGFwYmmLKrRBjgP9BIswTVCOv_Sd1FNYjzJv1PmuIkfNkNfW6aQ$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AASPYTZ7EWYPQL7565UMSODYGAFEJAVCNFSM6AAAAAA7NTOWP2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMRVGA4TGOBYGE__;!!BpyFHLRN4TMTrA!6b1peUbbLTzH4DmFKqM6021St7nGtLShyKTFD5deGFwYmmLKrRBjgP9BIswTVCOv_Sd1FNYjzJv1PmuIkfNKae2lwg$. You are receiving this because you commented.Message ID: @.***>

JagnDC commented 7 months ago

Thank you for your reply. For your question, the average of all distances between two clusters I calculated here, because it is the calculation between feature values, so no judgment of adjacency is made, and constraints are made in the subsequent calculation process, so the final calculation results will not be affected.

However, I would like to thank you for reminding me. I have optimized my calculation process and reduced unnecessary calculation process.

lixun910 commented 7 months ago

You are welcome! Just noticed another one: the 4.57081 values in state 2 seems not correct.

All these calculations will impact the final results. The SCHC is a special case of classic hierarchical clustering with fully connected weights. The implementation is identical to other clustering libraries.

On Nov 23, 2023, at 9:38 PM, dingchen @.***> wrote:



Thank you for your reply. For your question, the average of all distances between two clusters I calculated here, because it is the calculation between feature values, so no judgment of adjacency is made, and constraints are made in the subsequent calculation process, so the final calculation results will not be affected.

However, I would like to thank you for reminding me. I have optimized my calculation process and reduced unnecessary calculation process.

— Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/GeoDaCenter/geoda/issues/2467*issuecomment-1825138841__;Iw!!BpyFHLRN4TMTrA!9awoiDVpcRBA_hp4zAdo48tEoQlB8Sm5cFrUEEYBD_Aj8z1T8sEVcX7fGhMP-SQEABJcZrrVVNu2C49KFQDzWQ2h0g$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AASPYT7MAXCT3DEDLDZNS23YGAQFXAVCNFSM6AAAAAA7NTOWP2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMRVGEZTQOBUGE__;!!BpyFHLRN4TMTrA!9awoiDVpcRBA_hp4zAdo48tEoQlB8Sm5cFrUEEYBD_Aj8z1T8sEVcX7fGhMP-SQEABJcZrrVVNu2C49KFQBNv82aTA$. You are receiving this because you commented.Message ID: @.***>

JagnDC commented 7 months ago

Thank you for your reply. I see what you mean. I want to briefly explain my calculation process.

According to the full pattern in the original article, when I merge nodes 2,4 in state 1, I should calculate the distance between the other clusters, which are cluster 0 and cluster 1. You are confused about the distance between cluster 1 and cluster 2

dis(node 1, node 2)=√(9+0+9)=√18
dis(node 1, node 4)= √(16+4+4)=√24
avgdis(cluster 1, cluster 2)= (√18+ √24)/2=4.5708

dis(node 0, node 1)=√21
avgdis(cluster 0, cluster 2)= √21/1=4.5826

avgdis(cluster 1, cluster 2) < avgdis(cluster 0, cluster 2)
lixun910 commented 7 months ago

I am confused: among node 1-5, you are merging node 3 and 5 at first step from your screenshot. This is corresponding to row/column 2 and 4 in state 0 table. It should be simply the average of the two columns/rows. In state 1 table, 4.93383 is correct, and 4.57081 is not correct and some others are not correct as well. Please check and state carefully.

On Nov 23, 2023, at 10:45 PM, dingchen @.***> wrote:



Thank you for your reply. I see what you mean. I want to briefly explain my calculation process.

According to the full pattern in the original article, when I merge nodes 2,4 in state 1, I should calculate the distance between the other clusters, which are cluster 0 and cluster 1. You are confused about the distance between cluster 1 and cluster 2

dis(node 1, node 2)=√(9+0+9)=√18 dis(node 1, node 4)= √(16+4+4)=√24

avgdis(cluster 1, cluster 2)= √18+ √24=4.5708

— Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/GeoDaCenter/geoda/issues/2467*issuecomment-1825175890__;Iw!!BpyFHLRN4TMTrA!81WITMV5Zf3pokGKGavnV60tbxBtaJ6jlH1cnZLAcBau5QDZh39eAXglly0_aiytkw654W1W_6B3jAL5CNIU3EvlcA$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AASPYT3U2CNQVILOWJR4HSDYGAX5RAVCNFSM6AAAAAA7NTOWP2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMRVGE3TKOBZGA__;!!BpyFHLRN4TMTrA!81WITMV5Zf3pokGKGavnV60tbxBtaJ6jlH1cnZLAcBau5QDZh39eAXglly0_aiytkw654W1W_6B3jAL5CNIdLZ6Brg$. You are receiving this because you commented.Message ID: @.***>

JagnDC commented 7 months ago

image Thank you for your reply. I think my expression is not clear enough, so I updated the graph of distance matrix and refer to the format of your GeoDa documentation. As you can see in this graph, 4.57081 is the average of 4.2426 and 4.8989. I have repeated the calculation here many times, I think it should be no problem.

lixun910 commented 7 months ago

This is not correct: the state 0 table has values on all non-diagonal cells, which means each node connects with every other nodes. Please check carefully.

On Nov 23, 2023, at 11:38 PM, dingchen @.***> wrote:



[image]https://urldefense.com/v3/__https://user-images.githubusercontent.com/20790537/285377563-37dac79d-9189-4352-9f2a-d95ae3acf465.png__;!!BpyFHLRN4TMTrA!--KgXj2u4tSk03Hw8mgnXCg1dSpt5JMx7ovz8sJt7eS-ThZ4Jm5iZ6P16thOeRKJQa12PxSN1hG0RoDEv2f_EHL5-w$ Thank you for your reply. I think my expression is not clear enough, so I updated the graph of distance matrix and refer to the format of your GeoDa documentation. As you can see in this graph, 4.57081 is the average of 4.2426 and 4.8989. I have repeated the calculation here many times, I think it should be no problem.

— Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/GeoDaCenter/geoda/issues/2467*issuecomment-1825210514__;Iw!!BpyFHLRN4TMTrA!--KgXj2u4tSk03Hw8mgnXCg1dSpt5JMx7ovz8sJt7eS-ThZ4Jm5iZ6P16thOeRKJQa12PxSN1hG0RoDEv2crc7rujA$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AASPYT2A5OTZNIT3KTH365DYGA6E5AVCNFSM6AAAAAA7NTOWP2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMRVGIYTANJRGQ__;!!BpyFHLRN4TMTrA!--KgXj2u4tSk03Hw8mgnXCg1dSpt5JMx7ovz8sJt7eS-ThZ4Jm5iZ6P16thOeRKJQa12PxSN1hG0RoDEv2c7Qyj-TA$. You are receiving this because you commented.Message ID: @.***>

JagnDC commented 7 months ago

image Thank you for your reply. What I refer to here is the composition of this matrix diagram in the GeoDa documentation. What I marked in the red box is the distance of adjacency relationship, the non-red box is the non-adjacency, and the yellow background color is the minimum distance of adjacent nodes as the merge target.

lixun910 commented 7 months ago

I see. Can you send me the files you used in GeoDa? So I can check the details. Thanks!

JagnDC commented 7 months ago

test_data.zip Thank you very much for your attention and answer. Here is my data, where feature1 and feature2 are the location coordinates, feature3-5 are the feature columns used, and clusters is the result of my clustering.

Feel free to contact me if you have any new findings or questions!