Closed · bw4sz closed this issue 6 years ago
hi @bw4sz. In diceR, the labels that result from each clustering are matched to one another before voting takes place. This is done using the Hungarian algorithm, which essentially cross-tabulates the cluster assignments from any two algorithms and optimizes the order of the rows and columns to maximize the elements on the diagonal. If the assignments across clusterings were identical, this would be easy: simply shift rows and columns until all observations lie on the diagonal.
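The matching step can be sketched outside of R. The following is not diceR's implementation, just a minimal Python stand-in (function name is mine) that brute-forces the same objective, maximizing the diagonal of the contingency table; the Hungarian algorithm solves the identical assignment problem efficiently, while brute force is only feasible for a handful of clusters:

```python
from collections import Counter
from itertools import permutations

def relabel(reference, labels):
    """Permute the label set of `labels` to agree with `reference` as
    much as possible: build the contingency table between the two label
    vectors, then search all permutations of the labels for the one
    that maximizes the diagonal."""
    ks = sorted(set(reference) | set(labels))
    # contingency[(r, c)] = number of observations with reference label r
    # and candidate label c
    contingency = Counter(zip(reference, labels))
    best_diag, best_map = -1, None
    for perm in permutations(ks):
        mapping = dict(zip(ks, perm))  # candidate relabelling: old -> new
        diag = sum(contingency[(mapping[c], c)] for c in ks)
        if diag > best_diag:
            best_diag, best_map = diag, mapping
    return [best_map[c] for c in labels]

# Two clusterings that agree perfectly up to label names:
print(relabel([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1]))  # -> [0, 0, 1, 1, 2, 2]
```

When agreement is perfect, the relabelled column becomes identical to the reference; with weaker agreement, the permutation only maximizes overlap, which is point (1) below.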
I can think of a few reasons the algorithm could be sensitive to the reference clustering: (1) the agreement between cluster assignments is weak, so it is difficult to optimize the diagonal; (2) the number of classes is not the same across clusterings. Have you investigated these?
Hi @AlineTalhouk, thanks for taking the time to respond. It sounds like I'm using the method incorrectly, since (2) is definitely true. Every method produces a different number of clusters; while not wildly different, there is no constraint on cluster size. Let me know if this is a fatal flaw in using the diceR methods.
@bw4sz The problem when the number of clusters differs between clusterings is that cases in one cluster (from one algorithm) can be split across two or more clusters in another. Label matching becomes hopeless.
Have you considered the other ensemble methods in diceR? CSPA, for example, will likely not encounter this problem.
The majority voting in diceR operates at the algorithm level for a fixed cluster size. If you cannot constrain the cluster size, then this ensemble method is not appropriate for your use case.
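Concretely, once the columns share a common label set, the vote itself is just a per-observation mode. A toy sketch of that step, not diceR's code (the function name is mine):

```python
from collections import Counter

def majority_vote(rows):
    """Consensus after relabelling: each row holds one observation's
    cluster label under each algorithm (all columns already on a common
    label set); the consensus label is the most frequent one per row."""
    return [Counter(row).most_common(1)[0][0] for row in rows]

# Three algorithms, four observations (one row per observation):
print(majority_vote([[1, 1, 2], [2, 2, 2], [3, 1, 3], [1, 3, 3]]))  # -> [1, 2, 3, 3]
```

This makes the fixed-cluster-size requirement visible: the vote is only meaningful when label `k` means the same cluster in every column, which is exactly what a variable number of clusters breaks.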
I am developing an R package for segmenting lidar point clouds into trees for forestry scenes. I think the diceR package will be a huge benefit to our approaches. We have multiple segmentation methods, each of which predicts the group membership of each point. The total number of groups is not equal across segmentation methods. I believe this means I need to use the is.relabelled = FALSE flag in majority_voting so that the columns are relabelled for me.
I have a working version of majority voting implemented, but I'm finding that the order of the data makes a large difference in the final solution. I can see in the source that the is.relabelled flag controls whether the other columns are assigned an identity using a Hungarian assignment that minimizes the distances among labels.
Below is a reproducible example (if a bit heavy). Hopefully the plots will make this clear. For viewing, I'm just showing the spatial convex polygons that define the clustering of points.
Yields
These are 4 spatial clustering results from which we hope to create a consensus.
Attempt 1
This creates unrealistic clusters, combining clusters from far across the space.
Change order of inputs
This creates a more satisfactory result. This seems to be because the li method (first column in attempt #2) has a higher number of potential clusters, which allows the method to find a more optimal solution of n = 259, whereas the silva method (first column in the first attempt) has only 250 clusters, and the Hungarian assignment therefore limits the total number of clusters to 250.
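A toy illustration of why the first column caps the consensus. This is my own greedy stand-in for a rectangular Hungarian assignment, not diceR's code: when every label must be mapped onto some label of the reference column, a column with more clusters than the reference necessarily has two of its clusters collapse onto one, so the consensus can never use more labels than the reference provides.

```python
from collections import Counter

def relabel_to_reference(reference, labels):
    """Map each label in `labels` onto the reference cluster it overlaps
    most (a greedy stand-in for a rectangular Hungarian assignment).
    If `labels` has more clusters than `reference`, some of its
    clusters must collapse onto the same reference label."""
    contingency = Counter(zip(reference, labels))
    mapping = {
        c: max(sorted(set(reference)), key=lambda r: contingency[(r, c)])
        for c in set(labels)
    }
    return [mapping[c] for c in labels]

# A 2-cluster reference forces a 3-cluster column down to 2 labels:
print(relabel_to_reference([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 2]))  # -> [0, 0, 1, 1, 1, 1]
```

Reversing the roles (the 3-cluster column as reference) would preserve all three labels, which mirrors the silva-first vs. li-first behavior described above: putting the column with the most clusters first leaves the most room for the consensus.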
Are there recommendations for how to design the input matrix to majority_voting to maximize its utility?
Source for consensus method