AlineTalhouk / diceR

Diverse Cluster Ensemble in R
https://alinetalhouk.github.io/diceR/

How to choose which clustering to use as 'reference column' in majority_voting? #128

Closed bw4sz closed 6 years ago

bw4sz commented 6 years ago

I am developing an R package for segmenting lidar point clouds into trees for forestry scenes. I think the diceR package will be a huge benefit to our approaches. We have multiple segmentation methods, each of which predicts the group membership of each point. The total number of groups is not equal across segmentation methods. I believe this means I need to use the is.relabelled = TRUE flag in majority_voting.

I have a working version of majority voting implemented, but I'm finding that the order of the data makes a large difference in the final solution. I can see in the source that the is.relabelled flag means the other columns will be assigned an identity using a Hungarian assignment that minimizes the distances among labels.

Below is a reproducible example (if a bit heavy). Hopefully the plots will make this clear. For viewing, I'm just showing the spatial convex polygons that define the clustering of points.

library(lidR)
library(devtools)
install_github("Weecology/TreeSegmentation")
library(TreeSegmentation)

# Load toy data
LASfile <- system.file("extdata", "MixedConifer.laz", package = "lidR")
las <- readLAS(LASfile)

# Segmentation methods (takes about 2 minutes)
silva <- silva2016(tile = las, output = "all")
dalponte <- dalponte2016(tile = las, output = "all")
li <- li2012(tile = las, output = "all")
watershed_result <- watershed(tile = las, output = "all")

# View spatial clustering
plot(silva$convex)
plot(dalponte$convex)
plot(li$convex)
plot(watershed_result$convex)

Yields:

[Images: convex hull plots for the silva, dalponte, li, and watershed segmentations]

These are 4 spatial clustering results from which we hope to create a consensus.

Attempt 1

ptlist <- list(silva = silva$tile, dalponte = dalponte$tile,
               li = li$tile, watershed = watershed_result$tile)
consensus_result <- consensus(ptlist = ptlist, method = "majority")
consensus_polygons <- get_convex_hulls(consensus_result, consensus_result@data$treeID)
plot(consensus_polygons)

> paste(length(consensus_polygons), "consensus clusters found")
[1] "250 consensus clusters found"
> length(silva$convex)
[1] 250

[Image: convex hulls of the consensus clustering from attempt 1]

This creates unrealistic clusters, with clusters combined from locations far apart in space.

Change order of inputs

ptlist <- list(li = li$tile, watershed = watershed_result$tile,
               silva = silva$tile, dalponte = dalponte$tile)
consensus_result <- consensus(ptlist = ptlist, method = "majority")
consensus_polygons <- get_convex_hulls(consensus_result, consensus_result@data$treeID)
plot(consensus_polygons)

> length(li$convex)
[1] 381
> paste(length(consensus_polygons), "consensus clusters found")
[1] "259 consensus clusters found"

[Image: convex hulls of the consensus clustering from attempt 2]

This creates a more satisfactory result. This seems to be because the li method (the first column in attempt 2) has a higher number of potential clusters, which allows the method to find a more optimal solution of n = 259. Whereas the silva method (the first column in the first attempt) has only 250 clusters, so the Hungarian assignment limits the total number of clusters to 250.
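Based on this reasoning, one workaround I'm considering is simply ordering the inputs so that the method with the most distinct clusters comes first (a sketch only, assuming each $tile carries its point labels in @data$treeID, as the consensus result appears to):

# Count the distinct cluster labels each segmentation method produced
ptlist <- list(silva = silva$tile, dalponte = dalponte$tile,
               li = li$tile, watershed = watershed_result$tile)
n_clusters <- sapply(ptlist, function(x) length(unique(x@data$treeID)))
n_clusters

# Put the method with the most clusters first so it becomes the
# reference column for the Hungarian relabelling
ptlist <- ptlist[order(n_clusters, decreasing = TRUE)]
consensus_result <- consensus(ptlist = ptlist, method = "majority")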

Are there recommendations for how to design the input matrix to majority_voting to maximize its utility?

Source for consensus method

AlineTalhouk commented 6 years ago

Hi @bw4sz. In diceR, the labels that result from each clustering are matched to one another before voting takes place. This is done using the Hungarian algorithm, which essentially cross-tabulates the cluster assignments from any two algorithms and optimizes the order of the rows and columns to maximize the elements on the diagonal. If the assignments across clusterings are identical, this is easy: simply shift rows and columns so that all observations lie on the diagonal.
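As a toy illustration (made-up labels, not your data), the cross-tabulation and relabelling look roughly like this, using relabel_class from diceR:

library(diceR)

# Two clusterings of the same six observations that agree in structure
# but use different label codes
ref  <- c(1, 1, 2, 2, 3, 3)
pred <- c(2, 2, 3, 3, 1, 1)

table(pred, ref)                    # counts sit off the diagonal
pred_matched <- relabel_class(pred, ref)
table(pred_matched, ref)            # after matching, counts lie on the diagonal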

I can think of a few reasons why the algorithm could be sensitive to the reference clustering: (1) the agreement between cluster assignments is weak, so it is difficult to optimize the diagonal; (2) the number of classes is not the same across clusterings. Have you investigated these?
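For example, both are quick to check (a rough sketch, where labels stands for a matrix with one column of cluster labels per method; adjustedRandIndex comes from the mclust package, but any agreement index would do):

library(mclust)   # for adjustedRandIndex

# (2) number of classes per clustering
apply(labels, 2, function(x) length(unique(x)))

# (1) pairwise agreement between clusterings
combn(ncol(labels), 2, function(ij)
  adjustedRandIndex(labels[, ij[1]], labels[, ij[2]]))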

bw4sz commented 6 years ago

Hi @AlineTalhouk, thanks for taking the time to respond. Clearly I'm not using the method correctly, since (2) is definitely true. Every method produces a different number of clusters. While they are not wildly different, there is no constraint on cluster size. Let me know if this is a fatal flaw in using the diceR methods.

AlineTalhouk commented 6 years ago

@bw4sz The problem when the number of clusters differs between clusterings is that cases in one cluster (from one algorithm) can be split across 2 or more clusters in another. Label matching becomes hopeless.
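A tiny made-up example of the split:

a <- c(1, 1, 1, 1)   # one algorithm puts all four points in a single cluster
b <- c(1, 1, 2, 2)   # another splits the same points into two clusters
table(a, b)
# No reordering of rows and columns puts all counts on a diagonal,
# so there is no clean one-to-one label correspondence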

Have you considered the other ensemble methods in diceR? CSPA, for example, will likely not encounter this problem.
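To give a flavour of the API (a sketch using the example data shipped with diceR rather than your point clouds; your labels would need to be arranged into the array format that consensus_cluster() produces):

library(diceR)

data(hgsc)   # example expression data included with diceR
CC <- consensus_cluster(hgsc[1:50, 1:20], nk = 4, reps = 5,
                        algorithms = c("hc", "km"), progress = FALSE)
cspa_labels <- CSPA(CC, k = 4)
table(cspa_labels)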

dchiu911 commented 6 years ago

The majority voting in diceR operates at the algorithm level for a fixed cluster size. If you cannot constrain the cluster size, then this ensemble method would not be appropriate for your use case.