SofieVG / FlowSOM

Using self-organizing maps for visualization and interpretation of cytometry data

Elbow plot variance calculation #46

Open Jeff87075 opened 2 years ago

Jeff87075 commented 2 years ago

Hi, I want to ask how the following formula for calculating the variance (that will be used in the elbow plot) is derived?

  c_wss <- 0
  for(j in seq_along(clustering)){
    if(sum(clustering == j) > 1){
      c_wss <- c_wss + (nrow(data[clustering == j, , drop = FALSE]) - 1) *
        sum(apply(data[clustering == j, , drop = FALSE], 2, stats::var))
    }
  }

I understand that the sum() part is calculating the within sum of squares but why does it have to be multiplied by what I assume is the degrees of freedom with the nrow() - 1? Thanks a lot!

SofieVG commented 2 years ago

Mm, I'm trying to remember. I would assume the main idea here was to take a weighted version (so larger clusters contributing more), I'm just not sure where the minus 1 is coming from, and whether this weighting with the number of datapoints is necessary in the first place... There certainly might be a mistake in this code, because it actually is not working that well, and typically when using FlowSOM we handpick the number of metaclusters rather than using this automated approach.
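
A possible explanation for the `nrow() - 1` factor, offered as a sketch rather than a statement of the original intent: `stats::var` divides the sum of squared deviations by n - 1, so multiplying a column variance by n - 1 gives back that column's sum of squares. Summed over columns and clusters, this is the usual within-cluster sum of squares. The toy data, the `direct_wss` name, and the loop over `unique(clustering)` below are assumptions for illustration and are not taken from FlowSOM.

```r
# Toy data (hypothetical): 6 points, 2 markers, 2 clusters
set.seed(1)
data <- matrix(rnorm(12), ncol = 2)
clustering <- c(1, 1, 1, 2, 2, 2)

# Same arithmetic as the snippet quoted in this issue,
# adapted to loop over the unique cluster labels
c_wss <- 0
for (j in unique(clustering)) {
  x <- data[clustering == j, , drop = FALSE]
  if (nrow(x) > 1) {
    c_wss <- c_wss + (nrow(x) - 1) * sum(apply(x, 2, stats::var))
  }
}

# Direct computation: squared deviations from each cluster's column means
direct_wss <- 0
for (j in unique(clustering)) {
  x <- data[clustering == j, , drop = FALSE]
  centered <- sweep(x, 2, colMeans(x))
  direct_wss <- direct_wss + sum(centered^2)
}

# The two quantities agree up to floating-point tolerance
stopifnot(isTRUE(all.equal(c_wss, direct_wss)))
```

Under this reading, the factor is not an extra weighting step: it undoes the division inside `var`, and larger clusters contribute more only because they contain more squared deviations.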

Jeff87075 commented 2 years ago

Ah I see, an automated approach certainly has its limitations. On the topic of the SOM algorithm: since I see that the FlowSOM package has its own code for performing the SOM, can I also ask what the major differences are between the SOM performed in FlowSOM and the SOM algorithm implemented in the kohonen package?

SofieVG commented 2 years ago

The FlowSOM package builds on the kohonen package as it was at the time, so in essence it will be exactly the same. However, the code has been simplified: some properties we did not expect to use (e.g. hexagonal or toroidal topologies) were removed, and some additional options were added (for example, we explored some different distance measures, although we keep using Euclidean distance most of the time).

