Add cluster analysis to R-Instat

rdstern commented 3 years ago

This is one of the multivariate methods mentioned in WMO 100. We already have the others, namely correlations, PCA, Canonical correlations, so this is the one tool missing.

It fits well, in that we are roughly limiting ourselves currently to descriptive methods.

There is a CRAN cluster analysis task view that lists 100 packages. Two are distributed with base R, namely stats and cluster.

There is a good case study here. It proposes that the package factoextra be used. Interestingly, that isn't in the list of packages in the task view. But their arguments seem sound and we already have factoextra in R-Instat. Also they use ggplot for the graphs - which of course we like! (Aha, the article is written by the author of factoextra, and he also has a book called "Practical Guide to Cluster Analysis in R".)

Update: I was concerned that factoextra was not in the task view, so I wrote to the author to enquire why. It is now added and I suggest we make use of it.

As background, there are 2 distinct forms of clustering:
a) Hierarchical clustering, which goes either up or down. You could start with each observation as its own cluster, then put the 2 closest together, and so on. Or start with everything in one big cluster and progressively move cases out.
b) Partitioning clustering, where you specify the number of clusters, e.g. 4, and then see which cases to put in which cluster.
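To make the two forms concrete, here is a minimal sketch using only functions from the stats package that ships with base R. The built-in USArrests data, the "average" linkage and the choice of 4 clusters are assumptions made purely for illustration, not choices made in this issue.

```r
# Illustrative data only: USArrests is a built-in dataset with 50 rows.
x <- scale(USArrests)

# a) Hierarchical (agglomerative): each row starts as its own cluster and
#    the closest clusters are merged step by step.
hc <- hclust(dist(x), method = "average")
hier_groups <- cutree(hc, k = 4)   # cut the tree into 4 groups afterwards

# b) Partitioning: the number of clusters (here 4) is fixed in advance.
set.seed(1)                        # kmeans uses random starting centres
km <- kmeans(x, centers = 4, nstart = 25)
part_groups <- km$cluster

table(hier_groups)
table(part_groups)
```

The main practical difference is where the number of clusters enters: after the tree is built in (a), before the algorithm runs in (b).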

Here is an article:

If we proceed, then it would be an additional dialogue - or perhaps 2 dialogues - under Describe > Multivariate.

A related aspect is a heat map. We need this sort of graph anyway, in our Specific Plots, where it is for a variate, like yield, as a function of 2 factors, e.g. farmer and treatment.

Heat maps are also often constructed from Variables by rows, and then dendrograms are added. This is all part of cluster analysis and we should include that aspect of heat maps as part of clustering. I suggest we need both.

In data science, classification is a big part and a distinction is made between supervised and unsupervised classification. Cluster analysis IS unsupervised classification!

rdstern commented 3 years ago

I think I now have the methods to use for clustering and it is simpler than I thought! That's a relief. I am now happy with the book called "Practical Guide to Cluster Analysis in R: Unsupervised Machine Learning"

More than that, we can follow the book - which, like our others recently, can be read online freely. And that's useful, because the code can be followed.

Cluster analysis needs the data to be "prepared". That's fine, we have a Prepare menu. So I suggest that a preliminary step, which follows the commands in the first chapters of the book, is a new Transform dialogue. This can go in the Prepare > Data Reshape menu.

I suggest 3 (no 2) buttons at the top, namely (Omit.na), Scale, Distance. The selector, in each case, allows either a data frame or a set of variables from a data frame. It returns a new matrix (from dist) and a new data frame (from scale). I assume a matrix can be displayed as a new data frame and still be treated as a matrix if it qualifies?

I now assume we include the omit.na feature (na.omit in R) on the Scale button, and permit just the omit.na step to be done if that is all that is wanted.

In dist we also include the more general daisy function.
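As a rough sketch of the Distance button, and of the question about showing a matrix as a data frame: the airquality and iris datasets, the chosen columns and the "gower" metric below are illustrative assumptions, not decisions from this thread.

```r
library(cluster)  # provides daisy(); a recommended package shipped with R

# Numeric case: drop incomplete rows, standardise, then compute distances.
num <- scale(na.omit(airquality[, c("Ozone", "Wind", "Temp")]))
d <- dist(num)                        # class "dist", stores the lower triangle only

# A "dist" object can be turned into a square table for display ...
d_df <- as.data.frame(as.matrix(d))
# ... and converted back to a "dist" object when it is needed for clustering.
d_back <- as.dist(as.matrix(d_df))

# Mixed variable types: daisy() accepts factors as well as numeric columns.
mixed <- data.frame(len = iris$Sepal.Length, species = iris$Species)
d_mixed <- daisy(mixed, metric = "gower")
```

So a distance matrix can be displayed as a data frame, provided it is converted back with as.dist() before being passed to a clustering function.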

Then we have clustering using the cluster package and probably factoextra to give ggplots of the results. I assume we can have a single dialogue with 2 buttons for Hierarchical and Non-hierarchical, but let's see.

This can all follow the materials in the first 7 chapters of the book. Those are the initial tasks, and they go up to page 78 in his book. He then continues to page 116 on dendrograms and heat maps. It is much less obvious what we do there, so let's leave that for later!

Vitalis95 commented 3 years ago

@rdstern I have gone through the issue for the first part (preparing the data) and here is the summary of what I understood.

The dialog should have two buttons at the top: Scale and Distance.
Under the Scale button we standardize the data using the scale() function. Here also, we remove any missing values that might be present in the data using the na.omit() function. I think this should be a checkbox linked to the Scale button.
For computing distances between pairs of rows we have dist() for numeric data and daisy() for other variable types. The functions should have checkboxes, nud and combobox controls for the parameters. Then the distance object that is generated is reformatted into a matrix using the as.matrix() function.

I am sketching the dialog, I will share with you.

rdstern commented 3 years ago

@Vitalis95 I think you are missing some checkboxes and you don't yet have the control to save the new data frame. As you say, the top of the dialogue will look something like Prepare > Check Data > Visualise

That has the option to use a whole data frame or to use selected variables. With the selected variables it will look like this:

Then, as in the dialogue above, I suggest 3 checkboxes on the left. The first has the label Omit Missing Rows and is unchecked. The second is Center Each Variable and is checked. The third is Scale Each Variable and is checked. Below that is a control to save the results:

I suggest you start by getting this option working in a new dialogue, and then add the second button later.
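The three checkboxes map fairly directly onto na.omit() and the center and scale arguments of scale(). A minimal sketch follows; the helper name prepare_for_clustering and its argument names are hypothetical (they are not existing R-Instat functions), and airquality is used only as example data.

```r
# Hypothetical helper: one argument per checkbox, with the suggested defaults.
prepare_for_clustering <- function(data,
                                   omit_missing = FALSE,  # Omit Missing Rows (unchecked)
                                   center_vars = TRUE,    # Center Each Variable (checked)
                                   scale_vars = TRUE) {   # Scale Each Variable (checked)
  if (omit_missing) data <- na.omit(data)
  # scale() returns a matrix, so convert back to a data frame for saving.
  as.data.frame(scale(data, center = center_vars, scale = scale_vars))
}

vars <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
default_result  <- prepare_for_clustering(vars)                       # NAs are kept
complete_result <- prepare_for_clustering(vars, omit_missing = TRUE)  # incomplete rows dropped
```

With only Omit Missing Rows checked, the call reduces to na.omit(), which covers the case where just the missing-row step is wanted.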

rdstern commented 3 years ago

@Vitalis95 I wonder how you are getting on with this new dialogue? It will be followed by further new dialogues for cluster analysis. I notice you are also working with @shadrackkibet on the new Rename dialogue. That's all good and both are important. If you are feeling overloaded, then perhaps also involve one of the others in some of the tasks? I now wonder about calling the menu item simply Scale/Distance and putting it at the bottom of the Data: Reshape menu. Could you also add a help context number, which is 599.

Vitalis95 commented 3 years ago

[Image: sketch of the cluster analysis dialog]

@rdstern @shadrackkibet, this is the sketch for the cluster analysis dialog. I have started with the first button, Partitioning Clustering. I wanted to ask whether a receiver for selecting columns is required, because clustering requires a numeric matrix of data. I welcome your suggestions on this new dialog, and also on whether more parameters should be implemented. Thanks

rdstern commented 3 years ago

@Vitalis95 I am delighted you are at last getting started on this. I think you can get a bit further though. I have looked quickly again at the book above and see he uses the cluster and then his own factoextra package.
So I looked quickly at the cluster package. I assume you may start by being able to run the pam and agnes functions? Is that what you have in mind? If so, then I see that you can give a set of numerical variables - so one option will be to have a multiple receiver. Or you can give a dissimilarity matrix, which is (I think) what you have produced with your existing dialogue in Prepare?

Then I suggest you try running these in the script window, or RStudio, to start with - perhaps using the book, or (more easily) some of the examples. Use library(cluster) in the script window to make your life easier if you start there.

That should inform you if you are on the right track and which options from those commands you need to include.
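As a starting point for that experiment, the sketch below runs pam and agnes both ways, from numeric variables and from a dissimilarity object. The USArrests data, k = 4 and the "ward" linkage are assumptions chosen only for illustration.

```r
library(cluster)

x <- scale(USArrests)   # built-in example data, numeric variables only
d <- dist(x)            # dissimilarities, as a Scale/Distance step would produce

# Partitioning around medoids: the number of clusters k must be given.
pam_from_data <- pam(x, k = 4)   # numeric variables supplied directly
pam_from_diss <- pam(d, k = 4)   # "dist" object; diss = TRUE is inferred from its class

# Agglomerative hierarchical clustering: no k needed until the tree is cut.
agnes_from_data <- agnes(x, method = "ward")
agnes_from_diss <- agnes(d, method = "ward")

table(pam_from_data$clustering)
agnes_groups <- cutree(as.hclust(agnes_from_data), k = 4)
table(agnes_groups)
```

Both functions accept either input, so a dialogue could offer a multiple receiver for numeric variables and, separately, a selector for a previously saved dissimilarity matrix.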

N-thony commented 2 years ago

@Vitalis95 Is this still in the Review in Progress column? If not, then please move it to the respective column.

IDEMSInternational / R-Instat

Add cluster analysis to R-Instat #6195