IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

New corrr package in R for correlations #6121

Open lilyclements opened 3 years ago

lilyclements commented 3 years ago

I came across the new corrr package, which handles correlations in R. I thought I should summarise the main aspects of the package in case any of its features could fit into R-Instat.

Main Correlation Function

The function to perform correlations is correlate. This function seemingly runs the same as the cor function we currently use; however, there are a few minor differences.
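Two differences that can be checked directly are the return type and the diagonal. The following is a minimal sketch, assuming a corrr version in which correlate() accepts quiet = TRUE and returns an object of class "cor_df"; it is not taken from the R-Instat code.

```r
library(corrr)

m  <- cor(mtcars)                        # base R: numeric matrix, diagonal is 1
cd <- correlate(mtcars, quiet = TRUE)    # corrr: tibble of class "cor_df"

class(m)
class(cd)

# In the cor_df the first column holds the variable names and the
# remaining columns hold the correlations. The diagonal is NA by default.
cd$mpg[1]                  # mpg with itself
cd$mpg[2]                  # mpg with cyl
m["cyl", "mpg"]            # same pair taken from the base R matrix
```

Because the result is a data frame rather than a matrix, it can be passed straight into dplyr-style pipelines.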

Plotting Functions

There are two new functions with respect to plotting.

rplot plots a correlation data frame using ggplot2. There are options to amend the plot (order variables alphabetically, add the correlation values, etc.). The standard plot is given by rplot(correlate(mtcars))

image
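As an illustration of amending the plot, the sketch below prints the correlation values on it. It assumes the installed corrr version provides shave() and the print_cor argument of rplot(); the object names cd and p are only for illustration.

```r
library(corrr)

cd <- correlate(mtcars, quiet = TRUE)

# shave() sets the upper triangle to NA so each pair is drawn once;
# print_cor = TRUE writes each coefficient on the plot.
p <- rplot(shave(cd), print_cor = TRUE)
p   # rplot() returns a ggplot object, so further ggplot2 layers can be added
```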

The other plot is network_plot. This is not plotting correctly on this laptop, so I will try on another one. According to the linked example, it should look like this (they have plotted only five variables, mpg to drat).

mtcars %>% 
  correlate() %>% 
  focus(mpg:drat, mirror = TRUE) %>% 
  network_plot()

image

Other Functions

There are a few functions to help "clear up" the correlation matrix, which are probably not so relevant here. But I'll summarise a few of them:

# A tibble: 121 x 3
  x     y          r
  mpg   mpg       NA
  mpg   cyl   -0.852
  mpg   disp  -0.848
  mpg   hp    -0.776
  mpg   drat   0.681
  mpg   wt    -0.868
  mpg   qsec   0.419
  ...
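The long x / y / r layout above matches what corrr's stretch() returns for the 11 mtcars variables (11 x 11 = 121 rows). A minimal sketch, assuming stretch() is the function that produced that output:

```r
library(corrr)

cd   <- correlate(mtcars, quiet = TRUE)
long <- stretch(cd)     # one row per ordered pair of variables

dim(long)               # 121 rows, 3 columns
names(long)             # "x" "y" "r"
head(long)

# stretch() also has na.rm and remove.dups arguments for dropping the
# NA diagonal and the mirrored duplicate pairs.
stretch(cd, na.rm = TRUE, remove.dups = TRUE)
```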

As a final sidenote, I noticed on dlgCorrelations that the "Options" button is at the end of the ucrSave. Since there is now the new "Position" button on a ucrSave, this "Options" could be confused with options for the saving, rather than the dialog. I suggest it is made a bit smaller, and shifted left so that it aligns with the end of the comment box.

lilyclements commented 3 years ago

Another feature of the correlate function in this package that I have come across is how it deals with missing values.

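By default (use = "everything"), the usual cor function returns NA for any correlation involving a variable that has a missing value. The correlate function instead defaults to use = "pairwise.complete.obs", so each correlation is calculated from the rows that are complete for that pair of variables: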

library(tidyverse)
library(corrr)

data(mtcars)
mtcars[5, 5] <- NA   # set a value as missing in the "drat" variable (column 5)

cor(mtcars$drat, mtcars$wt)   # correlation is NA
cor(mtcars)                   # all correlations involving "drat" are NA

correlate(mtcars$drat, mtcars$wt)   # correlation is -0.715
correlate(mtcars)                   # correlation is given for all variables despite NAs

mtcars.complete <- mtcars %>% filter(complete.cases(mtcars))   # keep only the complete cases
cor(mtcars.complete$drat, mtcars.complete$wt)                  # gives the same value as the correlate function
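The same behaviour is available from base R through the use argument of cor, which may matter if R-Instat keeps using cor. The sketch below uses only base R, and the object names d, pairwise and complete are illustrative. It also shows that pairwise deletion and full complete-case deletion agree for pairs involving drat but not for the other pairs.

```r
d <- mtcars
d[5, "drat"] <- NA   # same missing value as in the example above

pairwise <- cor(d, use = "pairwise.complete.obs")   # drop rows pair by pair
complete <- cor(d, use = "complete.obs")            # drop row 5 for every pair

# Pairs involving drat agree: row 5 is dropped in both cases.
all.equal(pairwise["drat", "wt"], complete["drat", "wt"])       # TRUE

# Pairs not involving drat keep all 32 rows under pairwise deletion,
# so they match the correlation from the untouched data.
all.equal(pairwise["mpg", "wt"], cor(mtcars$mpg, mtcars$wt))    # TRUE

# Under complete-case deletion the same pair is computed on 31 rows,
# so its value differs slightly.
complete["mpg", "wt"]
```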