AntoineSoetewey / statsandr

A blog on statistics and R aiming at helping academics and professionals working with data to grasp important concepts in statistics and to apply them in R. See www.statsandr.com

blog/correlogram-in-r-how-to-highlight-the-most-correlated-variables-in-a-dataset/ #23

Closed utterances-bot closed 3 years ago

utterances-bot commented 3 years ago

Correlogram in R: how to highlight the most correlated variables in a dataset - Stats and R

Make the most correlated variables stand out via a correlogram. See also how to enhance a correlation plot to show significant correlations among variables.

https://statsandr.com/blog/correlogram-in-r-how-to-highlight-the-most-correlated-variables-in-a-dataset/

AntoineSoetewey commented 3 years ago

Comment written by Rick on February 24, 2020 02:47:58:

Please look at my generalCorr package on CRAN, which computes generalized correlations.
It allows for nonlinear relations and generally gives better fits, as you can see below.
It is easy to use: a single command.
The generalized correlation matrix is not symmetric:
the cause is in the column and the effect along the rows.
The relative size of the off-diagonal entries helps identify causal paths.

> options(np.messages=FALSE)
> gmcmtx0(mtcars)
            mpg        cyl       disp         hp        drat         wt
mpg   1.0000000 -0.8557900 -0.9508994 -0.9379374  0.68455456 -0.9162983
cyl  -0.9433125  1.0000000  0.9759183  0.9583212 -0.75124945  0.8717790
disp -0.8941676  0.9151419  1.0000000  0.9306311 -0.76973716  0.9014653
hp   -0.8530474  0.8446589  0.8170031  1.0000000 -0.55427987  0.6931286
drat  0.6878267 -0.7015970 -0.9458881 -0.7434288  1.00000000 -0.7502746
wt   -0.9174405  0.7823903  0.9678176  0.9187065 -0.73035365  1.0000000
qsec  0.7512435 -0.5908602 -0.6090980 -0.7539446  0.02921025 -0.1878419
vs    0.6942734 -0.8141714 -0.7820383 -0.8418120  0.51150265 -0.6366761
am    0.6031004 -0.5224062 -0.6715676 -0.5603310  0.98069420 -0.8383474
gear  0.4920451 -0.4957241 -0.6123745 -0.5830729  0.87635616 -0.6124648
carb -0.5572555  0.5681535  0.6718953  0.8553318 -0.84876146  0.5966302
           qsec         vs         am       gear        carb
mpg   0.7383513  0.6640279  0.5997807  0.6529907 -0.65305643
cyl  -0.8496465 -0.8108103 -0.5225462 -0.7274627  0.67555239
disp -0.7613347 -0.7104124 -0.5912007 -0.7666091  0.50701342
hp   -0.9267578 -0.7230940 -0.2416089 -0.6629877  0.78661538
drat  0.5493953  0.4402134  0.7127009  0.8316282 -0.02225962
wt   -0.7718667 -0.5548968 -0.6924849 -0.6575598  0.59412177
qsec  1.0000000  0.7445309 -0.2284076 -0.6333628 -0.66101228
vs    0.9505094  1.0000000  0.1649016  0.6176113 -0.69219670
am   -0.6116860  0.1665178  1.0000000  0.8089357  0.34500941
gear -0.6150966  0.2052627  0.7940532  1.0000000  0.55254965
carb -0.7746992 -0.5695923  0.0169436  0.4324314  1.00000000
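
For contrast with the nonsymmetric matrix above, here is a minimal base R sketch showing that the ordinary Pearson correlation matrix of the same dataset is symmetric. It only uses `cor()` and `isSymmetric()` from base R; the object name `pearson` is chosen here for illustration and is not part of the generalCorr package.

```r
# Pearson correlation matrix from base R, for comparison with the output above
pearson <- cor(mtcars)

# TRUE: entry [i, j] equals entry [j, i], so rows and columns are interchangeable
isSymmetric(pearson)

# Largest absolute gap between mirrored entries (essentially zero here).
# In the generalized matrix above, mirrored entries differ, for instance
# -0.8557900 for [mpg, cyl] versus -0.9433125 for [cyl, mpg].
max(abs(pearson - t(pearson)))
```

Because the Pearson matrix carries no direction, only a nonsymmetric matrix such as the one printed above allows comparing an entry with its mirror image.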

AntoineSoetewey commented 3 years ago

Comment written by Antoine Soetewey on February 25, 2020 15:34:34:

Thanks for pointing this out, Rick. I'll definitely have a look at your package!

AntoineSoetewey commented 3 years ago

Comment written by Bernardo Lares on March 04, 2020 19:57:37:

Hi!

With lares::corr_var() and lares::corr_cross() you can actually run correlations for a whole data.frame with numerical, logical, categorical, and date variables. It might complement your function. Check it out on GitHub: laresbernardo/lares

Cheers!

AntoineSoetewey commented 3 years ago

Comment written by Antoine Soetewey on March 06, 2020 10:04:46:

Hi Bernardo,

Thanks a lot for your input and for this nice package. I have added a section explaining the corr_cross() and corr_var() functions. Feel free to let me know if there is any inconsistency.

Regards, Antoine

AntoineSoetewey commented 3 years ago

Comment written by Bernardo Lares on March 06, 2020 13:08:35:

Awesome, thanks for sharing. I'd replicate that example but with a dataset that contains categorical values so the value of the function can be fully understood. You can use data(dft) when lares is loaded to get the Titanic dataset.

Here is another post that will complement your information:

Cheers and glad to know lares is already in two of your posts! :)

AntoineSoetewey commented 3 years ago

Comment written by Antoine Soetewey on April 12, 2020 07:25:39:

Sorry for my late reply. Like your previous comment, it was categorised as spam due to the links. I'll make sure to check my spam more often.

I tend to use the same dataset for the entire article to make it simple, and as the first part is only possible with quantitative variables, I prefer to use only quantitative variables in the illustration of your functions.

I'll make sure to include categorical variables if I use this code again in a future article. I have included a link to your article if readers want more information about your package.

Regards, Antoine

AntoineSoetewey commented 3 years ago

Comment written by Robert C Cline, Sr. on August 16, 2020 16:02:08:

Your function corrplot2 runs for me in RMarkdown.  I tried to save the corrplot2 function in a snippet.  There are no variables to add in the snippet, so it is your function verbatim.  When I call the snippet to R script, it first requires the psych package.  After loading the psych package, when I run the function, it loads with the following comment:

" bytecode: 0x000002974bb3dc48"

I then run the function after assigning

dat <- mtcars
corrplot2(
  data = dat,
  method = "pearson",
  sig.level = 0.05,
  order = "original",
  diag = FALSE,
  type = "upper",
  tl.srt = 75
)

The function fails with the following message:

Error in cor.mtest(data, method = method) : object 'tmp.value' not found

Do you know how to resolve this? 

Thank you.
Robert

AntoineSoetewey commented 3 years ago

Comment written by Antoine Soetewey on August 17, 2020 06:44:23:

Dear Robert,

I tried your code on my machine, and it works perfectly. See the image below for the code and the result:

[Image: screenshot of the code and the resulting correlogram]

Hope this helps.

Regards,
Antoine
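
The error message above names `cor.mtest()` and a missing object `tmp.value`. As a rough illustration of what such a helper typically does, here is a minimal sketch that builds a matrix of pairwise p-values with `cor.test()`. The function name `cor_pmat` is hypothetical and this is not necessarily the article's exact code. One possible cause of the error, which is an assumption and not confirmed anywhere in this thread: RStudio snippets treat `$` as a special character, so a literal `$` has to be written as `\$` in the snippet definition, otherwise an expression like `tmp$p.value` may be pasted as `tmp.value`.

```r
# Sketch of a helper that builds a matrix of pairwise correlation-test p-values.
# Illustration only; the name cor_pmat is hypothetical.
cor_pmat <- function(data, method = "pearson") {
  data <- as.data.frame(data)
  n <- ncol(data)
  p.mat <- matrix(NA_real_, n, n, dimnames = list(names(data), names(data)))
  diag(p.mat) <- 0
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(data[[i]], data[[j]], method = method)
      # The `$` is essential: if the characters `$p` are lost, this expression
      # becomes `tmp.value`, an object that does not exist
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
    }
  }
  p.mat
}

dat <- mtcars
p <- cor_pmat(dat)
round(p[1:3, 1:3], 4)
```

Comparing the pasted snippet with the original function around the `cor.test()` call is a quick way to check whether a `$` went missing.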

Piresmoo commented 3 years ago

Dear Antoine,

Thank you for your page on Correlogram in R. I intend to do a plot using method = "spearman", use = "pairwise.complete.obs". The script seems to use use = "complete.obs", and I can't find a way to change it. I have empty cells here and there, and I am looking for a good graphical representation that produces the same correlation coefficients as “proc corr spearman” in SAS (which is the case with method = "spearman", use = "pairwise.complete.obs").

I have also been looking for ways to produce a plot with an output similar to that obtained with the VAR and WITH commands in SAS PROC CORR. I found the website below, but I can't produce a matrix of p-values that matches the correlation values. This would make it possible to omit the nonsignificant correlations (https://www.statmethods.net/stats/correlations.html).

# Correlation matrix from mtcars
# with mpg, cyl, and disp as rows
# and hp, drat, and wt as columns
x <- mtcars[1:3]
y <- mtcars[4:6]
cor(x, y)

Thanks for your help.

Sincerely,
Piresmoo

AntoineSoetewey commented 3 years ago

Thanks for your question.

I don't use SAS so I cannot help you on this matter, but in lines 15-16 of the code, you should be able to change the use from "everything" (which is the default) to "pairwise.complete.obs".

Let me know if this fixes your issue.

Regards, Antoine
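
The two points raised in this exchange (Spearman correlations with pairwise deletion, and a rows-versus-columns layout with matching p-values) can be sketched in base R as follows. This is a minimal sketch, not the article's code: the helper name `cor_with_p`, the 0.05 threshold and the missing values injected into `mtcars` are assumptions made for illustration. It relies on `cor()` with `use = "pairwise.complete.obs"` and on the fact that `cor.test()` drops incomplete pairs itself, so both matrices use the same observations for each pair.

```r
# Hypothetical helper: correlations and p-values between the columns of x
# (rows of the result) and the columns of y (columns of the result)
cor_with_p <- function(x, y, method = "spearman") {
  x <- as.data.frame(x)
  y <- as.data.frame(y)
  r <- cor(x, y, method = method, use = "pairwise.complete.obs")
  p <- matrix(NA_real_, nrow = ncol(x), ncol = ncol(y),
              dimnames = list(colnames(x), colnames(y)))
  for (i in seq_len(ncol(x))) {
    for (j in seq_len(ncol(y))) {
      # cor.test() removes incomplete pairs, which matches pairwise deletion;
      # exact = FALSE avoids warnings about ties with rank-based methods
      p[i, j] <- cor.test(x[[i]], y[[j]], method = method,
                          exact = FALSE)$p.value
    }
  }
  list(r = r, p = p)
}

dat <- mtcars
dat[c(2, 5), "mpg"] <- NA  # a few empty cells, added only for illustration
dat[7, "wt"] <- NA

# mpg, cyl and disp as rows; hp, drat and wt as columns
res <- cor_with_p(dat[1:3], dat[4:6])

# Keep only the significant correlations (threshold assumed to be 0.05)
r_sig <- res$r
r_sig[res$p >= 0.05] <- NA
r_sig
```

Since `res$r` and `res$p` share the same dimensions and names, each p-value lines up with its correlation, which is what is needed to hide nonsignificant cells in a plot.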