STAT545-UBC / Discussion

Quickly correlating multiple variables #502

Open aramcb opened 6 years ago

aramcb commented 6 years ago

I have a dataframe (df) that looks like the one below (dput at bottom of post). There are 3 strains of animals (N2, YT17, KP4) with 6 response variables (e.g., probability, duration, speed, etc.) and a corresponding score (percent_diff) for each of the variables. I would like to see whether, across animal strains, the scores (percent_diff) are correlated with each other. For example, is a high probability score (percent_diff) correlated with a high duration score (percent_diff)?

Is there a quick way to draw a scatterplot correlating each variable's score with every other variable's score? For instance, a scatterplot where the x-value is percent_diff for duration and the y-value is percent_diff for speed?

I am aware I can spread the variable scores but that does not quickly solve the correlation issue.

Let me know if you have any tips for this! Thank you,

[image: screenshot of df]

df <- structure(list(strain = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("N2", "3-days-old", "VG88_1s", "YT17", "KP4", "VC1052", "DA1371", "CB120", "EK228", "FX05775", "KG518", "KG744", "KP1182", "lid_off", "MH24301", "PY1589", "RB1256", "RB824", "RM2710", "TM3577", "VC1052_cntm", "VC117", "VC20144", "VC228", "VM487"), class = "factor"), variable = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("prob", "dura", "spd", "mag", "dist", "alrev" ), class = "factor"), percent_diff = c(86.9321198291465, 76.4846917917094, 88.7306176350036, 57.5624885696056, 67.5176298547265, 85.2178945914899, 92.4254628567501, 76.2359628573487, 96.6405841321961, 70.2954995722212, 74.331748324351, 80.7151938970121, 63.5297840817911, 64.7310412896858, 90.1554309717398, 41.383659140458, 59.6974911225825, 91.6167664670659 )), .Names = c("strain", "variable", "percent_diff"), row.names = c(NA, -18L), class = "data.frame")

dtavern commented 6 years ago

You can compute the correlation coefficients as an n x n matrix (where n is the number of variables of interest) with cor() and visualize that matrix using the corrplot package.

The corrplot package has a function, also called corrplot(), that plots the output of cor(). Your data need to be in wide format, with each column corresponding to a variable of interest.

Example, assuming df is already in wide format and its first three columns are numeric: corrplot(cor(df[, 1:3]), method = "number")

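However, cor() computes Pearson's correlations by default, which only measure linear relationships. If you expect a monotonic but non-linear relationship, cor(..., method = "spearman") gives rank-based correlations instead.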
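To make the wide-format step concrete, here is a minimal sketch using the data from the question. It assumes percent_diff values rounded to two decimals (typed in by hand from the dput() above) and uses base R's tapply() for the reshape; tidyr would work just as well. The names wide, cor_pearson and cor_spearman are only illustrative. The corrplot call is guarded so the rest still runs if the package is not installed.

```r
# Rounded copy of the percent_diff values from the dput() in the question.
df <- data.frame(
  strain   = rep(c("N2", "YT17", "KP4"), each = 6),
  variable = rep(c("prob", "dura", "spd", "mag", "dist", "alrev"), times = 3),
  percent_diff = c(86.93, 76.48, 88.73, 57.56, 67.52, 85.22,
                   92.43, 76.24, 96.64, 70.30, 74.33, 80.72,
                   63.53, 64.73, 90.16, 41.38, 59.70, 91.62)
)

# Long -> wide: one row per strain, one column per variable.
# Each strain/variable cell holds a single value, so mean() just returns it.
wide <- tapply(df$percent_diff, list(df$strain, df$variable), mean)

# Pairwise correlations between the six variables.
cor_pearson  <- cor(wide)                       # Pearson is the default
cor_spearman <- cor(wide, method = "spearman")  # rank-based alternative

print(round(cor_pearson, 2))

# Draw the correlation matrix only if corrplot is available.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_pearson, method = "number")
}
```

Keep in mind that with only three strains each correlation is estimated from three points, so the coefficients will be very noisy.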

(Check out the vignette here)

If you want to plot each variable against every other one, there are ways to do this in base graphics, lattice, and ggplot2.
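For the base graphics route, a rough sketch could look like the following. It assumes the same hand-rounded copy of the question's data and the same tapply() reshape (the name wide is illustrative); pairs() then draws every variable against every other one, and a single plot() call covers one specific pair such as duration versus speed.

```r
# Rounded copy of the percent_diff values from the dput() in the question.
df <- data.frame(
  strain   = rep(c("N2", "YT17", "KP4"), each = 6),
  variable = rep(c("prob", "dura", "spd", "mag", "dist", "alrev"), times = 3),
  percent_diff = c(86.93, 76.48, 88.73, 57.56, 67.52, 85.22,
                   92.43, 76.24, 96.64, 70.30, 74.33, 80.72,
                   63.53, 64.73, 90.16, 41.38, 59.70, 91.62)
)

# One row per strain, one column per variable.
wide <- tapply(df$percent_diff, list(df$strain, df$variable), mean)

# Scatterplot matrix: each panel plots one variable's score against another's,
# with one point per strain.
pairs(wide, main = "percent_diff by variable (one point per strain)")

# A single pair: duration on the x-axis, speed on the y-axis.
plot(wide[, "dura"], wide[, "spd"],
     xlab = "percent_diff (dura)", ylab = "percent_diff (spd)")
# Label each point with its strain name.
text(wide[, "dura"], wide[, "spd"], labels = rownames(wide), pos = 3)
```

When run non-interactively (for example with Rscript), the plots are written to the default graphics device rather than shown on screen.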

There's a good chapter on data exploration in Analyzing Ecological Data by Zuur et al. (2007; available as an eBook via the UBC library) that outlines various visualization methods for digging into your data.

aramcb commented 6 years ago

@dtavern ah what a great package! thx!