davidcarslaw / openair

Tools for air quality data analysis
https://davidcarslaw.github.io/openair/
GNU General Public License v2.0
305 stars 113 forks source link

Difference between corPlot() and cor() with missing values. #375

Open CamiloMontesM opened 10 months ago

CamiloMontesM commented 10 months ago

Question

Hello, I am comparing 2 correlation methods in R: corPlot and cor (from statspackage). The results of the correlation coefficients between corPlot and cor are identical when in the cor function I use use = "pairwise.complete.obs".

However, when using use = "na.or.complete" in the cor function, the results are completely different in some cases. In the example below, the pair C1 and E4 have no missing values and yet the correlation coefficient using use = "na.or.complete" in cor function, is different from that delivered by corPlot.

Why does corPlot use the "pairwise.complete.obs" method, instead of other methods to take care of missing values?

library(openair)

data <- tibble::tribble(
  ~Inicio.muestreo, ~Parámetro, ~E4, ~E3, ~E2, ~E1, ~E10, ~E9, ~E8, ~E7, ~E6, ~E5, ~O1, ~C1, ~C2, ~C3, ~GNL1, ~GNL2,
      "2023-05-25",  "Benceno",   1.87,    2.9,   4.18,   3.93,    1.82,   2.18,   1.43,   1.34,   1.06,   1.18,   2.79,   3.08,   4.46,   2.25,      1.3,     1.15,
      "2023-06-22",  "Benceno",    0.5,   0.81,   0.63,    0.6,    1.46,   1.26,   0.81,   0.78,   0.49,   0.61,     NA,   2.58,   2.53,   1.21,      0.4,       NA,
      "2023-04-13",  "Benceno",   0.61,   0.71,   1.41,    0.8,    1.14,   0.83,   0.88,   0.86,   0.56,   0.52,   0.68,   3.47,    2.2,   0.94,     0.43,     0.49,
      "2023-06-08",  "Benceno",   1.08,   1.04,   1.09,   1.11,    1.49,   1.42,   1.12,   1.19,   0.85,   0.89,   1.01,   2.86,   3.58,   1.66,     0.82,     0.92,
      "2023-05-10",  "Benceno",   1.91,   3.04,   3.03,   2.76,    1.59,   1.55,   1.43,   1.32,   1.12,   1.21,   2.34,   3.95,   3.38,   1.83,     1.71,     1.62,
      "2023-04-27",  "Benceno",   0.85,   0.99,   1.38,   1.25,    1.65,   1.52,   1.12,   1.07,    0.8,   0.76,   0.89,   4.01,   3.18,   2.01,     0.76,     0.69,
      "2023-07-06",  "Benceno",   1.16,   1.31,   1.51,   1.49,    1.74,   1.62,   1.37,   1.15,   0.95,   1.01,   1.17,   2.72,     NA,     NA,     1.05,     0.99,
      "2023-07-20",  "Benceno",    0.9,   1.01,   0.93,   1.11,    2.07,   1.45,   1.15,   0.97,   0.65,   0.66,   0.86,   2.78,   2.41,   1.33,     0.99,     0.86,
      "2023-08-17",  "Benceno",   0.57,   0.48,   0.62,   0.86,    0.77,   0.79,    0.7,   0.58,   0.47,   0.56,   0.73,   1.48,   1.29,   0.77,     0.56,     0.58,
      "2023-08-31",  "Benceno",    0.6,   0.59,   0.65,   1.06,    1.34,   1.42,   1.05,   0.66,   0.58,    0.6,   0.91,   2.41,   2.67,   1.33,     0.57,     0.74,
      "2023-09-14",  "Benceno",   0.58,   0.61,   0.74,   1.15,    0.87,   0.77,   0.72,   0.64,   0.49,   0.41,   0.81,   1.59,   1.16,   0.73,     0.56,     0.59,
      "2023-09-28",  "Benceno",   0.57,   0.66,   1.12,    0.5,    1.47,   0.87,   0.67,   0.72,   0.61,   0.63,   0.87,   2.16,   1.36,   0.68,     0.57,     0.67,
      "2023-10-12",  "Benceno",   0.67,   0.87,   1.78,   0.96,    1.12,   0.77,   0.82,   0.75,   0.55,   0.71,   0.92,   3.81,   1.23,   0.66,     0.51,     0.54,
      "2023-10-25",  "Benceno",   0.92,   0.47,   0.49,    0.5,    1.12,   1.02,    0.7,   0.54,   0.37,   0.38,   0.57,   3.59,   1.14,   0.59,     0.43,      0.3
  )

corPlot(data[,3:18],
        lower = TRUE,
        method = "spearman",
        dendrogram = TRUE,
        cluster = TRUE)

cor(data[,3:18],
    method = "spearman",
    use = "pairwise.complete.obs") # same to openair

cor_b <- cor(data[,3:18],
    method = "spearman",
    use = "na.or.complete") # different to openair

Thank you very much.

Camilo Montes.

davidcarslaw commented 10 months ago

Hi Camilo

Thanks for this question and the good example. The question is a good one and ultimately depends on the nature of the data being looked at. Originally it was thought that "pairwise.complete.obs" would be a sensible default. However, the function really ought to give the user the option of changing this important parameter. I can see there would be some situations where "pairwise.complete.obs" would not be appropriate and could be potentially misleading.

I will amend the function to explicitly pass on that argument and make it clearer what the default is. This would not take long and will appear on GitHub before a new release to CRAN...

Thanks again for your message.

David

CamiloMontesM commented 10 months ago

Thank you very much for your quick reply, it is very clear to me and I will wait for a new version of the function.

Best regards.