bnaras / PMA


add names to matrix and vector outputs #5

Open corybrunson opened 2 years ago

corybrunson commented 2 years ago

The matrices u and v and the vectors d and cors in the output of PMA::CCA(), for example, are unnamed. Maybe this is intentional, for compatibility with certain routines. But for other purposes it would be helpful to have the column names of the input data matrices x and z incorporated into the output, and for the dimensions to be given canonical names. Preserving names from the input data in the output would, for example, make it easier to read the output and to create other named objects from it, as below.

If this is of interest, then I would be glad to submit a PR with suggested assignments for the outputs of SPC(), CCA(), and MultiCCA(). Thank you!

I would suggest the following for the CCA() output, for example:

library(PMA)

# CCA of life cycle savings data
savings_cca <- CCA(
  LifeCycleSavings[, c(2L, 3L)],
  LifeCycleSavings[, c(1L, 4L, 5L)],
  K = 2L, penaltyx = .7, penaltyz = .7
)
#> 12
#> 12

# without names
print(savings_cca$u)
#>      [,1] [,2]
#> [1,]   -1    1
#> [2,]    0    0
print(savings_cca$v)
#>           [,1]        [,2]
#> [1,] 0.2422123 -0.98598522
#> [2,] 0.9702233  0.14634075
#> [3,] 0.0000000 -0.08010956
# with names (suggested)
rownames(savings_cca$u) <- names(LifeCycleSavings)[c(2L, 3L)]
rownames(savings_cca$v) <- names(LifeCycleSavings)[c(1L, 4L, 5L)]
colnames(savings_cca$u) <- colnames(savings_cca$v) <-
  paste0("sCD", seq(savings_cca$K))
print(savings_cca$u)
#>       sCD1 sCD2
#> pop15   -1    1
#> pop75    0    0
print(savings_cca$v)
#>           sCD1        sCD2
#> sr   0.2422123 -0.98598522
#> dpi  0.9702233  0.14634075
#> ddpi 0.0000000 -0.08010956

# one benefit: data frame names
print(as.data.frame(savings_cca$u))
#>       sCD1 sCD2
#> pop15   -1    1
#> pop75    0    0
tibble::rownames_to_column(as.data.frame(savings_cca$v), var = "response")
#>   response      sCD1        sCD2
#> 1       sr 0.2422123 -0.98598522
#> 2      dpi 0.9702233  0.14634075
#> 3     ddpi 0.0000000 -0.08010956

# another benefit: reveal matrix multiplication error
t(savings_cca$u) %*% diag(savings_cca$d) %*% t(savings_cca$v) # wrong row names
#>             sr       dpi ddpi
#> sCD1 -10.01703 -40.12494    0
#> sCD2  10.01703  40.12494    0
savings_cca$u %*% diag(savings_cca$d) %*% t(savings_cca$v) # right row names
#>              sr      dpi      ddpi
#> pop15 -22.60722 -38.2563 -1.022931
#> pop75   0.00000   0.0000  0.000000

Created on 2022-02-04 by the reprex package (v2.0.1)
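
The manual assignments above could be wrapped in a small helper, roughly as sketched here. This is only an illustration: `name_cca_output()` and its `prefix` argument are hypothetical and not part of PMA, and the `fit` list is a stand-in built from the u and v values printed above (the d values are placeholders), so the sketch runs without the package.

```r
# Hypothetical helper (not part of PMA): attach names to a CCA-like result.
name_cca_output <- function(fit, xnames, znames, prefix = "sCD") {
  dim_names <- paste0(prefix, seq_len(fit$K))
  dimnames(fit$u) <- list(xnames, dim_names)  # rows: x variables
  dimnames(fit$v) <- list(znames, dim_names)  # rows: z variables
  names(fit$d) <- dim_names
  # cors is named only if the object carries it
  if (!is.null(fit$cors)) names(fit$cors) <- dim_names
  fit
}

# Stand-in for a fitted object; u and v are copied from the output above,
# while d holds placeholder values for illustration only.
fit <- list(
  u = matrix(c(-1, 0, 1, 0), nrow = 2),
  v = matrix(
    c(0.2422123, 0.9702233, 0, -0.98598522, 0.14634075, -0.08010956),
    nrow = 3
  ),
  d = c(1, 1),
  K = 2L
)

named <- name_cca_output(
  fit,
  xnames = c("pop15", "pop75"),
  znames = c("sr", "dpi", "ddpi")
)
print(named$u)
print(named$v)
```

Using `seq_len(fit$K)` rather than `seq(fit$K)` is a small defensive choice in the sketch; the two agree for any positive K.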

corybrunson commented 2 years ago

I realize now that the arguments xnames and znames partially resolve this issue. I think it would be appropriate for them to default to colnames(x) and colnames(z), respectively, and this would be part of the proposed PR. I apologize for overlooking that!
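
A minimal sketch of what such defaults might look like, assuming nothing about how CCA() is actually written: `cca_defaults_sketch()` is a hypothetical function that only illustrates the argument defaults and does no computation.

```r
# Hypothetical function (not PMA's CCA()): only the argument defaults matter here.
cca_defaults_sketch <- function(x, z,
                                xnames = colnames(x),
                                znames = colnames(z)) {
  # Default arguments are evaluated lazily, so force them before
  # x or z could be modified later in the function body.
  force(xnames)
  force(znames)
  list(xnames = xnames, znames = znames)
}

x <- LifeCycleSavings[, c(2L, 3L)]
z <- LifeCycleSavings[, c(1L, 4L, 5L)]

# Defaults pick up the column names of the inputs.
res_default <- cca_defaults_sketch(x, z)
print(res_default$xnames)
print(res_default$znames)

# Explicitly supplied names still override the defaults.
res_custom <- cca_defaults_sketch(x, z, xnames = c("young", "old"))
print(res_custom$xnames)
```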

bnaras commented 2 years ago

I pushed a commit with the defaults for xnames and znames.

corybrunson commented 2 years ago

@bnaras very cool, thank you!

corybrunson commented 2 years ago

It looks like the names may not be preserved through the process. If x and z are matrices, then names() returns NULL rather than their column names; and when they are data frames, the scale() calls inside CCA() convert them to matrices before names() is called. Both problems should be solved by replacing names() with colnames(), which works on data frames and on matrices alike.

library(PMA)
sessioninfo::session_info(pkgs = "PMA")
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 3.6.0 (2019-04-26)
#>  os       macOS  10.15.7
#>  system   x86_64, darwin15.6.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2022-02-06
#>  pandoc   2.16.2 @ /usr/local/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package * version date (UTC) lib source
#>  PMA     * 1.2-2   2022-02-06 [1] Github (bnaras/PMA@8e3fd29)
#> 
#>  [1] /Users/jason.brunson/Library/R/3.6/library
#>  [2] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

# names of data frame inputs
names(LifeCycleSavings)
#> [1] "sr"    "pop15" "pop75" "dpi"   "ddpi"

# CCA of life cycle savings data
savings_cca <- CCA(
  as.matrix(LifeCycleSavings[, c(2L, 3L)]),
  as.matrix(LifeCycleSavings[, c(1L, 4L, 5L)]),
  K = 2L, penaltyx = .7, penaltyz = .7
)
#> 12
#> 12

# missing names
savings_cca$u
#>      [,1] [,2]
#> [1,]   -1    1
#> [2,]    0    0
savings_cca$xnames
#> NULL

Created on 2022-02-06 by the reprex package (v2.0.1)
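
The difference between names() and colnames() described above can be checked in base R alone, without PMA. The object names `x_df`, `x_mat`, and `x_scaled` are just for illustration.

```r
x_df  <- LifeCycleSavings[, c(2L, 3L)]  # data frame input
x_mat <- as.matrix(x_df)                # matrix input
x_scaled <- scale(x_df)                 # scale() returns a matrix

# names() finds column names only on the data frame
print(names(x_df))
print(names(x_mat))     # NULL
print(names(x_scaled))  # NULL

# colnames() finds them in all three cases
print(colnames(x_df))
print(colnames(x_mat))
print(colnames(x_scaled))
```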

bnaras commented 2 years ago

Gah, that was my bad. Pushed a commit.