corybrunson / ordr

manage ordinations and render biplots in a tidyverse workflow
https://corybrunson.github.io/ordr/
GNU General Public License v3.0
20 stars 5 forks source link

projection (prediction) and vector sum (interpolation) geoms #4

Open corybrunson opened 5 years ago

corybrunson commented 5 years ago

A geom layer geom_u_projection(from = i, to = j) should render a (by default dashed) line from the ordinates of the ith row of $U$ to the linear subspace containing those of the jth row of $V$, maybe with an optional cute right angle symbol. Both from and to could contain multiple indices. The projection should adopt the axes (primary or secondary) used by $U$.

jtr13 commented 3 years ago

First of all, thank you for this fantastic package (as well as for ggalluvial!). A dropped perpendicular feature would be a great addition: combined with the calibrated axes the perpendiculars really help explain the meaning of a biplot. I've been creating them with calibrate package in base R and have taken a stab at using calibrate calculations in ggplot2 but of course it would be nice to have a smoother interface.

library(calibrate)
#> Loading required package: MASS
    df <- data.frame(x = c(5, 8, 9), y = c(4, 13, 6))
    pca <- prcomp(df)
    scores <- pca$x
    loadings <- pca$rotation
    axisrange <- max(df[,"x"]) - min(df[,"x"])
    ticklab <- round(seq(min(df[,"x"])-.5*axisrange,
                         max(df[,"x"])+.5*axisrange,
                         by = axisrange/2))
    plot(scores, pch = 16, asp = 1, ylim = c(-5, 5))
    text(x = scores[,1]-.5, y = scores[,2], labels = 1:nrow(scores))
    c <- calibrate(g = loadings[,"PC1"],
                   y = df[,"x"] - mean(df[,"x"]),
                   tm = ticklab - mean(df[,"x"]),
                   Fr = scores,
                   tmlab = ticklab,
                   tl = .2,
                   lm = TRUE,
                   axislab="x",
                   where = 1,
                   labpos = 4,
                   dp = TRUE)
#> ---------- Calibration Results for  x  ---------------------
#> Length of 1 unit of the original variable =  1  
#> Angle                                     =  76.3 degrees
#> Optimal calibration factor                =  1  
#> Used calibration factor                   =  1  
#> Goodness-of-fit                           =  1  
#> Goodness-of-scale                         =  1  
#> ------------------------------------------------------------
    arrows(0, 0, loadings[,1], loadings[,2], length = .1, col = "red",
           lwd = 1.5)

Created on 2021-08-25 by the reprex package (v2.0.1)

library(calibrate)
#> Loading required package: MASS
    library(ggplot2)
    df <- data.frame(x = c(5, 8, 9), y = c(4, 13, 6))
    pca <- prcomp(df)
    scores <- pca$x
    loadings <- pca$rotation
    axisrange <- max(df[,"x"]) - min(df[,"x"])
    ticklab <- round(seq(min(df[,"x"])-.5*axisrange,
                         max(df[,"x"])+.5*axisrange,
                         by = axisrange/2))
    c <- calibrate(g = loadings[,"PC1"],
                   y = df[,"x"] - mean(df[,"x"]),
                   tm = ticklab - mean(df[,"x"]),
                   Fr = scores,
                   tmlab = ticklab,
                   tl = .3,
                   graphics = FALSE)
#> ---------- Calibration Results for    ----------------------
#> Length of 1 unit of the original variable =  1  
#> Angle                                     =  76.3 degrees
#> Optimal calibration factor                =  1  
#> Used calibration factor                   =  1  
#> Goodness-of-fit                           =  1  
#> Goodness-of-scale                         =  1  
#> ------------------------------------------------------------

    dfpoints <- data.frame(scores)
    dfpoints$xsdrop <- c$Fpr[,1]
    dfpoints$ysdrop <- c$Fpr[,2]
    dfaxis <- data.frame(x = c$M[1, 1], y = c$M[1, 2],
                         xend = c$M[nrow(c$M),1], yend = c$M[nrow(c$M), 2])
    dfticks <- data.frame(c$M, c$Mn, ticklab)
    colnames(dfticks) <- c("x", "y", "xend", "yend", "label")
    dfarrows <- data.frame(xend = loadings[,1], yend = loadings[,2])

    ggplot(dfpoints, aes(x = PC1, y = PC2)) +
        geom_point() +
        geom_segment(data = dfticks, aes(x = x, y = y, xend = xend, yend = yend), col = "blue") +
        geom_text(data = dfticks, aes(x = xend, y = yend, label = label), nudge_x = .3) +
        geom_segment(data = dfaxis, aes(x = x, y = y, xend = xend, yend = yend), col = "blue") +
        geom_segment(data = dfpoints, aes(x = PC1, y = PC2, xend = xsdrop, yend = ysdrop),
                     lty = "dashed", col = "cornflowerblue") +
        geom_segment(data = dfarrows, aes(x = 0, y = 0, xend = xend, yend = yend),
                     arrow = arrow(length = unit(.03, "npc")), color = "red", lwd = 1.5) + coord_equal()

Created on 2021-08-25 by the reprex package (v2.0.1)

The closest I can get with ordr:

library(ggplot2)
    library(ggbiplot)
#> Loading required package: plyr
#> Loading required package: scales
#> Loading required package: grid
    library(ordr)
#> 
#> Attaching package: 'ordr'
#> The following object is masked from 'package:ggbiplot':
#> 
#>     ggbiplot
    df <- data.frame(x = c(5, 8, 9), y = c(4, 13, 6))
    pca_ordr <- ordinate(df, cols = 1:2, model = ~ prcomp(., scale. = TRUE))
    ggbiplot(pca_ordr) +
        xlim(c(-2, 2)) +
        ylim(c(-2, 2)) +
        geom_rows_point() +
        geom_rows_text(aes(label = 1:5), nudge_x = .2) +
        geom_cols_vector(color = "red", lwd = 1.5) + 
        geom_cols_axis() +
        geom_cols_axis_ticks(aes(center = .center, scale = .scale)) +
        geom_cols_axis_text(aes(center = .center, scale = .scale)) +
        geom_cols_axis_label(aes(label = .name))
#> Warning: Ignoring unknown aesthetics: center, scale

#> Warning: Ignoring unknown aesthetics: center, scale
#> Warning: Ignoring unknown aesthetics: label

Created on 2021-08-25 by the reprex package (v2.0.1)

I should also note that it would be helpful to be able to plot one axis at a time since the graph gets messy with multiple axes, though that might be challenge in the ggplot2 framework.

corybrunson commented 3 years ago

@jtr13 thank you for these details!

I'm going to reassess what features should go into a first release, and you've got me thinking that the projection geom is one that should be put back on that list. (Meanwhile, the predict = TRUE option for ggbiplot() is perhaps too ambitious, and i'm planning to remove it for the first release.) I'll come back to work on it ASAP.

Ditto the ability to specify what row or column elements should be plotted by each layer, as you mention at the end. An experimental solution is the subset parameter, which is implemented in GeomIsolines$setup_data and recycled by a few other layers, not including GeomAxis. I'll add an issue to expand this to all plot layers.

corybrunson commented 3 years ago

Correction and clarification: Thanks to Gower &al's books, i should have remembered that projection onto an axis is a prediction step that calls for a prediction biplot, whereas vector addition along the axes is an interpolation step that makes use of the (currently implemented) interpolation biplot. (Interpolation axis unit vectors are located analogously to case markers, whereas prediction axes must be rescaled in the linear setting and completely redrawn in the non-linear setting.) This issue should really be about rendering both sets of steps, each being appropriate for one type of biplot. Vector addition should come first, for the first CRAN release.

corybrunson commented 3 years ago

In the conventional use of PCA with rows of cases and columns of variables, the vector sum geom could look like geom_cols_addition(rows = new_data). The layer would inherit the column (standard) coordinates from ggbiplot(), and the rows (or perhaps new_data) parameter would accept rows of data with the same variables as the original table fed to, e.g., princomp(), or a list of variable values, e.g. list(mpg = 28, cyl = 4L, ...). If some values are missing or NA but the original PCA is centered, then the interpolated points would correspond to mean imputations.

It shouldn't be much extra work to include an option type = c("centroid", "sequence") to control whether the vectors along the axes are added in sequence or their centroid is scaled by their number.

corybrunson commented 3 years ago

Experimental work is ongoing in the vecsum branch.

corybrunson commented 2 years ago

Update: Based on further reading, and on the counterintuitiveness and complexity of the experimental implementation, i think both of these annotations should work as follows:

  1. Make generics and S3 class methods to calculate layer data (like the output of Stat*$compute_*()) for predictive layers, interpolation vector sums, prediction projections, and other geometric elements derived from matrix algebra.
  2. Make a pronoun (.O()?) for the underlying ordination model, analogous to .G() in tidygraph, by which to access these methods from within a ggproto build.
  3. Make new statistical transformations StatRows* and StatCols* that use this pronoun to calculate layer data.

The reason for a change in approach is that the 'tbl_ord' class is designed for display (print, summary, annotation, plot), and ggbiplot() itself can only see the fortified tibble, whereas the underlying ordination classes, which are designed for modeling (interpolate, predict), are the proper setting for these tasks. For example, predictive MDS biplots use nonlinear axes while predictive LDA biplots use Voronoi cells, and the S3 methods can be tailored to these needs.

I think this deserves to wait until feedback is collected from a first CRAN release.

A compensatory benefit will be the new generics and methods themselves, which might be used in base R biplots or for other purposes. To my knowledge, there are no standalone implementations of them; they are only built inaccessibly into visualization tools like those of Gower &al.