gavinsimpson / gratia

ggplot-based graphics and useful functions for GAMs fitted using the mgcv package
https://gavinsimpson.github.io/gratia/
Other
206 stars 28 forks source link

draw() error when trying to plot 2d smooths with bs='sz' or 'fs' #249

Closed chrishaak closed 9 months ago

chrishaak commented 9 months ago

I'm fitting an HGAM with a global 2d smooth and a factor smooth interaction of the 2d smooth (via bs='sz' or 'fs'). For example:

library("gratia")
library("mgcv")
data("iris")

mod2 <- gam(Petal.Width ~ 
              s(Sepal.Length, Sepal.Width, bs='ds', m=c(1,0.5)) 
            + s(Species, Sepal.Length, Sepal.Width, bs='sz', xt=list(bs='ds', m=c(1,0.5))),
            method = "REML",
            data = iris)

mod3 <- gam(Petal.Width ~ 
              s(Sepal.Length, Sepal.Width, bs='ds', m=c(1,0.5)) 
            + s(Species, Sepal.Length, Sepal.Width, bs='fs', xt=list(bs='ds', m=c(1,0.5))),
            method = "REML",
            data = iris)

When attempting to plot these with draw(), I receive (altogether different) errors for both. For the 'sz' smooth, I am getting:

Error in map(): ℹ In index: 1. Caused by error in mutate(): ℹ In argument: .sz_var = interaction(object[variables[fs]], sep = ":", lex.order = TRUE). Caused by error: ! .sz_var must be size 150 or 1, not 30000.

while for the 'fs' smooth I am getting:

Error in seq.default(from = min(x, na.rm = TRUE), to = max(x, na.rm = TRUE), : 'length.out' must be a non-negative number

If I omit the second (factor-smooth interaction) term via select=c(-2) in the call to draw(), the first term plots fine for both of the fits.

Finally, using the 'by' argument (instead of 'fs' or 'sz'), both terms plot fine. for example:

mod4 <- gam(Petal.Width ~ 
              s(Sepal.Length, Sepal.Width, bs='ds', m=c(1,0.5)) 
            + s(Sepal.Length, Sepal.Width, bs='ds', m=c(1,0.5), by=Species),
            method = "REML",
            data = iris)

...draws both terms as expected. So it seems to be an issue with the 'fs' and 'sz' in particular...

Apologies if I am missing something here?

Thanks! Chris

gavinsimpson commented 9 months ago

Thanks for the report and easy to implement examples to reproduce the issue. I think the fundamental issue is that I didn't consider 2d TPR or Duchon splines when I wrote the interval plotters for the sz and fs basis.

Would you mind adding the output from sessionInfo() (or session info::session_info() if you have it) to your report.

I'll confirm what's going on later this week – both models should be expected to work with gratia – and see what needs to be changed to make them work.

chrishaak commented 9 months ago

Thanks for the lightning-fast response Gavin. Sorry, I actually had the sessioninfo in my code but forgot to paste it!

R version 4.2.3 (2023-03-15) Platform: x86_64-conda-linux-gnu (64-bit) Running under: Ubuntu 20.04.6 LTS

Matrix products: default BLAS: /home/christopher/miniconda3/envs/r423_blis/lib/libblis.so.4.0.0 LAPACK: /home/christopher/miniconda3/envs/r423_blis/lib/liblapack.so.3.9.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] mgcv_1.9-0 nlme_3.1-163 gratia_0.8.2

loaded via a namespace (and not attached): [1] Rcpp_1.0.12 pillar_1.9.0 compiler_4.2.3 RColorBrewer_1.1-3 tools_4.2.3 mvnfast_0.2.8 lifecycle_1.0.4
[8] tibble_3.2.1 gtable_0.3.4 lattice_0.22-5 pkgconfig_2.0.3 rlang_1.1.2 Matrix_1.6-2 cli_3.6.1
[15] rstudioapi_0.15.0 patchwork_1.1.3 withr_2.5.2 dplyr_1.1.3 stringr_1.5.0 generics_0.1.3 vctrs_0.6.4
[22] isoband_0.2.7 grid_4.2.3 tidyselect_1.2.0 glue_1.6.2 R6_2.5.1 fansi_1.0.5 ggplot2_3.4.4
[29] purrr_1.0.2 tidyr_1.3.0 farver_2.1.1 magrittr_2.0.3 scales_1.2.1 splines_4.2.3 colorspace_2.1-0
[36] labeling_0.4.3 utf8_1.2.4 stringi_1.8.1 munsell_0.5.0

gavinsimpson commented 9 months ago

Thanks for the report. As I suspected, the different errors were ultimately due the underlying code not knowing how to handle multivariate base smoothers.

"fs" smooths

The "fs" smooth issue raised a different error because I was not careful enough to handle users specifying the factor anywhere in the smooth definition. I was assuming users would use s(x1, x2, f) so factor last. I have fixed this particular issue such that if you use s(f, x1, x2) or s(x1, f, x2) smooth_estimates() will now work correctly.

"sz" smooths

The error with that model was purely due to the code not anticipating a multivariate base smoother.

I now catch this and do something rather than let the obscure error pass through.

… however

None of this actually helps you as in both cases {gratia} can't currently do anything useful from the plot side of things.

Right now I am just emitting messages for both your model examples:

r$> draw(i_fs)                                                              
Can't currently plot multivariate 'fs' smooths.
Skipping: s(Sepal.Length,Sepal.Width,Species)

r$> draw(i_sz)                                                              
Can't currently plot multivariate 'sz' smooths.
Skipping: s(Species,Sepal.Length,Sepal.Width)

but it will plot any other smooths it can plot.

It's not clear to me what to plot in the fs case. I could plot a few of the smooth surfaces, randomly selecting which to plot? For the sz case, I could plot the smooth "difference" surfaces, but how to actually do this in the draw() output? Should I treat it like a by smooth and draw a separate plot for each surface, or should I group them so that I only generate a single plot but use facetting to show the difference-from-reference surfaces for each level of the factor(s)?

A similar issue crops up with HGAMs with random tensor product smooths: t2(x1, x2, f, bs = c("cr", "cr", "re"), full = TRUE). Which of these should I plot and how?

So, back to you @chrishaak. I assume you weren't modelling the iris data in reality, so what would you like to see plotted in both your model specifications?

I'll push fixed code shortly; you'll need to install from github or r-universe (if it builds; there's an issues with the dependency on the Matrix pkg which I need to figure out there) when I have pushed. Check that you get version >= 0.8.2.58. At least then you'll be able to use smooth_estimates() to evaluate the smooths and then you can choose to plot them however you wish using {ggplot2}.

chrishaak commented 9 months ago

That's funny- I normally put the factor last in the 'fs' smooths but I vaguely remember there was a requirement somewhere along the way that I needed to put it first to plot 'sz' smooths? I guess that backfired in the 'fs' case...

For the 'sz' case, I was thinking along the lines that you suggest above, plotting the difference surface for each level of species. While I conceptually like the idea of being able to see the differences overlayed/plotted together (as it works so nicely in the 1-D 'sz' case), I'm afraid that this could get messy in the 2D case (particularly when there are many levels). So, from the perspective of legibility/interpretability, it might make sense to keep these plots independent (per-level), basically mimicking what you do in the 'by' smooth?. In my particular situation, I don't have a ton of levels, so it's not not an issue, but understanding that your code needs to be generalizable, maybe adopt an approach like "if nlevels > X, then draw a random sample of X levels", where X defaults to some sane value (8?) And if people want to plot them all or look at specific levels, that can be user-selectable, either by setting X higher, or requesting specific levels via a vector?

I'd probably adopt a similar approach for 'fs', just mimicking the 'by' smooth.

That's my 2 cents anyway - hope it's useful.

Thanks so much once again for all your help!

On Fri, Feb 2, 2024 at 6:37 AM Gavin Simpson @.***> wrote:

Thanks for the report. As I suspected, the different errors were ultimately due the underlying code not knowing how to handle multivariate base smoothers. "fs" smooths

The "fs" smooth issue raised a different error because I was not careful enough to handle users specifying the factor anywhere in the smooth definition. I was assuming users would use s(x1, x2, f) so factor last. I have fixed this particular issue such that if you use s(f, x1, x2) or s(x1, f, x2) smooth_estimates() will now work correctly. '"sz"` smooths

The error with that model was purely due to the code not anticipating a multivariate base smoother.

I now catch this and do something rather than let the obscure error pass through. … however

None of this actually helps you as in both cases {gratia} can't currently do anything useful from the plot side of things.

Right now I am just emitting messages for both your model examples:

r$> draw(i_fs) Can't currently plot multivariate 'fs' smooths. Skipping: s(Sepal.Length,Sepal.Width,Species)

r$> draw(i_sz) Can't currently plot multivariate 'sz' smooths. Skipping: s(Species,Sepal.Length,Sepal.Width)

but it will plot any other smooths it can plot.

It's not clear to me what to plot in the fs case. I could plot a few of the smooth surfaces, randomly selecting which to plot? For the sz case, I could plot the smooth "difference" surfaces, but how to actually do this in the draw() output? Should I treat it like a by smooth and draw a separate plot for each surface, or should I group them so that I only generate a single plot but use facetting to show the difference-from-reference surfaces for each level of the factor(s)?

A similar issue crops up with HGAMs with random tensor product smooths: t2(x1, x2, f, bs = c("cr", "cr", "re"), full = TRUE). Which of these should I plot and how?

So, back to you @chrishaak https://github.com/chrishaak. I assume you weren't modelling the iris data in reality, so what would you like to see plotted in both your model specifications?

I'll push fixed code shortly; you'll need to install from github or r-universe (if it builds; there's an issues with the dependency on the Matrix pkg which I need to figure out there) when I have pushed. Check that you get version >= 0.8.2.58.

— Reply to this email directly, view it on GitHub https://github.com/gavinsimpson/gratia/issues/249#issuecomment-1923632856, or unsubscribe https://github.com/notifications/unsubscribe-auth/AORLQXDKJ3MPUJDDSQPVIL3YRTFZHAVCNFSM6AAAAABCTL6RJSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMRTGYZTEOBVGY . You are receiving this because you were mentioned.Message ID: @.***>

gavinsimpson commented 9 months ago

...I vaguely remember there was a requirement somewhere along the way that I needed to put it first to plot 'sz' smooths?

I also thought this was the case, but it turns out you can put it anywhere.

I'll see what I can do. Right now, by smooths work easily because from mgcv's point of view these are really just entirely separate smooths - they are a separate entry in the $smooth list that is return by gam(), bam() etc. "fs" and "sz" smooths are different; there is just one element in $smooth for the entire "fs" or "sz" smooth. This is fine with univariate base smooths as I can produce one plot showing all levels (even if it gets messy, with the sz basis in particular), but that won't work for surfaces as I can't plot them on top of one another.

The solution I have used with trivariate and quadvariate smooths is to use facetting; the one "plot" would then have multiple facets, one per level of the factor.

Otherwise I'll need to look at how to return multiple ggplot objects for a single smooth if I am to follow the by smooth convention...

chrishaak commented 9 months ago

Gotcha... seems like facetting is the way to go then, if it is relatively straightforward to implement.

While we're on the topic (feel free to let me know if you'd rather I start a new issue here)...

I'm fitting to scaled covariate data (using 'scale'), but I'd like to see the "real" (back-transformed) predictor values in my visualizations. Short of going into each of the smooth estimates and manually replacing the x values, do you see any intelligent way to do this "across the board" using draw(), given the list of scaling info output by 'scale'? (I dug around a little and didn't see anything, but may have overlooked something)?

On Fri, Feb 2, 2024 at 10:50 AM Gavin Simpson @.***> wrote:

...I vaguely remember there was a requirement somewhere along the way that I needed to put it first to plot 'sz' smooths?

I also thought this was the case, but it turns out you can put it anywhere.

I'll see what I can do. Right now, by smooths work easily because from mgcv's point of view these are really just entirely separate smooths - they are a separate entry in the $smooth list that is return by gam(), bam() etc. "fs" and "sz" smooths are different; there is just one element in $smooth for the entire "fs" or "sz" smooth. This is fine with univariate base smooths as I can produce one plot showing all levels (even if it gets messy, with the sz basis in particular), but that won't work for surfaces as I can't plot them on top of one another.

The solution I have used with trivariate and quadvariate smooths is to use facetting; the one "plot" would then have multiple facets, one per level of the factor.

Otherwise I'll need to look at how to return multiple ggplot objects for a single smooth if I am to follow the by smooth convention...

— Reply to this email directly, view it on GitHub https://github.com/gavinsimpson/gratia/issues/249#issuecomment-1924147693, or unsubscribe https://github.com/notifications/unsubscribe-auth/AORLQXGY62FC6JD3HOZSHULYRUDN3AVCNFSM6AAAAABCTL6RJSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMRUGE2DONRZGM . You are receiving this because you were mentioned.Message ID: @.***>

gavinsimpson commented 9 months ago

I'm fitting to scaled covariate data (using 'scale'), but I'd like to see the "real" (back-transformed) predictor values in my visualizations

I assume you mean that you have y ~ s(x) where x is the result of x <- scale(x_orig)[,1]? There isn't a good solution to that currently. Right now I would suggest that you just evaluate the smooths with smooth_estimates(), and then build your own plots. I'm looking at ways to make building your own plots that look like draw()s plots even easier.

Until then, you could modify the output from smooth_estimates() and then use its draw() method:

library("dplyr")
library("mgcv")
library("gratia")

df <- data_sim("eg1") |>
  mutate(x1_scl = (x1 - mean(x1)) / sd(x1))
m <- gam(y ~ s(x0) + s(x1_scl) + s(x2) + s(x3), data = df, method = "REML")

sm <- smooth_estimates(m)

# unscale x1_scl
x1_mean <- with(df, mean(x1))
x1_sd <- with(df, sd(x1))

sm <- sm |>
  mutate(x1_scl = (x1_scl * x1_sd) + x1_mean)
draw(sm)

You don't get things like partial residuals or a rug plot, but it is pretty close to what draw() will produce (as draw.gam() calls the draw.smooth_estimates() for each smooth internally).

chrishaak commented 9 months ago

Yep that's correct. Thanks for the suggestion - that is more I less what I've been doing (albeit less efficiently than your approach!).

Wonder if it would be feasible to allow one to supply x-scaling attributes somehow in a call to draw(), so it could apply them "on the fly"?

On Fri, Feb 2, 2024 at 11:22 AM Gavin Simpson @.***> wrote:

I'm fitting to scaled covariate data (using 'scale'), but I'd like to see the "real" (back-transformed) predictor values in my visualizations

I assume you mean that you have y ~ s(x) where x is the result of x <- scale(x_orig)[,1]? There isn't a good solution to that currently. Right now I would suggest that you just evaluate the smooths with smooth_estimates(), and then build your own plots. I'm looking at ways to make building your own plots that look like draw()s plots even easier.

Until then, you could modify the output from smooth_estimates() and then use its draw() method:

library("dplyr") library("mgcv") library("gratia")

df <- data_sim("eg1") |> mutate(x1_scl = (x1 - mean(x1)) / sd(x1)) m <- gam(y ~ s(x0) + s(x1_scl) + s(x2) + s(x3), data = df, method = "REML")

sm <- smooth_estimates(m)

unscale x1_scl

x1_mean <- with(df, mean(x1)) x1_sd <- with(df, sd(x1))

sm <- sm |> mutate(x1_scl = (x1_scl * x1_sd) + x1_mean) draw(sm)

You don't get things like partial residuals or a rug plot, but it is pretty close to what draw() will produce (as draw.gam() calls the draw.smooth_estimates() for each smooth internally).

— Reply to this email directly, view it on GitHub https://github.com/gavinsimpson/gratia/issues/249#issuecomment-1924202491, or unsubscribe https://github.com/notifications/unsubscribe-auth/AORLQXBMTFCVZR4S7QSJ4ZTYRUHEDAVCNFSM6AAAAABCTL6RJSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMRUGIYDENBZGE . You are receiving this because you were mentioned.Message ID: @.***>