Extrapolation using a model extracted from `geom_smooth()` function

sinaneza commented 7 years ago

Dear all

I have a major trouble in extrapolating by a model (loess(), lm(), ...). I know that interpolation is possible using predict(). However, I am really experiencing a trouble with extrapolation.

Furthermore, I don't know how I can extract the function by which the curve in geom_smooth() is drawn.

I would appreciate if anyone could help me with these issues.

@jennybc @daattali @sjackman @oganm

jennybc commented 7 years ago

If you want to have the fitted model that you see in ggplot2, you will have to fit it yourself, rather than rely on ggplot2 to do it.

As for "fit then predict", a small example showing your problem is needed.

sjackman commented 7 years ago

Hi, Sina. See R for Data Science by Hadley Wickham and Garrett Grolemund

https://www.amazon.ca/Data-Science-Transform-Visualize-Model/dp/1491910399
http://r4ds.had.co.nz/model-basics.html#predictions
https://github.com/hadley/r4ds/blob/master/model-basics.Rmd#predictions
and the exercises of §23.3.3

sinaneza commented 7 years ago

@sjackman Thank you very much Shawn I need to read through the links. In the case of having troubles with the instructions, I will let you know by a comment on this issue.

sinaneza commented 7 years ago

Dear @jennybc

To my point of view, your suggestion is just applicable when you want to fit the model to the interval in which your data frame is spanned (interpolation).

However, in the case that I want to find an information for a value outside the interval in which data is spanned, predict() fails.

The code is as follow:

rm(list = ls())
library("tidyverse")
mtcars
as_tibble(mtcars)
model <- loess(wt ~ hp, data=mtcars, normalize = FALSE)
xrange <- range(mtcars$hp)
xseq <- seq(from=xrange[1], to=xrange[2], length=100)
pred <- predict(model, newdata = data.frame(hp = xseq), se=TRUE)
y <-  pred$fit
ci <- pred$se.fit * qt(0.95 / 2 + .5, pred$df)
ymin = y - ci
ymax = y + ci
loess.DF <- data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)

ggplot(mtcars, aes(x=hp, y=wt)) +
    geom_point() +
    geom_line(aes(x = x, y= y), stat = "identity", data = select(loess.DF, c(x, y)))
    # geom_smooth(aes_auto(loess.DF), data=loess.DF, stat="identity")

ggplot(mtcars, aes(x=hp, y=wt)) +
    geom_point() +
    geom_smooth()

xseq2 <- seq(from = xrange[2], to = xrange[2]+300, length = 80)

pred2 <- predict(model, newdata2 = data.frame(hp = xseq2), se = TRUE)
y2 <- pred2$fit

y2
loess.DF2 <- data.frame(x = xseq2, y = y2)

sjackman commented 7 years ago

As far as I understand it, the same code should work for both interpolation and extrapolation, in that you'll get a prediction. Whether the prediction makes any sense in the real world depends on your model.

Could you please copy-and-paste the error message that you're seeing? Use https://github.com/tidyverse/reprex

sinaneza commented 7 years ago

Dear @sjackman

The error is:

Error in data.frame(x = xseq2, y = y2) : arguments imply differing number of rows: 80, 32

Which means that it had not computed wt for the the xseq2 since their size are totally different. Infact, model$fit gives us exactly the same vector as y2 in the previously given code

sinaneza commented 7 years ago

@sjackman In fact, in the case that it had computed wt correctly for xseq2, it should have the same length as xseq2.

sjackman commented 7 years ago

Try using modelr::add_predictions

sinaneza commented 7 years ago

Dear @sjackman

To my understanding so far, the issue was not related to predict() and add_predict(). The issue was the model selection, as I selected the "polynomial function" for my model it was able to do extrapolation. However, changing the model to "B-spline" and "local regression", extrapolation gives 'N/A' as the result.

BTW, I was wondering to know whether "local regression" and "B-Spline" methods are just applicable to interpolation or extrapolation is also applicable to them

The applied code to find the mentioned issue is as follow:

rm(list = ls())
library("tidyverse")
mtcars
as_tibble(mtcars)

model <- loess(wt ~ hp, data=mtcars)
model2 <- lm(wt ~ splines::bs(hp, 3), data = mtcars)
model3 <- lm(wt ~ poly(hp, 3), data = mtcars)
xrange <- range(mtcars$hp)
xseq <- seq(from=xrange[1], to=xrange[2], length=100)
pred <- predict(model, newdata = data.frame(hp = xseq), se=TRUE)
y <-  pred$fit
ci <- pred$se.fit * qt(0.95 / 2 + .5, pred$df)
ymin = y - ci
ymax = y + ci
loess.DF <- data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)

ggplot(mtcars, aes(x=hp, y=wt)) +
    geom_point() +
    geom_line(aes(x = x, y= y), stat = "identity", data = select(loess.DF, c(x, y)))
    # geom_smooth(aes_auto(loess.DF), data=loess.DF, stat="identity")

ggplot(mtcars, aes(x=hp, y=wt)) +
    geom_point() +
    geom_smooth()

xseq2 <- seq(from = xrange[2], to = xrange[2]+300, length = 80)

pred2 <- predict(model, newdata2 = data.frame(hp = xseq2), se = TRUE)
y2 <- pred2$fit

y2
loess.DF2 <- data.frame(x = xseq2, y = y2)

XSeq_DF <- data_frame(hp = xseq)
XSeq2_DF <- data_frame(hp = xseq2)
library(modelr)
add_predictions(XSeq_DF, model)
add_predictions(XSeq2_DF, model)

add_predictions(XSeq_DF, model2)
add_predictions(XSeq2_DF, model2)

A <- add_predictions(XSeq_DF, model3)
add_predictions(XSeq2_DF, model3)

predict(model3, XSeq2_DF)

X_DF <- data_frame(hp = seq(from = xrange[1], to = xrange[2]+300, length = 200)) %>% 
    add_predictions(model3)
filter(X_DF, hp>=335)
ggplot(data = mtcars, aes(x = hp, y = wt)) + 
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ poly(x, 3)) +
    geom_line(data = A,
                        aes(x = hp, y = pred),
                        colour = "red") +
    geom_line(data = filter(X_DF, hp < 400),
                        aes(x = hp, y = pred),
                        colour = "green")

sjackman commented 7 years ago

The issue was the model selection, as I selected the "polynomial function" for my model it was able to do extrapolation. However, changing the model to "B-spline" and "local regression", extrapolation gives 'N/A' as the result.

Ah, I see. Yes it seems reasonable that some models may be incapable of extrapolation. I'm afraid that I'm out of my comfort zone here though. Sorry. Did polynomial function work for you, then?

BTW, I was wondering to know whether "local regression" and "B-Spline" methods are just applicable to interpolation or extrapolation is also applicable to them

I'm afraid that I don't know the answer. :pensive:

When the fit was made using surface = "interpolate" (the default), predict.loess will not extrapolate – so points outside an axis-aligned hypercube enclosing the original data will have missing (NA) predictions and standard errors.

See https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.loess.html

STAT545-UBC / Discussion

Extrapolation using a model extracted from `geom_smooth()` function #491