chjackson / flexsurv

The flexsurv R package for flexible parametric survival and multi-state modelling
http://chjackson.github.io/flexsurv/

Develop out S3 methods for predict and broom generics #75

Closed mattwarkentin closed 4 years ago

mattwarkentin commented 4 years ago

Hi @chjackson,

So I am going to try and summarize the "TODO"s based on previous discussions. It seems like most of the computations needed for this method are already implemented via summary(), so it could be just a matter of wrapping summary() and making sure the returned object adheres to the principles described in #39.
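
A rough sketch of what that wrapping could look like, purely for discussion: the argument names (new_data, type, times) are placeholders rather than a settled interface, and it assumes summary() accepts newdata, type, t, ci and tidy arguments as in recent flexsurv versions.

# Illustrative only: delegate to summary.flexsurvreg() and return a tibble.
predict.flexsurvreg <- function(object, new_data = NULL, type = "survival",
                                times = NULL, ...) {
  # assumes the fitted object retains its data, so model.frame() can recover it
  if (is.null(new_data)) new_data <- model.frame(object)
  out <- summary(object, newdata = new_data, type = type, t = times,
                 ci = FALSE, tidy = TRUE)
  tibble::as_tibble(out)
}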

Here are some of the design considerations I gleaned from the other thread:

I am now looking over the source code here to see how this aligns with the above comments.

Look forward to hearing your thoughts.

mattwarkentin commented 4 years ago

I've forked the repo and I'm starting to play around with some of the changes. I do have some questions for you. Happy to discuss in this thread, but we could also do a Zoom chat if that's easier.

chjackson commented 4 years ago

Thanks for looking at this. That all sounds sensible. If you concentrate on handling those list columns and documenting how users should deal with them, then I'll handle any loose ends.

I don't see what's wrong with new_data defaulting to the data used for the original fit. @topepo , do you know why this was recommended against? Generally I'm happy to follow the tidymodels guidelines if I can understand the rationale for them!

topepo commented 4 years ago

There are a few (debatable) reasons.

mattwarkentin commented 4 years ago

The first point is well taken, but I don't imagine there is any desire to remove the training data from the flexsurvreg model object, so the data are available to use anyway. If storing the data is bloating the model object, we could think about writing a method for the generics defined in the {butcher} package to remove the data or other large objects.
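
To make that concrete, here is a hypothetical method for butcher's axe_data() generic; it assumes the training data live in x$data, and a real method would probably also need butcher's usual class/attribute bookkeeping.

# Hypothetical sketch only, not an existing flexsurv method.
axe_data.flexsurvreg <- function(x, verbose = FALSE, ...) {
  x$data <- NULL  # drop the stored training data to shrink the object
  x
}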

The second point, while I definitely agree with the logic, does seem a bit hand-holdy. I guess it depends on how opinionated Chris wants this function to be in regards to gently pushing users toward better modeling practices.

chjackson commented 4 years ago

Yes, I see the reason for not storing the data in the model object. Though, as Matt says, flexsurv has always done this, perhaps unwisely, so it's probably too late to change now, and we might as well make use of it for predict.flexsurvreg.

I don't understand the second point though. In typical implementations, e.g. predict.lm and predict.glm(..., type="response"), what is returned by default is not the data itself but the fitted values from the model. Those are useful for checking model fit, though there is a different generic, stats::fitted(), that might be more appropriate for that. What is the bad practice?
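
For reference, a tiny base R check of that behaviour (nothing flexsurv-specific):

# With lm, the default predict() output on the training data equals the
# in-sample fitted values returned by fitted().
fit <- lm(mpg ~ wt, data = mtcars)
all.equal(predict(fit), fitted(fit))  # TRUE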

topepo commented 4 years ago

In general, those fitted values are optimistically biased since the model is just re-predicting the data it was trained on. For low-bias models (like simple linear models) the bias is manageable; otherwise, it can lead to highly optimistic assessments of performance. This is fairly well known, but a good example can be found here. As an overall policy, we eschew doing this.

Splines could lead to overfitting if someone were to turn up the degrees of freedom to a large value.

mattwarkentin commented 4 years ago

@chjackson Once predict.flexsurvreg() is up and running, is there interest in developing a residuals.flexsurvreg() method (akin to those for coxph, survreg, etc.)?

I started to make a first pass at writing up some of the {broom} methods for tidy.flexsurvreg(), glance.flexsurvreg(), and augment.flexsurvreg() and it seems to me that adding an S3 method for the residuals generic might be a logical next step. Thoughts?
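
To make the idea concrete, here is a purely illustrative sketch of a response-type residuals method (observed time minus fitted median survival). Everything in it is an assumption rather than a settled design, including that summary() supports type = "median" with tidy output and that the Surv() response is the first column of the model frame.

residuals.flexsurvreg <- function(object, type = c("response"), ...) {
  type <- match.arg(type)
  mf <- model.frame(object)
  # fitted median survival time for each row of the original data
  fitted_median <- summary(object, newdata = mf, type = "median",
                           tidy = TRUE, ci = FALSE)$est
  observed <- mf[[1]][, "time"]  # event/censoring times from the Surv object
  observed - fitted_median
}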

topepo commented 4 years ago

We just updated the docs on making custom tidiers. A new version of broom will be out in a few months too. When you have a PR for this, can you tag me and @simonpcouch (our broom intern) so that they are future-proofed?

mattwarkentin commented 4 years ago

@topepo Sure, no problem. I'll read over the new broom docs before submitting the PR and I will let you both know once it's submitted.

chjackson commented 4 years ago

Thanks both - this is interesting stuff. flexsurv needs better tools for model checking. Typical practice, I think, is just to plot fitted survival curves against Kaplan-Meier estimates and compare AIC. I don't know much about residuals in survival analysis, but I sometimes get asked. So a residuals method would be good, along with a user guide about how they should be used sensibly as part of a model assessment workflow. I think there is a place for in-sample model checking, particularly in the smaller-data settings that flexsurv is widely used in - you can quickly rule out bad models and get an idea of how they might be improved.

mattwarkentin commented 4 years ago

@chjackson I have a question about how the tidy.flexsurvreg(...) method should work. Here is what I have at the moment:

library(flexsurv)
fitg <- flexsurvspline(Surv(futime, fustat) ~ age, data = ovarian, k = 5)
tidy(fitg)
# A tibble: 8 x 5
  term   estimate std.error statistic p.value
  <chr>     <dbl>     <dbl>     <dbl>   <dbl>
1 gamma0  -32.3     13.6       -2.38  0.0142 
2 gamma1    4.00     2.80       1.43  0.0850 
3 gamma2    5.15     4.22       1.22  0.119  
4 gamma3  -18.5     47.5       -0.390 0.351  
5 gamma4   14.4    103.         0.141 0.445  
6 gamma5  -19.9    140.        -0.142 0.444  
7 gamma6   28.9     90.6        0.319 0.377  
8 age       0.175    0.0510     3.43  0.00150

Users can also specify additional options to get CIs and to exponentiate the coefficients to get HRs.

tidy(fitg, conf.int = TRUE, conf.level = 0.95, exponentiate = TRUE)
# A tibble: 8 x 7
  term   estimate std.error statistic p.value  conf.low conf.high
  <chr>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>     <dbl>
1 gamma0 9.57e-15   7.71e 5    -2.38  0.0142  2.77e- 26 3.30e-  3
2 gamma1 5.48e+ 1   1.65e 1     1.43  0.0850  2.26e-  1 1.33e+  4
3 gamma2 1.72e+ 2   6.79e 1     1.22  0.119   4.42e-  2 6.70e+  5
4 gamma3 9.05e- 9   4.37e20    -0.390 0.351   3.18e- 49 2.57e+ 32
5 gamma4 1.85e+ 6   3.32e44     0.141 0.445   1.02e- 81 3.37e+ 93
6 gamma5 2.30e- 9   5.69e60    -0.142 0.444   1.92e-128 2.75e+110
7 gamma6 3.39e+12   2.16e39     0.319 0.377   2.74e- 65 4.20e+ 89
8 age    1.19e+ 0   1.05e 0     3.43  0.00150 1.08e+  0 1.32e+  0

However, I don't know that it is reasonable to exponentiate the gammas when requesting exponentiated coefficients. And perhaps it's not even sensible to return the gammas along with the coefs. This also applies to flexsurvreg() objects, which return other non-beta parameters depending on the distribution chosen. We could follow suit with mixed models, whereby one can request different components of the model separately (see below). What are your thoughts?

model <- mixed_model(...)
tidy(model, effects = "fixed")
tidy(model, effects = "random")

Maybe something like...

tidy(fitg, effect = "coefs")
tidy(fitg, effect = "dist")

Side note: the statistic and p.value columns are quick hacks I added using Wald-type test statistics (estimate / std.err) and p-values based on N - k degrees of freedom from a t-distribution. This is completely liable to be changed or removed. I was just trying to match the expected column names for other {broom} tidiers.
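
For concreteness, the hack described above amounts to something like the following (illustrative only; wald_stats is not a real flexsurv or broom function):

wald_stats <- function(estimate, std.error, N, k) {
  statistic <- estimate / std.error
  # one-sided upper-tail p-value on N - k degrees of freedom; with N = 26 and
  # k = 8 this roughly reproduces the p.value column in the output above
  p.value <- pt(abs(statistic), df = N - k, lower.tail = FALSE)
  data.frame(statistic, p.value)
}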

mattwarkentin commented 4 years ago

Here is what I've got working...

fitg <- flexsurvreg(Surv(futime, fustat) ~ age, data = ovarian, dist = "gengamma")
tidy(fitg, effect = "coefs") # this is the default, effect can be omitted
# A tibble: 1 x 5
  term  estimate std.error statistic p.value
  <chr>    <dbl>     <dbl>     <dbl>   <dbl>
1 age    -0.0875    0.0250     -3.50 0.00101
tidy(fitg, effect = "distp") # distp for distribution parameters (name can be changed)
# A tibble: 3 x 5
  term  estimate std.error statistic     p.value
  <chr>    <dbl>     <dbl>     <dbl>       <dbl>
1 mu      11.7       1.66      7.05  0.000000227
2 sigma   -0.286     0.244    -1.17  0.127      
3 Q        0.295     0.912     0.323 0.375

chjackson commented 4 years ago

Thanks. I think it would make sense for a tidy method to present by default the things that are printed by default by print.flexsurvreg, and to be generous in what is included by default, as long as it makes sense. So I'd include confidence intervals by default, as well as standard errors, since intervals are easier to interpret than standard errors. And I'd return both the baseline parameters of the survival distribution and the covariate effects by default. But it makes sense to have options to control what parameters or summaries are returned, as you've done.

About the scales: by default print.flexsurvreg presents baseline parameters on their natural (non-log) scales and covariate effects on the real-line scale (log), but I think there should be an option to exponentiate covariate effects, or transform the baseline parameters to their log scales (as stored in the $res.t component of flexsurvreg objects). Not sure on the best names for those arguments though.

I would definitely not return test statistics and p-values for things other than covariate effects. For baseline survival model parameters it usually doesn't make sense to test the hypothesis that the parameter is zero. The exception is things like testing whether the shape parameter of a Weibull or Gamma is 1, because that gives the exponential distribution, but there's currently no information stored to say which parameters have this property. It's always bugged me how lm, glm and the like present p-values for intercepts in regression models. I've so often seen people unthinkingly pasting software output into their papers, without considering whether that p=0.000001 is meaningful...

mattwarkentin commented 4 years ago

Thanks for the detailed response. My initial concern with returning all of the distribution parameters and covariate effects together was that the estimate column would contain values on two scales by default (non-log and log for distribution parameters and covariates, respectively). At that point, it only makes sense to exponentiate the log covariate effects, while it only makes sense to log the non-log distribution parameters (x$res.t).

We could return the distribution parameters on their log scale by default (x$res.t); that way, when users exponentiate the estimates, they end up with distribution parameters on their natural scale and, for example, hazard ratios. What do you think about that?

I tend to agree with you about not including test statistics and p-values for hypotheses not worth testing. I can set these to NA for distribution parameters and only keep test statistics and p-values for covariate effects.

mattwarkentin commented 4 years ago

I now see that not all of the distribution parameters are log-transformed in x$res.t, so blanket exponentiation doesn't make sense.

fitg$res
              est       L95%        U95%        se
mu    11.67353273  8.4262007 14.92086481 1.6568325
sigma  0.75128848  0.3975646  1.41972991 0.2439556
Q      0.29479306 -1.4922845  2.08187058 0.9117910
age   -0.08750552 -0.1365231 -0.03848799 0.0250094
fitg$res.t
              est       L95%        U95%        se
mu    11.67353273  8.4262007 14.92086481 1.6568325
sigma -0.28596558 -0.9223978  0.35046665 0.3247163
Q      0.29479306 -1.4922845  2.08187058 0.9117910
age   -0.08750552 -0.1365231 -0.03848799 0.0250094

Perhaps exponentiate should only be allowed when requesting covariate effects alone?
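
As an aside, one way to see which baseline parameters are estimated on a transformed scale is the distribution definition stored in the fitted object. This assumes the built-in distributions keep base R's log/identity functions in $dlist$transforms, which seems to be the case but I haven't checked exhaustively.

# TRUE marks parameters estimated on the log scale; for the gengamma fit above
# this should flag sigma only, consistent with res vs res.t.
setNames(
  vapply(fitg$dlist$transforms, function(f) identical(f, log), logical(1)),
  fitg$dlist$pars
)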

chjackson commented 4 years ago

This is a tricky one! Having a consistent scale is a good principle, but that conflicts with some other good principles.

One of these principles is to avoid confusion about parameterisation of distributions. Parametric models are so often presented in different parameterisations, e.g. there are several different ways of writing the Weibull and Gamma, which confuses people. We can't change what other software does, but we can try to be internally consistent, and consistent with core R. So a key principle of flexsurv is that the inputs and outputs for flexsurv's functions should have the same meaning as the parameters in the d/p/q/r functions used to define the parametric distribution. So I wouldn't want to have a parameter called "shape" in the output, whose primary estimate is given by default as the log shape. We could name it log(shape) I suppose, though it'd be fiddly to handle this for distributions that have parameters with transforms other than log.

Another principle is consistency with print.flexsurvreg - users will naturally print the object they have just created when they fit the model, and won't they expect a tidy function to just tidy up the output that they see? It's too late to change what print.flexsurvreg does.

So instead of a global "exponentiate" argument, I'd prefer to have two different arguments, because they do logically different things. For example they could be called "baseline.real" to transform the baseline distribution parameters to their real-line scales (while leaving the covariate effects alone), and "exponentiate.effects" to exponentiate the covariate effects (while leaving the baseline parameters alone). Though I'm up for suggestions of nicer names...

@topepo - I was going to ask if that was tidymodels-friendly - but then I checked the new broom docs and saw that list of currently acceptable argument and column names. This list seems like it will grow endlessly with different kinds of models! Is it OK to add new arguments like this that are logically descriptive of what they do? I don't think users should be forced to split up the output into baseline parameters and covariate effects - they're just generalisations of linear models, where it's usual to present intercepts together with covariate effects.

mattwarkentin commented 4 years ago

Hmm, you make great points. This is a good conversation to have now. With enough thought and discussion we should be able to create an API that is intuitive, consistent, and stable.

I agree that keeping consistent with the d/p/q/r functions and print.flexsurvreg() is a good principle to try and adhere to.

Maybe we should have an argument called transform that accepts a character vector whose values can be mixed and matched:

# Example
tidy(x, transform = c("baseline.real", "exponentiate.effects"))

Or something like that? We could also go the list-column route again and pack covariate effects and baseline parameters into list-columns, but maybe that adds unnecessary complexity. It probably also deviates from {broom} normalcy.
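
Here is a rough sketch of how a mixed-and-matched transform vector could be handled internally. Everything in it is illustrative rather than a proposal for the final design, including the transform names, the covariate flag identifying covariate-effect rows, and passing the x$res.t estimates in separately.

apply_transforms <- function(tidy_df, transform = NULL, res_t = NULL) {
  if (is.null(transform)) return(tidy_df)
  transform <- match.arg(transform, c("baseline.real", "coefs.exp"),
                         several.ok = TRUE)
  if ("coefs.exp" %in% transform) {
    # exponentiate covariate effects only (log HR -> HR)
    idx <- tidy_df$covariate
    tidy_df$estimate[idx] <- exp(tidy_df$estimate[idx])
  }
  if ("baseline.real" %in% transform && !is.null(res_t)) {
    # swap in the real-line (transformed) baseline estimates from x$res.t
    idx <- !tidy_df$covariate
    tidy_df$estimate[idx] <- res_t[idx, "est"]
  }
  tidy_df
}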

glm and lm do present intercepts and covariate effects together, but usually all parameters are on the same scale, right?

I'll wait to see what Max thinks about all this. I will also tag @simonpcouch to see if he wants to offer some insight.

simonpcouch commented 4 years ago

A few thoughts on conventions (or soon-to-be conventions) in broom...

I checked the new broom docs and saw that list of currently acceptable argument and column names. This list seems like it will grow endlessly with different kinds of models!

Definitely a fair criticism! We largely use the glossaries to ensure internal consistency (i.e., the same column names are used to describe the same kinds of output throughout the package). You're right, though, that sometimes new models require unique arguments and output columns.

Is it OK to add new arguments like this that are logically descriptive of what they do?

Totally fair game to add new arguments / column names that aren't in the glossaries for tidier methods that aren't included in broom "proper." The reason we've included those glossaries in the article is that they are utilized by the modeltests package, which provides generalized unit testing functions for the tidier methods implemented in broom, and we hope that the output of tidier methods exported from non-broom packages is as consistent with broom outputs as possible. We're hoping to make modeltests a more helpful tool for building/testing tidier methods exported outside of broom in the future. If you decide to use arguments/output columns that aren't in the glossaries but would still like to use the package for unit testing, feel free to open an Issue/PR there.

Also, not sure where the current source drafts for these methods are, but re:

conf.int = TRUE, conf.level = 0.95, exponentiate = TRUE

In broom, the convention is that conf.int defaults to FALSE, conf.level defaults to .95, and exponentiate defaults to FALSE (except for a couple cases for backward compatibility).

The discussion on the form of the exponentiation argument is a tricky one.

I don't think users should be forced to split up the output into baseline parameters and covariate effects - they're just generalisations of linear models, where it's usual to present intercepts together with covariate effects.

Would each of the columns in the output have a consistent meaning and definition regardless of whether they were describing baseline parameters or covariate effects? I'd say it's fine to report them together if so. Thinking about this especially re:

the statistic and p.value columns are quick hacks I added using Wald-type test statistics (estimate / std.err) and p-values based on N - k degrees of freedom from a t-distribution. This is completely liable to be changed or removed. I was just trying to match the expected column names for other {broom} tidiers.

No worries if a tidier method omits columns (e.g. p.value) that are common elsewhere but don't have an obvious definition in a certain model setting. If you do decide to include the statistic and p.value columns, I'd recommend writing custom documentation, rather than reusing the standard broom descriptions, to clarify how they were generated and what they mean.

My two cents for now. Hope this was helpful.🙂

chjackson commented 4 years ago

Thanks - that's helpful!

The issue I think is whether "estimate" implies "estimate on an unrestricted real scale, unless otherwise specified". The prominence of the "exponentiate" argument in the broom documentation might imply this. On the other hand, R packages tend to present parameter estimates on whatever scale the package author deems sensible! Real scales tend to be math/computer friendly, but less user friendly.

mattwarkentin commented 4 years ago

My biggest concern was that when a transformation is applied to a column, you would expect the transformation to be applied to every value in that column (as would be implied by exponentiate = TRUE), but this isn't strictly relevant for some of the values.

So I think it makes sense to allow users to request transformations to the baseline parameters separate from the covariate effects. To keep internally consistent with both flexsurv and broom, this would mean returning baseline parameters on their "natural" scale and covariate effects on their log-scale. Agreed, @chjackson?

mattwarkentin commented 4 years ago

Okay, so this took a little more fuss than I thought it would, but I actually feel pretty good about this implementation:

ex1 <- flexsurvspline(Surv(futime, fustat) ~ age, data = ovarian, k = 5)
ex2 <- flexsurvreg(Surv(futime, fustat) ~ age, data = ovarian, dist = "gengamma")
tidy(ex1)
# A tibble: 8 x 7
  term   estimate std.error statistic  p.value conf.low conf.high
  <chr>     <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 gamma0  -32.3     13.6        NA    NA       -5.88e+1    -5.71 
2 gamma1    4.00     2.80       NA    NA       -1.49e+0     9.49 
3 gamma2    5.15     4.22       NA    NA       -3.12e+0    13.4  
4 gamma3  -18.5     47.5        NA    NA       -1.12e+2    74.6  
5 gamma4   14.4    103.         NA    NA       -1.86e+2   215.   
6 gamma5  -19.9    140.         NA    NA       -2.94e+2   254.   
7 gamma6   28.9     90.6        NA    NA       -1.49e+2   206.   
8 age       0.175    0.0510      3.43  3.03e-4  7.50e-2     0.275
tidy(ex2)
# A tibble: 4 x 7
  term  estimate std.error statistic   p.value conf.low conf.high
  <chr>    <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 mu     11.7       1.66       NA    NA           8.43    14.9   
2 sigma   0.751     0.244      NA    NA           0.398    1.42  
3 Q       0.295     0.912      NA    NA          -1.49     2.08  
4 age    -0.0875    0.0250     -3.50  0.000234   -0.137   -0.0385
tidy(ex2, transform = "baseline.real")
# A tibble: 4 x 7
  term  estimate std.error statistic   p.value conf.low conf.high
  <chr>    <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 mu     11.7       1.66       NA    NA           8.43    14.9   
2 sigma  -0.286     0.325      NA    NA          -0.922    0.350 
3 Q       0.295     0.912      NA    NA          -1.49     2.08  
4 age    -0.0875    0.0250     -3.50  0.000234   -0.137   -0.0385
tidy(ex2, transform = "coefs.exp")
# A tibble: 4 x 7
  term  estimate std.error statistic   p.value conf.low conf.high
  <chr>    <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 mu      11.7       1.66      NA    NA           8.43     14.9  
2 sigma    0.751     0.244     NA    NA           0.398     1.42 
3 Q        0.295     0.912     NA    NA          -1.49      2.08 
4 age      0.916     1.03      -3.50  0.000234    0.872     0.962
tidy(ex2, transform = c("coefs.exp", "baseline.real"))
# A tibble: 4 x 7
  term  estimate std.error statistic   p.value conf.low conf.high
  <chr>    <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 mu      11.7       1.66      NA    NA           8.43     14.9  
2 sigma   -0.286     0.325     NA    NA          -0.922     0.350
3 Q        0.295     0.912     NA    NA          -1.49      2.08 
4 age      0.916     1.03      -3.50  0.000234    0.872     0.962
tidy(ex2, transform = "coefs.exp", conf.level = 0.99)
# A tibble: 4 x 7
  term  estimate std.error statistic   p.value conf.low conf.high
  <chr>    <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 mu      11.7       1.66      NA    NA           7.41     15.9  
2 sigma    0.751     0.244     NA    NA           0.326     1.73 
3 Q        0.295     0.912     NA    NA          -2.05      2.64 
4 age      0.916     1.03      -3.50  0.000234    0.859     0.977
tidy(ex2, transform = "coefs.exp", conf.int = FALSE)
# A tibble: 4 x 5
  term  estimate std.error statistic   p.value
  <chr>    <dbl>     <dbl>     <dbl>     <dbl>
1 mu      11.7       1.66      NA    NA       
2 sigma    0.751     0.244     NA    NA       
3 Q        0.295     0.912     NA    NA       
4 age      0.916     1.03      -3.50  0.000234
tidy(ex2, pars = "coefs")
# A tibble: 1 x 7
  term  estimate std.error statistic  p.value conf.low conf.high
  <chr>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 age    -0.0875    0.0250     -3.50 0.000234   -0.137   -0.0385
tidy(ex2, pars = "bdist")
# A tibble: 3 x 7
  term  estimate std.error statistic p.value conf.low conf.high
  <chr>    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 mu      11.7       1.66         NA      NA    8.43      14.9 
2 sigma    0.751     0.244        NA      NA    0.398      1.42
3 Q        0.295     0.912        NA      NA   -1.49       2.08

I welcome any feedback.

mattwarkentin commented 4 years ago

While we're at it, here is the current state of the glance() method...

glance(ex2)
# A tibble: 1 x 8
      N events censored trisk    df logLik   AIC   BIC
  <int>  <int>    <int> <dbl> <int>  <dbl> <dbl> <dbl>
1    26     12       14 15588     3  -90.0  186.  190.
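
For reference, a sketch of how such a glance() method might assemble those columns from a flexsurvreg object; the component names used here ($N, $events, $trisk, $npars, $loglik, $AIC) are assumptions based on what print.flexsurvreg reports, and BIC is computed by hand rather than via a BIC() method.

glance.flexsurvreg <- function(x, ...) {
  tibble::tibble(
    N = x$N,
    events = x$events,
    censored = x$N - x$events,
    trisk = x$trisk,   # total time at risk
    df = x$npars,      # number of estimated parameters
    logLik = x$loglik,
    AIC = x$AIC,
    BIC = -2 * x$loglik + x$npars * log(x$N)
  )
}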

Happy to hear what you think.

chjackson commented 4 years ago

Both of these look really nice to me! The "transform" argument nicely allows new kinds of transforms to be added cleanly in the future without creating a new argument.

Just spotted that "bdist" and "baseline.real" should be named consistently like "coefs" and "coefs.exp" are. Mild preference for changing "bdist" to "baseline".

mattwarkentin commented 4 years ago

Closed with #77