JuliaPlots / StatsPlots.jl

Statistical plotting recipes for Plots.jl
Other
436 stars 88 forks source link

Plots for Statistical Models #290

Open Tokazama opened 4 years ago

Tokazama commented 4 years ago

As I brought up on slack, it would be nice to have a way of directly creating plots form fitted statistical models. I've been able to make a plot similar to this for linear models:

image

There are other plotting packages that allow you to throw your x and y in there and get something similar but this means that you can't actually plot the coefficients of a fitted model that may have other variables that influence estimated coefficients. Just for reference, I've been using GLMs and mixed linear models.

mkborregaard commented 4 years ago

Thanks! So this plot is very simple (you could essentially get this with smooth = true in the plot command). But if we want general plotting of statistical objects:

  1. Would you expect the model to plot the data points as well, or just the model?
  2. What if x is a CategorialArray- would you expect this to dispatch to scatter points with error gars instead?
  3. What if there are two independent variables - would you then expect a 3d plot with a plane? Or a general figure with a scatter plot for each dependent variable? Or would you expect to pass indices or names of the variables to include in the plot to the plotting function?
  4. If we do show the points, would you then expect them to show standardized residuals from the non-shown variables? I'm not really sure in that context what you mean with "throw your x and y in there"
  5. How would you expect a mixed models plot to deal with / show random effects?
Tokazama commented 4 years ago

What I'm personally trying to accomplish is just a line with the error band representing the fixed coefficient estimates. I've been able to hack together something that rips into the guts of a mixed model and do that when I know what coefficient is a slope or intercept. However, I know that what I'm asking requires a lot of other decision to be made for it to work for other people flexibly. So I'm assuming you want me to answer those questions as potential design implementations and not just y special case. Feel free to rein me in if I'm off the mark here.

I would think this would be a two step process. The first part determines what subplots to do given the type of model (e.g., scatter, boxplot, etc.) and then there would be a more deterministic step that can use generic functions (hopefully) to extract the appropriate coefficients from the model to make the series.

So the code would be something like:

plot(fm::StatisticalModel) = [plot(fm, seriestype=s) for s in determine_series_types(fm)]

Now, to answer your questions more directly:

  1. If I wanted the data points I would do scatter(fm) I'd expect it to be the equivalent of scatter( modelmatrix(fm), predict(fm)). If I wanted the raw values I'd just use the existing functionality for plotting raw variables.
  2. If I called plot(fm) and x was categorical I'd expect the determine_series_types to return a boxplot (but that's just personal preference)
  3. I if fm was multivariate I think scatter(fm) would return multiple plots, but it might make sense to also have scatter(fm, coefname=:x1) to plot a single variable. I'm not entirely sure how to do that programmatically because if there were something like and ANCOVA then you could have multiple intercepts and slopes.
  4. Is this question in reference to estimating the error band and if it should be derived from the entire model or just error of the coefficients? I would assume that the error should only reflect the coefficient being plotted. By "throw your x and y in there" I just meant that we can make these plots right now but they don't necessarily have anything to do with our statistical models because they aren't using any of the calculated coefficients.
  5. My personal preference on mixed model plots is probably not the right answer in general. I want my fixed effects plotted like you would any GLM but then I want an extra plot plane showing the variance of each random effect. I know there are problems with completely separating the fixed effects terms from the random effects because one effects another but, but I haven't seen a simple enough plot for that to present to a biologist or doctor (which is what I need).
mkborregaard commented 4 years ago

I'm assuming you want me to answer those questions as potential design implementations

Exactly, I'm leveraging the fact that you have a present interest to pick your brain for ideas :-) In terms of your answer to 5, do you have an example of anyone doing this in practice?

Tokazama commented 4 years ago

I've seen several of varying quality.

The last one is probably the one I trust the most but there aren't many plots there (first

Tokazama commented 4 years ago

I've been thinking about how to actually get this working and have a small example plotting just the confidence band. I've tried to be obnoxiously verbose about the variable names so it's self explanatory.

I think this part is pretty close to being the ideal implementation given the correct arguments:

function xi_error(xi, x̄, stderror_intercept, stderror_slope)
    return sqrt(stderror_intercept + ((x_i - x̄) * stderror_slope)^2)
end

function linear_error_band!(sc, x, ŷ, x̄, stderror_intercept, stderror_slope)
    band_error = [xi_error(x_i) for x_i in x]
    band!(sc, x, ŷ .- band_error, ŷ .+ band_error)
end

The problem is that I don't think there is a common syntax that can be used to extract these. We'd need to be able to do something like:

# name is the model's coefname for x
function linear_error_band!(sc, fm::RegressionModel, name)
    x = modelmatrix(fm)[:,name]
    linear_error_band!(
        sc,
        x,
        response(fm),
        mean(x),
        stderror_intercept(fm, name),
        stderror_slope(fm, name)
       )
end
mkborregaard commented 4 years ago

I was thinking more about using the inbuilt predict method, we just need an accessor to get the X values. For the bivariate case

using GLM, StatsModels, StatsPlots, DataFrames, RDatasets
iris = dataset("datasets", "iris")
mod = lm(@formula(PetalLength ~ PetalWidth), iris)

@recipe function f(mod::RegressionModel)
    newx = [ones(200) range(extrema(mod.model.pp.X[:,2])..., length = 200)]
    newy, l, u = predict(mod, newx, interval = :confidence)
    @series begin
            newx[:,2], newy
    end
    @series begin
        primary := false
        fillalpha --> 0.5
        linealpha := 0
        fillto := l
        newx, u
    end
end

plot(mod)
@df iris scatter!(:PetalWidth, :PetalLength, ms = 2, c = :black)

lr

mkborregaard commented 4 years ago

Note that I'm having trouble with ribbon so I'm using an annoying workaround

Tokazama commented 4 years ago

I don't know if it's useful but I have a an example here that can be used to produce a ribbon and line given two continuous variables.

Tokazama commented 4 years ago

we just need an accessor to get the X values.

This is where I've been getting stuck too. I've been doing stuff like mod.mm.m to access the model matrix (since GLM doesn't use modelmatrix for some reason).

I've been working on some stuff in StatsModels that could hopefully make it easier to map all this information back to the original formula terms, but I don't anticipate quick changes in the JuliaStats repos even if I PR's into StatsModels.

mkborregaard commented 4 years ago

oh they can be quick. I think making the statistical objects offer the correct data back with accessors is the right way to go here

mkborregaard commented 4 years ago

I could make a PR based on this to start with and we could hash it out from there?

using RecipesBase

@recipe function f(mod::RegressionModel)
    newx = [ones(200) range(extrema(mod.model.pp.X[:,2])..., length = 200)]
    newy, l, u = predict(mod, newx, interval = :confidence)
    ribbon := (u-newy, newy-l)
    label --> "regression"
    newx[:,2], newy
end

# use as
using GLM, StatsModels, StatsPlots, DataFrames, RDatasets
iris = dataset("datasets", "iris")
mod = lm(@formula(PetalLength ~ SepalLength), iris)
plot(mod, lw = 4, legend = :topleft)
@df iris scatter!(:SepalLength, :PetalLength, ms = 4, group = :Species, msc = :white)

lr

Tokazama commented 4 years ago

I'd be up for that. I'm sure a lot of people have opinions about this so it would be nice to not solve everything and find out that people disagree with it.

BTW, I have a PR to StatsModels.jl here that starts to solve the problem of indexing formulas. It's definitely a breaking change (switches coefnames outputs to be symbols instead of strings, keeping types consistent between coefficients and terms.), which is why I don't anticipate people being crazy about it.

A less robust but temporary solution to deriving information about the variables we want to plot is something like this:

modelmatrix(mod)[:, findifirst(==(coefficient_name_i_want_to_plot), coefnames(mod))]
kleinschmidt commented 4 years ago

Might make more sense to define the recipes based on the combination of term and model. Then you could have different recipes for continuous vs. categorical terms. And the term gives you all the information you need to generate the x axis (e.g., the ContinuousTerm struct stores the min and max of the original data that generated the schema).

kleinschmidt commented 4 years ago

Getting the correct data back via accessors would be a lot easier with some API changes we've been considering for a long time (requiring that models hold onto their @formula if they have one, and return it via accessors). But there hasn't been a lot of movement on that since everyone's busy. I think @nalimilan made some progress adapting GLM.jl to use this pattern (https://github.com/JuliaStats/GLM.jl/pull/339) but there are some wrinkles to iron out about how to deal with missing data in the input, things like that.

Tokazama commented 4 years ago

Might make more sense to define the recipes based on the combination of term and model. Then you could have different recipes for continuous vs. categorical terms.

This sounds like a good idea. We can build plots around StatsModels terms and then it's easier for packages like GLM to make a custom method for something like plot(::LinPredModel). Maybe in the long run it could also incentivize people to use a common syntax like StatsModels so they get easier access to plot supports for their packages.