JuliaStats / GLM.jl

Generalized linear models in Julia

`wts` option doesn't automatically subset when a DataFrame has missing values #255

Open pdeffebach opened 6 years ago

pdeffebach commented 6 years ago

If I have a weight variable that's part of the DataFrame, I might want to do this:

glm(@formula(y ~ x), df, ..., wts = df[:weight])

However this doesn't work if x or y contain missing values.

ERROR: LoadError: DimensionMismatch("wts must have length 4011 or length 0 but was 9447")

glm should subset a weights vector the same way it subsets the DataFrame.

Additionally, if a weight vector is of type Union{T, Missing} GLM throws the following error:

ERROR: LoadError: TypeError: in #fit, in typeassert, expected Array{Float64,1}, got Array{Union{Missing, Float64},1}

It seems like GLM should use duck-typing here, and chug along even if the weight vector could conceivably contain missing values.
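In the meantime, a workaround is to do the subsetting by hand before calling glm. A rough sketch, assuming DataFrames.jl and GLM.jl, with :y, :x and :weight as placeholder column names:

using DataFrames, GLM

# Drop incomplete rows ourselves so the weights line up with the rows that
# actually get fitted, and narrow the weights to a plain Float64 vector so
# the eltype is no longer Union{Float64, Missing}.
complete = dropmissing(df, [:y, :x, :weight])
wts = Float64.(complete.weight)
m = glm(@formula(y ~ x), complete, Normal(), IdentityLink(); wts = wts)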

nalimilan commented 6 years ago

Rather than duck typing, we could at least do <:Union{Real,Missing} (as in many other places in JuliaStats packages).
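Something along these lines (a hypothetical helper, not actual GLM.jl code):

# Accept the wider element type, but fail clearly if any missing weights are
# still present, and hand the rest of the code a plain Float64 vector.
function prepare_wts(wts::AbstractVector{<:Union{Real,Missing}})
    any(ismissing, wts) && throw(ArgumentError("weights must not contain missing values"))
    return convert(Vector{Float64}, wts)
end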

But I'm afraid there's a more fundamental issue here: missing values are skipped by the code in StatsModels, and AFAIK GLM has no idea about them. We need to change this. Potentially related to https://github.com/JuliaStats/StatsModels.jl/issues/32.

andreasnoack commented 6 years ago

I think it would be nice if we could leave missing values out of this package. I don't know if it is practical but I think we should try to come up with a way to handle this at the StatsModels level such that the weights vector doesn't have any missing values once it is here in this package. @kleinschmidt Have you thought about weights?

pdeffebach commented 6 years ago

I know this has been repeated a lot, but ultimately, what I would really like is for weights and clusters to be inside the formula object and have all of those handled at once. If that's feasible, that seems like the best solution in the long run.

nalimilan commented 6 years ago

It sounds a bit unusual to put weights in the formula, but why not... In terms of implementation that's not really simple though, since it means GLM needs to be aware of StatsModels to extract the weights from the object it returns. That's completely different from the current system.

Whatever the chosen design, GLM doesn't and shouldn't have to deal with missing values in weights. What it would need to know is which elements have to be dropped because they have been skipped by StatsModels, or directly get the subsetted weights vector from StatsModels.

pdeffebach commented 6 years ago

GLM needs to be aware of StatsModels to extract the weights from the object it returns.

Could you expand on this? A weight is just like a covariate: it's a vector of observed data you use to estimate a coefficient. Neither clustering nor weights are separate from covariates in my mind.

dmbates commented 6 years ago

The reason for having an intermediate step of Original Data Frame -> Model Frame -> Model Matrix in R was to isolate operations like skipping rows with missing data and only considering variables that are used in the model in the first step. Evaluations of model matrices, model response, contrasts, weights, etc. are done after that step.

dmbates commented 6 years ago

That was also the reason for the bizarre "Standard Non-Standard R Evaluation Model" that I mentioned in a slack discussion with @kleinschmidt . The idea was to have all the arguments to the model-fitting function available to be examined at the time of creating a model frame. Of course R also has a delayed evaluation model which would allow these kinds of shenanigans, but it means that, effectively, it is impossible to compile an R function because you can't tell what the types of the arguments are going to be until you are in the middle of the function. So I am not advocating this behavior - I am just saying that it has special treatment in R.

pdeffebach commented 6 years ago

That makes sense, thanks.

Evaluations of model matrices, model response, contrasts, weights, etc. are done after that step.

I guess my point is a philosophical one, in that weights are just as much a part of how I imagine a "model" as any covariate. So they should be included in the @formula call.

nalimilan commented 6 years ago

Could you expand on this? A weight is just like a covariate: it's a vector of observed data you use to estimate a coefficient. Neither clustering nor weights are separate from covariates in my mind.

Currently GLM just takes the model matrix from StatsModels, and combines it with the weights which are provided directly. If weights are moved to the formula, GLM will also have to take them from StatsModels. Or StatsModels will have to call GLM and pass it the weights, for example as a keyword argument. The decisive point is whether GLM needs to be aware of the existence of StatsModels or not (currently it isn't, but see JuliaStats/StatsModels.jl#32).
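To make the current division of labor concrete, here is a rough sketch using the ModelFrame/ModelMatrix API (schematic, not GLM's literal internals):

using StatsModels

f  = @formula(y ~ x)
mf = ModelFrame(f, df)    # rows with missing values are dropped here, inside StatsModels
X  = ModelMatrix(mf).m    # the design matrix GLM ends up fitting against
# wts = df.weight         # ...while the weights bypass this step entirely,
                          #    which is where the length mismatch comes from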

pdeffebach commented 6 years ago

I see. StatsModels just knows how to make (X, y). This is all @formula serves to keep track of. And the whole pipeline would need to be expanded to allow (X, y, weight, cluster, strata).

Sounds good. Let me know if there are any good introductory PRs for this if we decide to go that direction.

nalimilan commented 6 years ago

Maybe one simple solution after https://github.com/JuliaStats/StatsModels.jl/pull/71 would be to allow packages to declare that a special function term (like weights) should be passed as a keyword argument to the package's fitting function after dropping observations with missing values, or stored as a field in the model, using the same name. That would be general enough to accommodate the needs of various model families.
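Purely to illustrate what that could look like from the user's side (hypothetical syntax, nothing here is implemented):

# A hypothetical weights() function term: StatsModels would strip it out of
# the formula, subset w together with the other columns when dropping
# missings, and forward it to the fitting function as a keyword argument.
fit(GeneralizedLinearModel, @formula(y ~ x1 + x2 + weights(w)), df, Binomial())

# ...which would lower to roughly:
# fit(GeneralizedLinearModel, X, y, Binomial(); wts = w_subsetted)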

kleinschmidt commented 6 years ago

I think it would definitely be possible to treat weights that way (using a special function within the formula), but it feels a little weird that the syntax would look so different. Then again, the fit(GLM, X, y) syntax looks very different from fit(GLM, @formula(y ~ x1 + x2 + x3)), so maybe that's not such a big deal. Another possibility would be to allow keyword args in the formula expression (as in JuliaStats/StatsModels.jl#21).

Along those lines it might also be worth thinking about a general kind of "data transformer" component for the model fitting API, which could be a formula but could be something else too. It seems like one of the components of that would have to be expanding data in "raw" form (e.g., a table) into a combination of positional and/or keyword arguments which get spliced into calls to predict, fit, etc.

kleinschmidt commented 6 years ago

Also, regarding @dmbates's comment about the motivation for the "standard non-standard evaluation": that might motivate having a separate @model or @fit API like @fit(GLM, y ~ x1+x2+x3, weights=~wt, another_kw=1) that similarly has access to all the arguments; there are problems there with being able to store the extracted transformations so that they can be re-used for predict later, unless a wrapper type is created.

But perhaps that could expand to something like fit(GLM, Deferred(), Deferred(), weights=Deferred(), another_kw=1, transformer=@formula(...)) where Deferred is some kind of singleton type like Missing that tells fit to get values from the transformer, or something. I dunno, just spitballin'.