JuliaStats / GLM.jl

Generalized linear models in Julia

nobs() should be number of obs; wobs() should be current nobs #259

Open iwelch opened 6 years ago

iwelch commented 6 years ago

nobs should probably return nrow(m.mf.df), an integer. Otherwise, the name seems like a misnomer. It is also unexpected to get a floating-point value back in standard (unweighted) use.

The current nobs could instead be named wobs. With all weights equal to 1, it returns the same value as the proposed nobs(), albeit as a floating-point number.

/iaw

pdeffebach commented 6 years ago

To compare with Stata:

reg y x [pw = w] displays the sum of weights but does not store it in e().
svy: reg y x, after svyset [pw = w], does store it in e(): e(N) holds the number of rows and e(N_pop) the sum of weights.

nalimilan commented 6 years ago

As noted on Discourse:

I’m afraid it’s more complex than that. For example, with frequency/replicate weights, the apparent “number of observations” doesn’t have any meaning; it’s just an artifact of how the data has been compressed to save space. So it would be misleading to have nobs return that.

A solution would be to have a keyword argument to request the (unweighted) number of rows.
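
To make the frequency-weight point concrete, here is a minimal sketch in plain Julia. It makes no GLM.jl calls, and the variable names and data are made up purely for illustration. The same six observations are stored once in full and once compressed into three rows with frequency weights:

```julia
# Hypothetical data: six observations of a single variable, stored in full.
y_full = [1.0, 1.0, 2.0, 2.0, 2.0, 3.0]

# The same data compressed: one row per distinct value plus a frequency weight.
y_compressed = [1.0, 2.0, 3.0]
freq_weights = [2, 3, 1]

rows_full = length(y_full)              # rows in the uncompressed data
rows_compressed = length(y_compressed)  # rows after compression
weighted_obs = sum(freq_weights)        # observations the compressed rows represent

# The weighted mean of the compressed data matches the plain mean of the full data.
mean_full = sum(y_full) / rows_full
mean_compressed = sum(y_compressed .* freq_weights) / weighted_obs

println((rows_full, rows_compressed, weighted_obs))
println(mean_full ≈ mean_compressed)
```

The row count drops from 6 to 3 under compression while the sum of weights stays at 6, which is why the row count alone says little about the sample in the frequency-weight case.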

pdeffebach commented 6 years ago

Would you be open to exporting a function that inspects the model frame in the output for the number of rows in the underlying data set?

However, I understand that we want to be agnostic about the input data type.

nalimilan commented 6 years ago

We would need to require a specific layout from all models to do that (https://github.com/JuliaStats/StatsModels.jl/issues/32). Barring that solution, it doesn't seem too hard to require models to implement that simple method.

pdeffebach commented 6 years ago

Thanks for the link. If the officially sanctioned API for all models is still moving, I would like for some sort of unweightedobs() function to be implemented.

However, I generally write closures for any regression function, including a custom output struct, so it's not a huge deal if I have to write a function to get the unweighted N.

pdeffebach commented 6 years ago

For the sake of completeness:

felm in the R package lfe returns a model object from which you can read:

m$M: number of rows in the matrix
m$weights: the vector of weights used in the regression.