JuliaStats / StatsModels.jl

Specifying, fitting, and evaluating statistical models in Julia
248 stars 31 forks source link

Incorrect Default Coding of Union{Missing, Int or Float64} #145

Open clintonTE opened 5 years ago

clintonTE commented 5 years ago

Pretty straight forward. I would expect the results of these to be the same:

function mwe(N=10)
  df = DataFrame(x = rand(N),
    zm = Vector{Union{Missing, Int}}(rand(1:4,N)),
    z = Vector{Int}(rand(1:4,N)))

  println("W/out missings")
  f = @eval(@formula(x ~ z))
  f = apply_schema(f, schema(f,df), StatisticalModel)
  m = modelcols(f.rhs, df)
  display(m)

  println("With missings")
  f = @eval(@formula(x ~ zm))
  f = apply_schema(f, schema(f,df), StatisticalModel)
  m = modelcols(f.rhs, df)
  display(m)
end

mwe()

OUTPUT:

W/out missings
10×2 Array{Float64,2}:
 1.0  3.0
 1.0  3.0
 1.0  2.0
 1.0  2.0
 1.0  1.0
 1.0  2.0
 1.0  4.0
 1.0  1.0
 1.0  4.0

With missings
10×4 Array{Float64,2}:
 1.0  0.0  1.0  0.0
 1.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0
 1.0  0.0  0.0  0.0
 1.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0
 1.0  0.0  1.0  0.0
 1.0  1.0  0.0  0.0
 1.0  0.0  0.0  1.0
 1.0  0.0  1.0  0.0
kleinschmidt commented 5 years ago

Note that if you pass ContinuousTerm as a hint to the schema you should get the desired result:

julia> modelcols(apply_schema(f, schema(f, df, Dict(:zm => ContinuousTerm)), StatisticalModel), df)[2]
10×2 Array{Float64,2}:
 1.0  1.0
 1.0  2.0
 1.0  2.0
 1.0  3.0
 1.0  3.0
 1.0  3.0
 1.0  3.0
 1.0  2.0
 1.0  3.0
 1.0  1.0

That being said, it might make sense to widen the criterion for whether something is continuous or not to be Union{Missing, <:Real} instead of just <:Real. And leave it up to the user to decide how to handle missings in the model matrix that results...

kleinschmidt commented 5 years ago

Actually, if there are any missing values it will error:

julia> df.zm[2] = missing
missing

julia> modelcols(apply_schema(f, schema(f, df, Dict(:zm => ContinuousTerm)), StatisticalModel), df)[2]
ERROR: MethodError: no method matching copy(::Missing)
matthieugomez commented 5 years ago

Still, the issue is what happens when the variable is Vector{Union{<:Real, Missing} but does not contain missing values. This is a pretty common situation, for instance:

using DataFrames
df = DataFrame(y = [missing, 1, 2])
idx = completecases(df)
subdf = df[idx, :]
clintonTE commented 5 years ago

Still, the issue is what happens when the variable is Vector{Union{<:Real, Missing} but does not contain missing values. This is a pretty common situation, for instance:

using DataFrames
df = DataFrame(y = [missing, 1, 2])
idx = completecases(df)
subdf = df[idx, :]

This. A lot of my regression code follows a similar pattern before getting the model matrix. Also worth pointing out that the old pattern of ModelMatrix(ModelFrame(.)) treated Vector{Float64} and Vector{Union{Float64,Missing}} the same (as continuous), at least in the original use case.

Moreover, I really like how StatsModels has had common-sense rules for continuous variables in most reasonable cases. Some intuition for ideal behavior: A typical "raw" dataframe model and formula will involve Dates, Ints, Symbols, Strings, and Float64s, and the various permutations combined with Missing. Except for T<:Union{Number,Missing}, I would expect Dummy Coding to be default and forT<:Union{Number, Missing} I would think it would be Continuous by default. I guess I would prefer that as a bright-line rule than relying on behind the scenes inference based on the assortment and uniqueness of values. (Of course, this implies that if the column in the DataFrame is already designated as a CategoricalValue or similar, that would also be obeyed.)

kleinschmidt commented 5 years ago

I think it makes sense to treat Union{Missing, <:Number} as continuous. I'll do a PR later this week or happily review one if someone else wants to make one before then!

matthieugomez commented 4 years ago

Note that for now

using StatsModels
N = 10_000_000
x = rand(N) + rand([0, missing], N)
df = (x = x,)
schema(Term(:x), df)
#julia(20271,0x116991dc0) malloc: can't allocate region
#:*** mach_vm_map(size=199992320073728, flags: 60000100) failed (error code=3)
#julia(20271,0x116991dc0) malloc: *** set a breakpoint in malloc_error_break to debug
#141 ERROR: OutOfMemoryError(
palday commented 4 years ago

@matthieugomez That's not too surprising given the underlying issue (and how it expands out), but thanks for the example.

matthieugomez commented 4 years ago

yes, just connecting it to https://github.com/FixedEffects/FixedEffectModels.jl/issues/99