Evovest / EvoTrees.jl

Boosted trees in Julia
https://evovest.github.io/EvoTrees.jl/dev/
Apache License 2.0
175 stars 21 forks source link

Document how missing values should be handled by user #247

Closed bkamins closed 12 months ago

bkamins commented 1 year ago

I checked this part of your tutorial:

https://github.com/Evovest/EvoTrees.jl/blob/main/docs/src/tutorials/logistic-regression-titanic.md?plain=1#L34

and

https://github.com/Evovest/EvoTrees.jl/blob/main/docs/src/tutorials/logistic-regression-titanic.md?plain=1#L35

And it was not fully clear for me what is the recommended practice for both cases from the package maintainers. I.e. what should be the canonical way to preprocess string variables and the canonical way to handle missing. (for example in case of missing probably, if such a replacement as suggested in the docs is done another 0-1 feature indicating where a missing value was would be added to avoid loosing information).

Also, thank you for using DataFrames.jl :). From this perspective you could write (this is a mild suggestion):

transform!(df, :Sex => ByRow(!=("male")) => :Sex) # gives Bool column, but I guess it is OK for your package
transform!(df, :Age => (x -> coalesce.(x, mean(skipmissing(df.Age)))) => :Age) # this will be more efficient as it avoids computing the mean repeatedly

or maybe just simply:

df.Sex = df.Sex .!== "male"
df.Age = coalesce.(df.Age, mean(skipmissing(df.Age)))

See for the second performance point:

julia> x = rand([1.0, 2.0, missing], 10^4);

julia> f1(v) = ismissing(v) ? mean(skipmissing(x)) : v
f1 (generic function with 1 method)

julia> f2(x) = coalesce.(x, mean(skipmissing(x)))
f2 (generic function with 1 method)

julia> @time f1.(x);
  0.107405 seconds (20.01 k allocations: 390.969 KiB)

julia> @time f1.(x);
  0.110699 seconds (20.01 k allocations: 390.969 KiB)

julia> @time f2(x);
  0.000142 seconds (2 allocations: 78.172 KiB)

julia> @time f2(x);
  0.000124 seconds (2 allocations: 78.172 KiB)
jeremiedb commented 1 year ago

Should be adressed by improved preprocessing documentation in: https://evovest.github.io/EvoTrees.jl/dev/tutorials/logistic-regression-titanic/#Preprocessing

bkamins commented 1 year ago

Thank you.

I would consider adding the following details to the documentation (or changing the behavior of the package):

bkamins commented 1 year ago

@jeremiedb - any thoughts on this? I would make a blogpost about updates of EvoTrees after you decide what to do with this issue and tag a release. Hopefully it could help promote this excellent package.

jeremiedb commented 1 year ago

For now my take would be to add a "Missing data" section in the docs (along the Reproducibility one) that clarifies the behavior of the algo. I'm for now reluctant to perform further transformations to the input data or make any assumption of what the intent of the user would have been. My perspetive is for ML algos to be limited to the algo part, while the handling of missings and the likes to be handled by the preprocessing part, which I conceive as a topic of its own within a modeling pipeline. So I'd prefer to direct users to MLJ, TableTransforms or of self-defined preprocessing. I agree though on importance add clarification on the supported feature and column eltypes. I'll look to have those docs updated within 2-3 days.

bkamins commented 1 year ago

Sure - if docs are precise what is done in algo and what has to be done in pre-processing this is also OK.

jeremiedb commented 12 months ago

Let me know if you think the above PR provides satisfying clarificationson the handling of missings: https://evovest.github.io/EvoTrees.jl/dev/#Missing-values

bkamins commented 12 months ago

Looks good. Thank you!