bkamins closed this 12 months ago
Should be addressed by the improved preprocessing documentation in: https://evovest.github.io/EvoTrees.jl/dev/tutorials/logistic-regression-titanic/#Preprocessing
Thank you.
I would consider adding the following details to the documentation (or changing the behavior of the package):
- [...] `Missing` then the feature is dropped; I recommend that you either document this (i.e. that the user needs to call e.g. `disallowmissing` function on the input) or that you actually do the check if there are any missing; this is a more general issue, e.g. if column eltype is `Any` (which sometimes happens in practice) users should be informed what to do (simplest is to do `identity.(col)` to force establishment of a more precise eltype)
- `Bool <: Real` so it would be good to explain why `Bool` is explicitly mentioned (I assume it is handled differently internally?)
- [...] `missing` value for unordered categorical variable then I think you could allow `missing` in such a variable (and just treat it as a separate level); currently you drop it - this is a soft recommendation; also I would make a separate recommendation for case of `missing` in `Categorical` column (as then probably it is better to change `missing` to a separate level given your current implementation of the package) - the recommendation with a dummy variable is good for `Real` variables. (I hope it is clear what I want to say here 😄)
- [...] `missing` is disallowed). I would make it more precise what input is assumed for target input.

@jeremiedb - any thoughts on this? I would make a blogpost about updates of EvoTrees after you decide what to do with this issue and tag a release. Hopefully it could help promote this excellent package.
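For readers following the eltype points in the list above, here is a minimal sketch of what the narrowing looks like in practice. It uses only Base Julia; the column names and values are made up for illustration, and the `convert` call is used here as a stand-in for what `disallowmissing` does, so treat it as an assumption rather than the package's own recommendation.

```julia
# Hypothetical column whose eltype allows `missing`, although no value is missing.
age = Union{Missing, Float64}[22.0, 38.0, 26.0]
@show eltype(age)                  # Union{Missing, Float64}

# Narrow the eltype to the non-missing type.
# This throws an error if a `missing` is actually present in the column.
age_strict = convert(Vector{nonmissingtype(eltype(age))}, age)
@show eltype(age_strict)           # Float64

# Hypothetical column that ended up with eltype `Any`.
fare = Any[7.25, 71.28, 7.92]

# Broadcasting `identity` re-infers the eltype from the actual values.
fare_narrow = identity.(fare)
@show eltype(fare_narrow)          # Float64

# `Bool` is a subtype of `Real`, which is why listing it separately needs explaining.
@show Bool <: Real                 # true
```

If a column does contain `missing`, the `convert` call fails instead of silently dropping anything, which makes the problem visible before the data reaches the model.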
For now my take would be to add a "Missing data" section in the docs (alongside the Reproducibility one) that clarifies the behavior of the algo. I'm for now reluctant to perform further transformations on the input data or to make any assumption about what the intent of the user would have been. My perspective is for ML algos to be limited to the algo part, while the handling of missings and the like is left to the preprocessing part, which I see as a topic of its own within a modeling pipeline. So I'd prefer to direct users to MLJ, TableTransforms or self-defined preprocessing. I agree though on the importance of clarifying the supported feature and column eltypes. I'll look to have those docs updated within 2-3 days.
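As an illustration of what such self-defined preprocessing could look like, here is a minimal sketch in Base Julia plus the `Statistics` standard library. The feature names, the values, the choice of the median as replacement and the `"missing"` level label are all assumptions made for the example, not something prescribed by EvoTrees.

```julia
using Statistics  # standard library, provides `median`

# Hypothetical numeric feature with missing values.
age = [22.0, missing, 26.0, 35.0, missing]

# 0-1 indicator recording where values were missing,
# so that this information survives the replacement below.
age_ismissing = Int.(ismissing.(age))

# Replace each missing value by the median of the observed values.
age_filled = coalesce.(age, median(skipmissing(age)))

# Hypothetical unordered categorical feature:
# treat `missing` as a level of its own instead of dropping the column.
embarked = ["S", "C", missing, "Q", "S"]
embarked_filled = coalesce.(embarked, "missing")

@show eltype(age_filled) eltype(embarked_filled)
```

After these steps neither column's eltype allows `missing` any more, so both can be passed on as ordinary numeric and categorical features.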
Sure - if docs are precise what is done in algo and what has to be done in pre-processing this is also OK.
Let me know if you think the above PR provides satisfactory clarifications on the handling of missings: https://evovest.github.io/EvoTrees.jl/dev/#Missing-values
Looks good. Thank you!
I checked this part of your tutorial:
https://github.com/Evovest/EvoTrees.jl/blob/main/docs/src/tutorials/logistic-regression-titanic.md?plain=1#L34
and
https://github.com/Evovest/EvoTrees.jl/blob/main/docs/src/tutorials/logistic-regression-titanic.md?plain=1#L35
And it was not fully clear to me what the package maintainers recommend as practice in both cases, i.e. what should be the canonical way to preprocess string variables and the canonical way to handle `missing`. (For example, in the case of missing, if a replacement such as the one suggested in the docs is done, another 0-1 feature indicating where a missing value was should probably be added to avoid losing information.)
Also, thank you for using DataFrames.jl :). From this perspective you could write (this is a mild suggestion):
or maybe just simply:
See for the second performance point: