TammoR closed 5 years ago
Totally agree for the feature selection case. I found it problematic when using the .complete()
functionality, where the function is used to identify columns whose values can be imputed. For instance, for the MAE dataset, neither color nor finish will be recognized as categorical.
ah, good catch, makes sense.
What do you think? Should we go ahead with this, or should we only adjust .complete()?
I’m not sure what the disadvantages would be of not doing it vs doing it, but I trust your judgement. And it will fix the problem you mentioned, which is great. I’m just always a bit hesitant when it comes to changing function signatures.
The previous way of detecting categorical columns was not robust. In particular, a single rarely occurring value made the heuristic fail.
Here I propose to instead use a threshold on the relative column cardinality. This is not perfect, but it is statistically robust.
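
To illustrate the idea, here is a minimal sketch of a relative-cardinality check. It is not the code in this pull request: the function names, the `threshold` default of 0.1, and the treatment of `None` as missing are all assumptions made for the example.

```python
from typing import Hashable, Iterable


def relative_cardinality(values: Iterable[Hashable]) -> float:
    """Distinct non-missing values divided by the count of non-missing values."""
    # Assumption: only None marks a missing value in this sketch.
    observed = [v for v in values if v is not None]
    if not observed:
        # No evidence either way; report the maximum so the column
        # is not treated as categorical.
        return 1.0
    return len(set(observed)) / len(observed)


def looks_categorical(values: Iterable[Hashable], threshold: float = 0.1) -> bool:
    """Hypothetical heuristic: categorical if relative cardinality is below threshold."""
    return relative_cardinality(values) < threshold


# A rare value ("teal") barely moves the ratio, so the column still qualifies.
colors = ["red"] * 50 + ["blue"] * 49 + ["teal"]
# Every value is unique, so the ratio is 1.0 and the column does not qualify.
ids = list(range(100))
```

With these inputs, `looks_categorical(colors)` is true (3 distinct values out of 100) and `looks_categorical(ids)` is false. The ratio changes only slightly when one rare value appears, which is the robustness property described above; the right threshold depends on the data and would need tuning.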
-- By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.