make heuristic that detects categorical columns robust.

awslabs / datawig

Imputation of missing values in tables.

Apache License 2.0


make heuristic that detects categorical columns robust. #106

Closed TammoR closed 5 years ago

TammoR commented 5 years ago

The previous heuristic for detecting categorical columns was not robust. In particular, a single rarely occurring value could make it fail.

Instead, I propose using a threshold on the relative column cardinality. This is not perfect, but it is statistically robust.
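To illustrate the idea, here is a minimal sketch of a relative-cardinality check. It is not the code from this pull request: the function names, the 0.1 threshold, and the definition of relative cardinality as distinct values divided by observed (non-missing) values are all assumptions made for the example.

```python
import math


def relative_cardinality(values):
    """Distinct non-missing values divided by the number of non-missing values.

    Hypothetical helper for illustration; missing values are None or float NaN.
    """
    observed = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if not observed:
        return float("nan")  # nothing observed, ratio is undefined
    return len(set(observed)) / len(observed)


def is_categorical(values, max_relative_cardinality=0.1):
    """Treat a column as categorical if few distinct values cover many rows.

    The default threshold of 0.1 is an assumed value, not taken from datawig.
    A NaN ratio compares as False, so an empty column is not categorical.
    """
    return relative_cardinality(values) <= max_relative_cardinality


# A column with two frequent values and one value that occurs only once.
colors = ["red"] * 50 + ["blue"] * 49 + ["teal"]
# A column in which every value is unique, such as an identifier.
ids = [f"id-{i}" for i in range(100)]

colors_ratio = relative_cardinality(colors)  # 3 distinct values over 100 rows
ids_ratio = relative_cardinality(ids)        # 100 distinct values over 100 rows
```

With these inputs, the single rare value "teal" only moves the ratio from 2/100 to 3/100, so `is_categorical(colors)` still holds, while the identifier-like column is rejected. The result depends on the ratio of distinct values to rows rather than on the count of any individual value, which is what makes the check insensitive to a lone outlier.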

-- By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

TammoR commented 5 years ago

Totally agree for the feature selection case. I found it problematic when using the .complete() functionality, where the function is used to identify columns whose values can be imputed. For instance, for the MAE dataset, neither color nor finish will be recognized as categorical.

felixbiessmann commented 5 years ago

Ah, good catch, makes sense.


TammoR commented 5 years ago

What do you think? Should we go ahead with this, or should we only adjust .complete()?

felixbiessmann commented 5 years ago

I’m not sure what the trade-offs are between making this change and leaving things as they are, but I trust your judgement. It will also fix the problem you mentioned, which is great. I’m just always a bit hesitant when it comes to changing function signatures.
