dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Misleading error when non-numeric inputs fed to SimpleImputer #567

Open Bonesters opened 4 years ago

Bonesters commented 4 years ago

Currently, if a dataframe containing non-numeric values is passed to SimpleImputer with the strategy set to mean or median, it raises an error like ValueError: Length of passed values is 2, index implies 3. This happens because the mean and quantile functions automatically exclude non-numeric columns, so the computed statistics have fewer entries than the dataframe has columns. Passing numeric_only=False to the mean and quantile functions would produce a more direct error, such as TypeError: could not convert string to float: 'xyz' for mean, and TypeError: can't multiply sequence by non-int of type 'float' for median.
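To illustrate the mechanism, here is a minimal sketch using plain pandas rather than dask-ml itself. The column names, the describe_error helper, and the step that rebuilds a Series against the full column index are assumptions made for illustration, not dask-ml's actual code. The exact error messages also vary between pandas versions, so the sketch only records the exception types.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two numeric columns and one string column.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, np.nan],
    "c": pd.Series(["x", "y", "z"], dtype=object),
})

# With numeric_only=True the string column is silently dropped,
# so only two statistics come back for a three-column frame.
means = df.mean(numeric_only=True)
print(len(means), "statistics for", len(df.columns), "columns")


def describe_error(func):
    """Call func and return the name of the exception it raises, or None."""
    try:
        func()
    except Exception as exc:
        return type(exc).__name__
    return None


# Assumed for illustration: aligning the two statistics with all three
# column labels fails with a length mismatch unrelated to the real cause.
mismatch_error = describe_error(
    lambda: pd.Series(means.to_numpy(), index=df.columns)
)

# With numeric_only=False the string column is not skipped, so the
# failure points at the non-numeric data instead.
strict_error = describe_error(lambda: df.mean(numeric_only=False))

print("length mismatch:", mismatch_error)
print("numeric_only=False:", strict_error)
```

The first failure only mentions lengths, which hides the fact that a string column was the cause. The second is raised while computing the statistic itself, so it names the offending data.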

stsievert commented 4 years ago

Thanks for the bug report @Bonesters!

It sounds like you know how to fix this issue. It'd be great to have a pull request for this!

If not, could you provide a minimal working example? Here are some tips: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

Bonesters commented 4 years ago

The pull request has been created. My initial idea didn't work, but I came up with a solution that should match the behavior of sklearn's SimpleImputer in addition to giving a clearer error.