dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Unpredictable results when some NaNs included in input #947

Open AlexeyPechnikov opened 1 year ago


NaN values can't be used to fit a model and should usually be excluded. However, dask_ml accepts NaNs without raising an error and simply returns wrong output:

import numpy as np
import dask.array
from sklearn.pipeline import make_pipeline
from dask_ml.preprocessing import StandardScaler
from dask_ml.linear_model import LinearRegression

X = 1.*np.array([[1, 1], [1, 2], [2, 1], [2, 2]])
y = np.array([ 6.,  8.,  9., np.nan])

# Fit with the NaN left in y: no error is raised.
reg = make_pipeline(StandardScaler(), LinearRegression())
reg.fit(X, y)
print(reg.predict(np.array([[3., 5.]])))
[10.9140625]

# Fit only on the rows where y is not NaN.
reg = make_pipeline(StandardScaler(), LinearRegression())
reg.fit(X[~dask.array.isnan(y)], y[~dask.array.isnan(y)])
print(reg.predict(np.array([[3., 5.]])))
[15.54899511]

Sure, reg.fit(X[~dask.array.isnan(y)], y[~dask.array.isnan(y)]) is the correct way, but it is extremely slow on big datasets because the boolean-mask selection y[~dask.array.isnan(y)] (or y[~np.isnan(y)]) is slow to compute. It would be nice if dask_ml could ignore NaNs itself, but at the moment it accepts them and returns a wrong result. dask_ml does provide SimpleImputer(strategy='mean'), but filling gaps in multidimensional data with a 1D mean value is a terrible idea.

Is there a scalable way to simply exclude NaNs in dask_ml functions?
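For reference, here is a minimal sketch of the filtering step done on dask arrays rather than NumPy arrays, assuming X and y are chunked along the first axis. The helper name drop_nan_targets is hypothetical and not part of dask_ml. The mask is built once and reused for both arrays. Boolean indexing leaves the chunk sizes unknown, so compute_chunk_sizes() is called to resolve them, and that call triggers a computation of the mask. Whether this is fast enough on large datasets has not been measured here.

```python
import numpy as np
import dask.array as da


def drop_nan_targets(X, y):
    """Hypothetical helper: keep only the rows where y is not NaN."""
    mask = ~da.isnan(y)          # lazy boolean mask, built once
    X_kept = X[mask]             # lazy selection along the first axis
    y_kept = y[mask]
    # Boolean indexing leaves chunk sizes unknown (NaN); resolve them
    # so that downstream code can rely on concrete shapes.
    return X_kept.compute_chunk_sizes(), y_kept.compute_chunk_sizes()


# Same toy data as in the report, wrapped in dask arrays (chunking is an assumption).
X = da.from_array(1. * np.array([[1, 1], [1, 2], [2, 1], [2, 2]]), chunks=(2, 2))
y = da.from_array(np.array([6., 8., 9., np.nan]), chunks=2)

X_clean, y_clean = drop_nan_targets(X, y)
print(X_clean.shape, y_clean.shape)
```

The cleaned arrays can then be passed to reg.fit(X_clean, y_clean) in place of the NumPy-filtered inputs.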