dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

regression metrics raise exception for dask.dataframe.core.Series #756

Open jameslamb opened 3 years ago

jameslamb commented 3 years ago

What happened:

I tried to pass columns from a Dask DataFrame into regression metrics like mean_squared_error(), and this raised errors like

AttributeError: 'Scalar' object has no attribute 'mean'

What you expected to happen:

I expected that I'd be able to pass a column from a Dask DataFrame (which has type dask.dataframe.core.Series) into any of the metrics functions.

Minimal Complete Verifiable Example:

import dask
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

cluster = LocalCluster()
client = Client(cluster)
cluster

ddf = dask.datasets.timeseries()

from dask_ml.metrics import mean_squared_error

mean_squared_error(
    y_true=ddf["y"],
    y_pred=ddf["y"]
)

Anything else we need to know?:

I looked around and couldn't find documentation that would lead me to think this wouldn't work, or other issues that seemed related.

Environment:

Thanks for your time and consideration

TomAugspurger commented 3 years ago

Thanks for the report. I suspect that we could check the ndim of the inputs somewhere before https://github.com/dask/dask-ml/blob/master/dask_ml/metrics/regression.py#L51, and skip that second .mean() if we see that the input ndim is 1?
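A minimal sketch of the ndim check suggested above, written against NumPy so it runs without a cluster. The function name mse_with_ndim_check is hypothetical and this is not the dask-ml implementation; it only illustrates skipping the second .mean() for 1-D inputs, which is the call that fails on the 'Scalar' object in the report.

```python
import numpy as np


def mse_with_ndim_check(y_true, y_pred):
    """Hypothetical helper (not dask-ml code): MSE with a guard on input ndim."""
    # Reducing over axis 0 leaves one error per output column for 2-D inputs,
    # and a single value for 1-D inputs.
    output_errors = ((y_pred - y_true) ** 2).mean(axis=0)
    if y_true.ndim == 1:
        # 1-D input: the first reduction is already the final result,
        # so the second .mean() is skipped.
        return output_errors
    # 2-D input: average the per-column errors uniformly.
    return output_errors.mean()


one_d = mse_with_ndim_check(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0]))
two_d = mse_with_ndim_check(np.zeros((2, 2)), np.array([[1.0, 2.0], [1.0, 2.0]]))
```

The guard reads ndim from y_true only, on the assumption that both inputs have already been validated to have the same shape.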

On Nov 16, 2020, at 11:19 PM, James Lamb wrote:

Environment:

Dask version (output of pip freeze | grep -E "dask|distributed"):

dask==2.30.0
dask-cloudprovider==0.4.1
dask-glm==0.2.0
dask-ml==1.7.0
distributed==2.30.1

Python version: 3.8.3.final.0
Operating System: macOS 10.14.6
Install method (conda, pip, source): pip

jameslamb commented 3 years ago

Is it fair to say that supporting dd.Series inputs is something that I should expect to work in these metrics functions?

Looking at it again, I see that the type hint on these functions is ArrayLike, which is defined as

ArrayLike = TypeVar("ArrayLike", Array, np.ndarray) 

https://github.com/dask/dask-ml/blob/c55c1898bcf05ccd0c572d7df793d938a3b7e9af/dask_ml/_typing.py#L9

If it's expected to work with Series, or something you'd welcome, I'd be happy to submit a PR to add that for metrics.
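Until Series inputs are supported, one possible workaround given the ArrayLike hint above is to convert a Series to a Dask array before calling the metric. The sketch below assumes the input exposes to_dask_array, as Dask Series and DataFrames do; the helper name to_array_like is hypothetical and not part of dask-ml.

```python
def to_array_like(y):
    """Hypothetical helper: convert Series-like inputs to a Dask array.

    Anything without a to_dask_array method (for example a Dask array or a
    NumPy array) is returned unchanged.
    """
    to_dask_array = getattr(y, "to_dask_array", None)
    if callable(to_dask_array):
        # lengths=True asks Dask to compute the chunk sizes up front, so the
        # resulting array has known chunks.
        return to_dask_array(lengths=True)
    return y
```

With the example from the report, this would be called as mean_squared_error(y_true=to_array_like(ddf["y"]), y_pred=to_array_like(ddf["y"])), which hands the metric the Dask arrays its type hints describe. Note that lengths=True triggers a computation to find the partition lengths.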

TomAugspurger commented 3 years ago

Yeah, we’d definitely like to support Series inputs here.
