dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License
902 stars 256 forks source link

Implement MissingIndicator for SimpleImputer. #494

Open stsievert opened 5 years ago

stsievert commented 5 years ago

Looks like a indicator_ attribute was added to sklearn.impute.SimpleImputer in https://github.com/scikit-learn/scikit-learn/pull/12583. This is causing the sklearn-dev tests to fail in #221.

Here's the documentation for the new attribute:

    indicator_ : :class:`sklearn.impute.MissingIndicator`
        Indicator used to add binary indicators for missing values.
        ``None`` if add_indicator is False.

Here's the message from the sklearn-dev tests on #221:

# test_fit_constant[data1] - tests.test_impute
data = dask.array<array, shape=(10, 4), dtype=float64, chunksize=(5, 4)>

    @pytest.mark.parametrize("data", [X, dX, df, ddf])
    def test_fit_constant(data):
        a = sklearn.impute.SimpleImputer(strategy="constant", fill_value=-999.0)
        b = dask_ml.impute.SimpleImputer(strategy="constant", fill_value=-999.0)

        expected = a.fit_transform(X)
        result = b.fit_transform(data)

>       assert_estimator_equal(a, b)

tests/test_impute.py:42: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

left = SimpleImputer(add_indicator=False, copy=True, fill_value=-999.0,
              missing_values=nan, strategy='constant', verbose=0)
right = SimpleImputer(add_indicator=False, copy=True, fill_value=-999.0,
              missing_values=nan, strategy='constant', verbose=0)
exclude = set(), kwargs = {}, left_attrs = ['indicator_', 'statistics_']
right_attrs = ['statistics_']

    def assert_estimator_equal(left, right, exclude=None, **kwargs):
        """Check that two Estimators are equal

        Parameters
        ----------
        left, right : Estimators
        exclude : str or sequence of str
            attributes to skip in the check
        kwargs : dict
            Passed through to the dask `assert_eq` method.

        """
        left_attrs = [x for x in dir(left) if x.endswith("_") and not x.startswith("_")]
        right_attrs = [x for x in dir(right) if x.endswith("_") and not x.startswith("_")]
        if exclude is None:
            exclude = set()
        elif isinstance(exclude, str):
            exclude = {exclude}
        else:
            exclude = set(exclude)

>       assert (set(left_attrs) - exclude) == set(right_attrs) - exclude
E       AssertionError

dask_ml/utils.py:81: AssertionError
TomAugspurger commented 5 years ago

thanks. I'm pushing up a commit now adding add_indicator to the signature and raising if it's set.

We can implement add_indicator later if there's interest.

link to docs: https://scikit-learn.org/dev/modules/generated/sklearn.impute.SimpleImputer.html

TomAugspurger commented 5 years ago

Repurposed this issue to implement add_indicator.

nikitamalhotra commented 5 years ago

What verbose indicates here? Can you please help me in understanding it?

TomAugspurger commented 5 years ago

It should have the same meaning as in scikit-learn.

On Thu, Oct 3, 2019 at 7:19 AM Nikita Malhotra notifications@github.com wrote:

What verbose indicates here? Can you please help me in understanding it?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/dask/dask-ml/issues/494?email_source=notifications&email_token=AAKAOIWOHDJP2VY7MK5QMK3QMXPN7A5CNFSM4HFRLDFKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEAIAI4I#issuecomment-537920625, or mute the thread https://github.com/notifications/unsubscribe-auth/AAKAOIVUN522FAGW7DLIQO3QMXPN7ANCNFSM4HFRLDFA .

nikitamalhotra commented 5 years ago

Obviously it is the class of Scikit and your answer is not an answer. I wanted to know the purpose of verbose and you did not tell me.

RoyalTS commented 4 years ago

How would you want add_indicator implemented? The same way as in sklearn, including MissingIndicator() and all?

TomAugspurger commented 4 years ago

I think ideally it would be the same as scikit-learn, but I don't quite understand what that would be just from reading the documentation.

On Wed, Oct 28, 2020 at 1:45 PM Tobias Schmidt notifications@github.com wrote:

How would you want add_indicator implemented? The same way as in sklearn, including MissingIndicator() and all?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/dask/dask-ml/issues/494#issuecomment-718135405, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKAOIRKUQUMVF6PM77XSELSNBRGNANCNFSM4HFRLDFA .