CONABIO / antares3

Madmex with open data cube and in python3

model.remove_outliers error #37

palmoreck opened this issue 6 years ago

When max_samples = 0.632 there is an error in Model.remove_outliers:

https://github.com/CONABIO/antares3/blob/develop/madmex/modeling/__init__.py#L92

...
   X, y = Model.remove_outliers(X, y)
  File "/home/madmex_user/.local/lib/python3.5/site-packages/madmex/modeling/__init__.py", line 128, in remove_outliers
    isolation_forest.fit(g[1])
  File "/usr/local/lib/python3.5/dist-packages/sklearn/ensemble/iforest.py", line 201, in fit
    sample_weight=sample_weight)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/ensemble/bagging.py", line 306, in _fit
    raise ValueError("max_samples must be in (0, n_samples]")
ValueError: max_samples must be in (0, n_samples]
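
For reference, the failure is reproducible outside of madmex. With a float max_samples, the scikit-learn version in the traceback converts it to int(max_samples * n_samples) before fitting, so a group with very few rows can end up asking for 0 samples per tree. A minimal sketch of that failure mode (the one-row group is an assumption about what remove_outliers passes in, not something taken from our data):

import numpy as np
from sklearn.ensemble import IsolationForest

# A class group with a single sample: int(0.632 * 1) == 0, which violates
# the (0, n_samples] check in bagging and raises the ValueError shown above.
group = np.random.rand(1, 10)
IsolationForest(max_samples=0.632).fit(group)

With max_samples = 'auto' the same group would use min(256, 1) = 1 sample per tree, which passes the check.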

Check the scikit-learn implementation:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/iforest.py#L235

and the documentation:

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html

http://scikit-learn.org/stable/auto_examples/covariance/plot_outlier_detection.html

@loicdtx, can you check why max_samples = 0.632 was chosen?

Now using max_samples = 'auto'

https://github.com/CONABIO/antares3/commit/0709b6ee423683c768f0b153fdd3363bfe181da6

jequihua commented 6 years ago

Whut? I'm pretty sure I tested this. The documentation states:

max_samples : int or float, optional (default="auto")

    The number of samples to draw from X to train each base estimator.

        If int, then draw max_samples samples.
        If float, then draw max_samples * X.shape[0] samples.
        If "auto", then max_samples = min(256, n_samples).

It's 0.632 because it's a good proportion of the data for bootstrap estimates. Each tree of the isolation forest will be built with 0.632 of the input table rows.
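
As a quick sanity check on where that number comes from (0.632 is roughly 1 - 1/e, the expected fraction of distinct rows in a bootstrap sample of size n):

# Expected share of distinct rows in a bootstrap sample of size n:
# 1 - (1 - 1/n)**n, which tends to 1 - 1/e ~= 0.632 as n grows.
n = 10000
print(1 - (1 - 1.0 / n) ** n)   # ~0.632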

max_samples = 'auto' will build each tree with at most 256 observations, which with our data size seems a tad small, especially given the number of trees (n_estimators). Most of the data will not even be seen. Anyway, I'll give it a check.
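
A rough back-of-the-envelope for that last point (the table size N below is hypothetical, not a figure from this issue; 100 trees is scikit-learn's default n_estimators):

# With max_samples = 'auto' (256 rows per tree, drawn without replacement)
# and 100 trees, the chance that a given row is never seen by any tree is
# about (1 - 256/N)**100. For a hypothetical table of one million rows:
N, trees, per_tree = 1000000, 100, 256
print((1 - per_tree / N) ** trees)   # ~0.97, i.e. most rows never enter any tree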

loicdtx commented 6 years ago

There's a reproducible example in the docstring in case that helps.

from sklearn.datasets import make_classification
from madmex.modeling import BaseModel

# Build a synthetic 5-class dataset and run the outlier filter on it
X, y = make_classification(n_samples=10000, n_features=10,
                           n_classes=5, n_informative=6)
X_clean, y_clean = BaseModel.remove_outliers(X, y)
print('Input shape:', X.shape, 'Output shape:', X_clean.shape)