Open palmoreck opened 6 years ago
What? I'm pretty sure I tested this. The documentation states:
max_samples : int or float, optional (default="auto")
The number of samples to draw from X to train each base estimator.
If int, then draw max_samples samples.
If float, then draw max_samples * X.shape[0] samples.
If "auto", then max_samples=min(256, n_samples).
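For reference, the rule quoted above can be sketched in plain Python. The helper name resolve_max_samples is hypothetical (it is not part of scikit-learn), and truncating the float case with int() is an assumption, since the docstring only gives the product.

```python
def resolve_max_samples(max_samples, n_samples):
    """Hypothetical helper mirroring the documented max_samples rule."""
    if max_samples == "auto":
        # "auto" caps the per-tree sample size at 256 rows
        return min(256, n_samples)
    if isinstance(max_samples, int):
        # an int is used as an absolute number of rows
        return max_samples
    # a float is a fraction of the rows; int() truncation is an assumption
    return int(max_samples * n_samples)


for setting in ("auto", 1000, 0.5):
    print(setting, "->", resolve_max_samples(setting, 10000))
```

With 10000 rows, "auto" and a float setting give very different per-tree sample sizes, which is the trade-off discussed in this thread.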
It's 0.632 because it's a good proportion of data for bootstrap estimates. Each tree of the isolation forest will be built with a fraction of 0.632 (63.2%) of the input table rows.
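As background on where 0.632 comes from: a bootstrap sample of n rows drawn with replacement contains, in expectation, a share of 1 - (1 - 1/n)^n distinct rows, which tends to 1 - 1/e ≈ 0.632. A minimal sketch using only the standard library (the function names are made up for illustration):

```python
import math
import random


def expected_unique_fraction(n):
    """Expected share of distinct rows in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_unique_fraction(n, seed=0):
    """Draw n row indices with replacement and count the distinct ones."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n


n_rows = 10000  # same size as the docstring example in this thread
print(expected_unique_fraction(n_rows))
print(simulated_unique_fraction(n_rows))
print(1.0 - math.exp(-1.0))  # limit for large n
```

Note that IsolationForest defaults to bootstrap=False, so it samples without replacement; with max_samples=0.632 each tree gets that share of rows directly.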
max_samples = 'auto'
will build each tree with at most 256 observations, which seems a tad small for our data size, especially given the number of trees (n_estimators). Most of the data will not even be seen. Anyway, I'll check it.
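A rough way to quantify how much data goes unseen with 256 observations per tree: assuming each tree draws its rows uniformly, without replacement, and independently of the other trees, a given row is missed by all trees with probability (1 - 256/n)^n_estimators. The sketch below uses n_estimators=100 and two row counts purely as illustrative values; they are assumptions, not figures from this thread.

```python
def fraction_never_sampled(n_samples, n_estimators, samples_per_tree=256):
    """Share of rows that no tree draws, under the assumptions above."""
    per_tree = min(samples_per_tree, n_samples)
    # chance that one given row is missed by one tree, then by all trees
    miss_one_tree = 1.0 - per_tree / n_samples
    return miss_one_tree ** n_estimators


for n_rows in (10_000, 1_000_000):
    share = fraction_never_sampled(n_rows, n_estimators=100)
    print(f"{n_rows} rows: {share:.1%} never sampled")
```

The unseen share grows quickly with the table size, so whether 'auto' is too small depends on how many rows the real input has.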
There's a reproducible example in the docstring in case that helps.
from sklearn.datasets import make_classification
from madmex.modeling import BaseModel
# Synthetic data: 10000 samples, 10 features, 5 classes
X, y = make_classification(n_samples=10000, n_features=10,
                           n_classes=5, n_informative=6)
# Drop the rows flagged as outliers and compare shapes
X_clean, y_clean = BaseModel.remove_outliers(X, y)
print('Input shape:', X.shape, 'Output shape:', X_clean.shape)
When max_samples = 0.632, there's an error in remove_outliers of the model:
https://github.com/CONABIO/antares3/blob/develop/madmex/modeling/__init__.py#L92
Check the implementation:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/iforest.py#L235
and the documentation:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
http://scikit-learn.org/stable/auto_examples/covariance/plot_outlier_detection.html
@loicdtx, can you check why max_samples = 0.632?
Now using
max_samples = 'auto'
https://github.com/CONABIO/antares3/commit/0709b6ee423683c768f0b153fdd3363bfe181da6