Inconsistent UnitTest Results on MacOS

adithyabsk commented 6 years ago

I was running the test cases on my Mac, and some of the tests failed because the results were not what was expected. I was led down this path while running a toy example with the random seed set, which produced different results between runs; I then found that the unit tests were failing on the MacOS platform. For example:

Traceback (most recent call last):
  File ".../auto-sklearn/test/test_pipeline/components/regression/test_base.py", line 97, in test_default_boston_iterative_sparse_fit
    "default_boston_iterative_sparse_places", 7))
AssertionError: -4.3762864606281644e+27 != -5.121789391983587e+27 within 7 places
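
For context on the "within 7 places" wording: unittest's assertAlmostEqual with places=7 passes only when the absolute difference rounds to zero at 7 decimal places, so for values around 1e27 it behaves like an exact-equality check. The sketch below reimplements that rule for illustration and also prints the relative difference. The helper name almost_equal is made up here, and the traceback does not say which of the two numbers is the stored reference, so they are simply called first and second.

```python
# The two values reported in the AssertionError above.
first = -4.3762864606281644e+27
second = -5.121789391983587e+27


def almost_equal(a, b, places=7):
    """Hypothetical helper mirroring unittest's assertAlmostEqual rule."""
    return a == b or round(abs(a - b), places) == 0


within_places = almost_equal(first, second, places=7)
relative_diff = abs(first - second) / abs(first)

print("passes at 7 places:", within_places)
print("relative difference: {:.1%}".format(relative_diff))
```

The relative difference is well beyond last-digit rounding noise, so loosening the number of places alone would probably not make this particular test pass.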

Here is a toy example:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from autosklearn.classification import AutoSklearnClassifier

seed = 0
np.random.seed(seed)
X = np.array([0] * 50 + [1] * 50).reshape((-1, 1))
y = np.array([0] * 50 + [1] * 50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

est = AutoSklearnClassifier(time_left_for_this_task=20, seed=seed)
est.fit(X, y)
print(est.predict_proba(X_test))

My output between runs would be variable. For example:

[[0.95693356 0.04306644]
 [0.04306707 0.95693293]
 [0.95693356 0.04306644]
 [0.04306707 0.95693293]
 [0.04306707 0.95693293]
 [0.04306707 0.95693293]
 [0.95693356 0.04306644]
 [0.04306707 0.95693293]
 [0.04306707 0.95693293]
 [0.04306707 0.95693293]]

--and--

[[0.94460547 0.05539453]
 [0.05342438 0.94657562]
 [0.94460547 0.05539453]
 [0.05342438 0.94657562]
 [0.05342438 0.94657562]
 [0.05342438 0.94657562]
 [0.94460547 0.05539453]
 [0.05342438 0.94657562]
 [0.05342438 0.94657562]
 [0.05342438 0.94657562]]

I would get a variable number of these errors, specifically in the regression and classification unit test sections. Do you have any idea what might be causing this?

Relevant versioning info: MacOS 10.13.6, Python 3.6, sklearn 0.19.1, autosklearn 0.4.0

mfeurer commented 6 years ago

Unfortunately, I don't know what's happening here. As there is no fast and open CI system for MacOS, we cannot run the unit tests there and therefore cannot support it. However, as long as only the performance comparisons are off by a bit, it should not be a big deal.

Just out of curiosity: is this a system Python or did you install it with Anaconda?

adithyabsk commented 6 years ago

It is python. The reproducibility of results is quite important for my use case, so I will see if I can figure it out.

adithyabsk commented 6 years ago

Also it seems that the slow startup times for MacOS Travis CI builds might have been solved. https://github.com/travis-ci/travis-ci/issues/7304

adithyabsk commented 6 years ago

As a follow-up, I've found that even on Linux systems the above toy example seems to produce differing results. Is there any way to limit autosklearn by number of runs or iterations to get deterministic results? @mfeurer

mfeurer commented 6 years ago

Please excuse my initial, not very helpful answer. What you're seeing here is most likely some small variation due to time limits and random effects introduced by them. To get rid of such effects, you need to remove all time limits and run Auto-sklearn for a specific number of iterations instead. Please see https://github.com/automl/auto-sklearn/issues/451 for an example.
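
To illustrate why a time limit can break determinism even with a fixed seed, here is a toy random search. It is only a stand-in written for this explanation, not Auto-sklearn or SMAC code, and the names toy_search, run_limit and time_limit are made up. With a run-count budget the same seed always does the same work; with a time budget the number of completed runs depends on how fast the machine happened to be.

```python
import random
import time


def toy_search(seed, run_limit=None, time_limit=None):
    """Toy random search; a stand-in, not Auto-sklearn or SMAC code."""
    if run_limit is None and time_limit is None:
        raise ValueError("give run_limit or time_limit")
    rng = random.Random(seed)
    start = time.perf_counter()
    best, runs = float("inf"), 0
    while True:
        if run_limit is not None and runs >= run_limit:
            break
        if time_limit is not None and time.perf_counter() - start >= time_limit:
            break
        best = min(best, rng.random())  # "evaluate" one candidate
        runs += 1
    return best, runs


# Same seed and a fixed run count: both calls do identical work.
by_runs_a = toy_search(seed=0, run_limit=5)
by_runs_b = toy_search(seed=0, run_limit=5)
print("run-limited results equal:", by_runs_a == by_runs_b)

# Same seed and a time budget: the number of completed runs can differ
# between calls, and the best value found can differ with it.
by_time_a = toy_search(seed=0, time_limit=0.01)
by_time_b = toy_search(seed=0, time_limit=0.01)
print("time-limited run counts:", by_time_a[1], by_time_b[1])
```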

adithyabsk commented 6 years ago

This looks like what I need, thank you!

adithyabsk commented 6 years ago

Hmm.... so I followed the instructions from the cited issue and it seems that I am still getting results that vary. To be absolutely certain that it wasn't something related to my testing setup (linux system), I pulled the git repo and ran the tests on master. All of the test cases passed. I have also listed the modified toy example below and some sample results.

import numpy as np
from sklearn.model_selection import train_test_split
from autosklearn.classification import AutoSklearnClassifier

seed = 0
np.random.seed(seed)
X = np.array([0] * 50 + [1] * 50).reshape((-1, 1))
y = np.array([0] * 50 + [1] * 50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

est = AutoSklearnClassifier(time_left_for_this_task=40,
                            ensemble_size=0,
                            seed=seed,
                            include_preprocessors=['no_preprocessing'],
                            include_estimators=["liblinear_svc", ],
                            smac_scenario_args={'runcount_limit': 5})

est.fit(X_train, y_train)
est.fit_ensemble(y_train, ensemble_size=50)
print(est.predict_proba(X_test))
# print(est.show_models())

The outputs:

[[0.71628886 0.28371114]
 [0.27903489 0.72096511]
 [0.71628886 0.28371114]
 [0.27903489 0.72096511]
 [0.27903489 0.72096511]
 [0.27903489 0.72096511]
 [0.71628886 0.28371114]
 [0.27903489 0.72096511]
 [0.27903489 0.72096511]
 [0.27903489 0.72096511]]

--and--

[[0.71632133 0.28367867]
 [0.27900239 0.72099761]
 [0.71632133 0.28367867]
 [0.27900239 0.72099761]
 [0.27900239 0.72099761]
 [0.27900239 0.72099761]
 [0.71632133 0.28367867]
 [0.27900239 0.72099761]
 [0.27900239 0.72099761]
 [0.27900239 0.72099761]]
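
To put a number on how far apart these two runs are, here is a small sketch that compares the first row of each output. The variable names are made up for illustration, and the 1e-3 tolerance is an arbitrary choice, not a threshold used by Auto-sklearn.

```python
import math

# First row of each output above.
run_a = [0.71628886, 0.28371114]
run_b = [0.71632133, 0.28367867]

# Exact comparison: any difference at all counts as a mismatch.
identical = run_a == run_b

# Largest absolute difference between corresponding probabilities.
max_abs_diff = max(abs(a - b) for a, b in zip(run_a, run_b))

# Tolerant comparison with an arbitrary absolute tolerance of 1e-3.
close_at_1e3 = all(
    math.isclose(a, b, abs_tol=1e-3) for a, b in zip(run_a, run_b)
)

print("identical:", identical)
print("max abs diff: {:.2e}".format(max_abs_diff))
print("close at 1e-3:", close_at_1e3)
```

So the runs agree to about four decimal places but are not bit-for-bit reproducible, which is the problem for my use case.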

adithyabsk commented 6 years ago

@mfeurer It seems this might be two separate issues: one with test cases failing on the Mac and one with reproducibility on Linux systems. Should I split these into two issues?

Also, I'm not sure if this helps with debugging, but it seems that even with a fixed number of runs and a fixed seed, numpy's RandomState.randint is called a differing number of times between runs. I overrode it using the following snippet, which I inserted into the code above. fit_ensemble seems to consistently call it 50 times, whereas the overall number of calls during the fit method itself varies from 300 to 500.

# snippet
from forbiddenfruit import curse
import random

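i = 0  # global counter of calls to RandomState.randint

def randint(self, low, high=None, size=None, dtype='l'):
    """Stand-in for RandomState.randint that counts calls (ignores size and dtype)."""
    global i
    # curframe = inspect.currentframe()
    # calframe = inspect.getouterframes(curframe, 2)
    # i+=calframe[1][3]+'\n'
    # numpy's upper bound is exclusive, random.randint's is inclusive
    if high is not None:
        val = random.randint(low, high - 1)
    else:
        val = random.randint(0, low - 1)
    i += 1  # '{}\n'.format(val)
    return val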
    # val = low if high is not None else low-1
    # if size is not None: 
    #     return np.full(size, val).astype(dtype)
    # else:
    #     return val

curse(np.random.RandomState, 'randint', randint)

mfeurer commented 6 years ago

Thanks for digging into that. I expected the following script to be deterministic, but it turns out it isn't:

import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

import autosklearn.classification

def main():
    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = \
        sklearn.model_selection.train_test_split(X, y, random_state=1)

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=1000000000,
        per_run_time_limit=86400,
        ml_memory_limit=8000,
        tmp_folder='/tmp/autosklearn_holdout_example_tmp',
        output_folder='/tmp/autosklearn_holdout_example_out',
        disable_evaluator_output=False,
        smac_scenario_args={
            'runcount_limit': 5,
            'deterministic': 'true',
            'intensification_percentage': 0.000000001
        },
        delete_tmp_folder_after_terminate=False,
        ensemble_size=0,
        initial_configurations_via_metalearning=0
    )
    automl.fit(X_train, y_train, dataset_name='digits')
    automl.fit_ensemble(y_train, ensemble_size=1)

    # Print the final ensemble constructed by auto-sklearn.
    print(automl.show_models())
    predictions = automl.predict(X_test)
    # Print statistics about the auto-sklearn run such as number of
    # iterations, number of models failed with a time out.
    print(automl.sprint_statistics())
    print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))

if __name__ == '__main__':
    main()

I just had a brief look at the code and there is at least one issue in autosklearn.ensembles.ensemble_selection. I can't find the underlying issue at the moment, so you need to wait a bit or go into the code yourself to figure out why the number of calls differs - maybe you could print the traceback and see where the additional calls originate.
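
A minimal sketch of that idea, recording which function each random call comes from, could look like the following. It uses the standard library's random.Random as a stand-in so that it runs on its own; hooking numpy's RandomState would still need something like the forbiddenfruit approach from the earlier snippet. The names record_call_sites, fit_step and ensemble_step are made up for illustration.

```python
import collections
import functools
import random
import traceback

# Maps the name of the calling function to the number of random calls it made.
call_sites = collections.Counter()


def record_call_sites(func):
    """Wrap func so that every call is attributed to its caller."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # extract_stack(limit=2) returns [caller, wrapper], oldest first.
        caller = traceback.extract_stack(limit=2)[0]
        call_sites[caller.name] += 1
        return func(*args, **kwargs)
    return wrapper


rng = random.Random(0)
rng.randint = record_call_sites(rng.randint)


def fit_step():
    values = []
    for _ in range(3):
        values.append(rng.randint(0, 9))
    return values


def ensemble_step():
    return rng.randint(0, 9)


fit_step()
ensemble_step()
print(call_sites)
```

Comparing the resulting counters from two runs should show which caller is responsible for the extra calls.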

adithyabsk commented 5 years ago

@mfeurer Based on the fixes for #517, I tested my snippet again and continued to get different results, yet when I ran your snippet I began to get consistent results. I started to pare down your snippet to the essentials, and it seems that the program hangs if I allow the fitting processes to automatically construct the ensemble model, which I am quite unsure of as to why (is it because of the passing of smac args?). Is it possible to have auto-sklearn build the ensemble and produce consistent results in one go?

mfeurer commented 5 years ago

> the program hangs if I allow the fitting processes to automatically construct the ensemble model, which I am quite unsure of as to why

That is surprising and I don't know why this would/should happen.

> Is it possible to have auto-sklearn build the ensemble and produce consistent results in one go?

I expected this to happen with the snippet. Does this issue happen with your specific dataset or a simple example dataset?

adithyabsk commented 5 years ago

Both datasets, though it may be a result of my misuse of the SMAC args, as it doesn't seem to be new behavior (0.4.2 produces the same freezing). The following hangs for me in both versions (I let it run for about 10 minutes each time, just to be certain). Note that I commented out the ensembling portions of the setup and execution code.

import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

import autosklearn.classification

def main():
    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = \
        sklearn.model_selection.train_test_split(X, y, random_state=1)

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=1000000000,
        per_run_time_limit=86400,
        ml_memory_limit=8000,
        tmp_folder='/tmp/autosklearn_holdout_example_tmp',
        output_folder='/tmp/autosklearn_holdout_example_out',
        disable_evaluator_output=False,
        smac_scenario_args={
            'runcount_limit': 5,
            'deterministic': 'true',
            'intensification_percentage': 0.000000001
        },
        delete_tmp_folder_after_terminate=True,
        # ensemble_size=0,
        # initial_configurations_via_metalearning=0
    )
    automl.fit(X_train, y_train, dataset_name='digits')
    # automl.fit_ensemble(y_train, ensemble_size=1)

    # Print the final ensemble constructed by auto-sklearn.
    print(automl.show_models())
    predictions = automl.predict(X_test)
    # Print statistics about the auto-sklearn run such as number of
    # iterations, number of models failed with a time out.
    print(automl.sprint_statistics())
    print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))

if __name__ == '__main__':
    main()

mfeurer commented 5 years ago

Thanks for sharing the script. Indeed, there is currently an issue because of the way time_left_for_this_task has to be specified, which results in the ensemble script not shutting down. I'm afraid that for now you have to either build the ensemble afterwards (by uncommenting the fit_ensemble call at the end) or submit a patch to Auto-sklearn which fixes this behavior.

mfeurer commented 3 years ago

Closing this as a) we currently still don't support OSX and b) the issue of having a high runtime while giving the number of iterations was fixed. Please open a new issue if you're still having problems with Auto-sklearn.

automl / auto-sklearn

Inconsistent UnitTest Results on MacOS #514