automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

TimeSeries Split #501

Open kanchanapadmanabhan opened 6 years ago

kanchanapadmanabhan commented 6 years ago

The problem I want to use auto-sklearn on is a time series. Can we modify auto-sklearn to include cross-validation for time series?

adithyabsk commented 6 years ago

From my experience, cross-validation is usually not done with time series data; something akin to walk-forward analysis is used instead. Though, I'm not too sure how that would be implemented here. https://en.wikipedia.org/wiki/Walk_forward_optimization
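
For reference, here is a minimal sketch of what such ordered splits look like with scikit-learn's TimeSeriesSplit (an expanding-window relative of walk-forward analysis):

import numpy as np
import sklearn.model_selection

X = np.arange(12).reshape(-1, 1)  # 12 ordered observations
for train_idx, test_idx in sklearn.model_selection.TimeSeriesSplit(n_splits=3).split(X):
    # The training window always precedes the test window in time and grows each split.
    print("train:", train_idx, "test:", test_idx)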

kanchanapadmanabhan commented 6 years ago

@adithyabsk When I said time-series cross-validation, that's kind of what I meant: we use a sliding window to train and test over the time series. I work on retail data, and all of our datasets are time series; we need to respect that ordering to be able to train our models.

mfeurer commented 6 years ago

This should work by passing an instance of sklearn.model_selection.TimeSeriesSplit to the AutoSklearnClassifier. I haven't tried this yet, so please let me know if it works.
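
For concreteness, a minimal sketch of that suggestion (untested here; the argument names mirror the full example below):

import sklearn.model_selection
import autosklearn.classification

# Sketch only: hand the splitter to auto-sklearn as the resampling strategy.
automl = autosklearn.classification.AutoSklearnClassifier(
    resampling_strategy=sklearn.model_selection.TimeSeriesSplit,
    resampling_strategy_arguments={'folds': 5},
)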

vincentmele commented 5 years ago

This doesn't work with TimeSeriesSplit. It does work with other sklearn.model_selection choices, such as KFold and StratifiedKFold, as the resampling_strategy.

Note the AssertionError: (1123, 1347) raised from line 300 of evaluation/abstract_evaluator.py.

import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

import autosklearn.classification

import shutil

def main():

    # Delete the temp folders
    shutil.rmtree('/tmp/autosklearn_cv_example_tmp',ignore_errors=True)
    shutil.rmtree('/tmp/autosklearn_cv_example_out',ignore_errors=True)

    X, y = sklearn.datasets.load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = \
        sklearn.model_selection.train_test_split(X, y, random_state=1)

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,
        per_run_time_limit=30,
        tmp_folder='/tmp/autosklearn_cv_example_tmp',
        output_folder='/tmp/autosklearn_cv_example_out',
        delete_tmp_folder_after_terminate=False,
        #resampling_strategy='cv',
        resampling_strategy=sklearn.model_selection.TimeSeriesSplit,
        resampling_strategy_arguments={'folds': 5},
    )

    # fit() changes the data in place, but refit needs the original data. We
    # therefore copy the data. In practice, one should reload the data
    automl.fit(X_train.copy(), y_train.copy(), dataset_name='digits')
    # During fit(), models are fit on individual cross-validation folds. To use
    # all available data, we call refit() which trains all models in the
    # final ensemble on the whole dataset.
    automl.refit(X_train.copy(), y_train.copy())

    print(automl.show_models())

    predictions = automl.predict(X_test)
    print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))

if __name__ == '__main__':
    main()

results in:

[ERROR] [2018-12-31 19:05:01,043:AutoML(1):digits] Error creating dummy predictions: {'traceback': 'Traceback (most recent call last):
  File "/home/vince/anaconda3/lib/python3.6/site-packages/autosklearn/evaluation/__init__.py", line 30, in fit_predict_try_except_decorator
    return ta(queue=queue, **kwargs)
  File "/home/vince/anaconda3/lib/python3.6/site-packages/autosklearn/evaluation/train_evaluator.py", line 806, in eval_cv
    evaluator.fit_predict_and_loss()
  File "/home/vince/anaconda3/lib/python3.6/site-packages/autosklearn/evaluation/train_evaluator.py", line 250, in fit_predict_and_loss
    final_call=True
  File "/home/vince/anaconda3/lib/python3.6/site-packages/autosklearn/evaluation/abstract_evaluator.py", line 243, in finish_up
    train_pred, valid_pred, test_pred,
  File "/home/vince/anaconda3/lib/python3.6/site-packages/autosklearn/evaluation/abstract_evaluator.py", line 300, in calculate_auxiliary_losses
    Y_train_pred.shape[0],
AssertionError: (1123, 1347)
', 'error': 'AssertionError((1123, 1347),)', 'configuration_origin': 'DUMMY'}
tdrxy commented 5 years ago

I also have issues running with TimeSeriesSplit, where every attempted model run ends in:

[ERROR] [2019-02-01 16:47:36,728:AutoML(1):prediction] Error creating dummy predictions: {'traceback': 'Traceback (most recent call last):
  File "/home/tdrxy/anaconda3/envs/mlpy3/lib/python3.7/site-packages/autosklearn/evaluation/__init__.py", line 30, in fit_predict_try_except_decorator
    return ta(queue=queue, **kwargs)
  File "/home/tdrxy/anaconda3/envs/mlpy3/lib/python3.7/site-packages/autosklearn/evaluation/train_evaluator.py", line 806, in eval_cv
    evaluator.fit_predict_and_loss()
  File "/home/tdrxy/anaconda3/envs/mlpy3/lib/python3.7/site-packages/autosklearn/evaluation/train_evaluator.py", line 250, in fit_predict_and_loss
    final_call=True
  File "/home/tdrxy/anaconda3/envs/mlpy3/lib/python3.7/site-packages/autosklearn/evaluation/abstract_evaluator.py", line 243, in finish_up
    train_pred, valid_pred, test_pred,
  File "/home/tdrxy/anaconda3/envs/mlpy3/lib/python3.7/site-packages/autosklearn/evaluation/abstract_evaluator.py", line 300, in calculate_auxiliary_losses
    Y_train_pred.shape[0],
AssertionError: (629, 754)
', 'error': 'AssertionError((629, 754))', 'configuration_origin': 'DUMMY'}

mfeurer commented 5 years ago

Thanks for reporting this. I'm afraid that we won't have the time to look into this ourselves. Therefore, any debugging and/or pull requests to fix this are warmly welcome.

janvanrijn commented 5 years ago

Hi @mfeurer, @wangcan04 is a student of Holger at LIACS and needs this functionality for her next paper. Therefore, we want to author a pull request that adds it. I see this has the label good first issue, so it must not be too difficult :)

From my first look, it seems that the problem is in calculating the train score. Some basic questions while looking at this: is there any specific reason why auto-sklearn wants to do this, and why does it enforce it? What would be the easiest way to remove this programmatically?

janvanrijn commented 5 years ago

When I bluntly remove the lines responsible for calculating the train score, things still work for most resampling strategies. However, for sklearn.model_selection.TimeSeriesSplit a new problem pops up:

runfile('/vol/home/rijnjnvan/projects/openml-python/test_autosklearn.py', wdir='/vol/home/rijnjnvan/projects/openml-python')
/vol/home/rijnjnvan/miniconda3/envs/auto-sklearn/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:17: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Mapping, defaultdict
/vol/home/rijnjnvan/miniconda3/envs/auto-sklearn/lib/python3.7/site-packages/pyparsing.py:2910: FutureWarning: Possible set intersection at position 3
  self.re = re.compile( self.reString )
/vol/home/rijnjnvan/miniconda3/envs/auto-sklearn/lib/python3.7/site-packages/autosklearn/evaluation/train_evaluator.py:197: RuntimeWarning: Mean of empty slice
  Y_train_pred = np.nanmean(Y_train_pred_full, axis=0)
[ERROR] [2019-02-27 15:58:43,478:AutoML(1):digits] Error creating dummy predictions: {'error': 'Result queue is empty', 'configuration_origin': 'DUMMY'} 
/vol/home/rijnjnvan/miniconda3/envs/auto-sklearn/lib/python3.7/site-packages/autosklearn/evaluation/train_evaluator.py:197: RuntimeWarning: Mean of empty slice
  Y_train_pred = np.nanmean(Y_train_pred_full, axis=0)
Traceback (most recent call last):

  File "<ipython-input-1-381c6806411e>", line 1, in <module>
    runfile('/vol/home/rijnjnvan/projects/openml-python/test_autosklearn.py', wdir='/vol/home/rijnjnvan/projects/openml-python')

  File "/vol/home/rijnjnvan/miniconda3/envs/auto-sklearn/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 786, in runfile
    execfile(filename, namespace)

  File "/vol/home/rijnjnvan/miniconda3/envs/auto-sklearn/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/vol/home/rijnjnvan/projects/openml-python/test_autosklearn.py", line 46, in <module>
    main()

  File "/vol/home/rijnjnvan/projects/openml-python/test_autosklearn.py", line 33, in main
    automl.fit(X_train.copy(), y_train.copy(), dataset_name='digits')

  File "/vol/home/rijnjnvan/miniconda3/envs/auto-sklearn/lib/python3.7/site-packages/autosklearn/estimators.py", line 653, in fit
    dataset_name=dataset_name,

  File "/vol/home/rijnjnvan/miniconda3/envs/auto-sklearn/lib/python3.7/site-packages/autosklearn/estimators.py", line 326, in fit
    self._automl[0].fit(**kwargs)

  File "/vol/home/rijnjnvan/miniconda3/envs/auto-sklearn/lib/python3.7/site-packages/autosklearn/automl.py", line 989, in fit
    load_models=load_models,

  File "/vol/home/rijnjnvan/miniconda3/envs/auto-sklearn/lib/python3.7/site-packages/autosklearn/automl.py", line 208, in fit
    only_return_configuration_space=only_return_configuration_space,

  File "/vol/home/rijnjnvan/miniconda3/envs/auto-sklearn/lib/python3.7/site-packages/autosklearn/automl.py", line 487, in _fit
    _proc_smac.run_smbo()

  File "/vol/home/rijnjnvan/miniconda3/envs/auto-sklearn/lib/python3.7/site-packages/autosklearn/smbo.py", line 502, in run_smbo
    smac.optimize()

  File "/vol/home/rijnjnvan/miniconda3/envs/auto-sklearn/lib/python3.7/site-packages/smac/facade/smac_facade.py", line 400, in optimize
    incumbent = self.solver.run()

  File "/vol/home/rijnjnvan/miniconda3/envs/auto-sklearn/lib/python3.7/site-packages/smac/optimizer/smbo.py", line 180, in run
    challengers = self.choose_next(X, Y)

  File "/vol/home/rijnjnvan/miniconda3/envs/auto-sklearn/lib/python3.7/site-packages/smac/optimizer/smbo.py", line 248, in choose_next
    incumbent_value = self.runhistory.get_cost(self.incumbent)

  File "/vol/home/rijnjnvan/miniconda3/envs/auto-sklearn/lib/python3.7/site-packages/smac/runhistory/runhistory.py", line 271, in get_cost
    config_id = self.config_ids[config]

KeyError: None

It seems to go wrong in this block of code:

    def choose_next(self, X: np.ndarray, Y: np.ndarray,
                    incumbent_value: float=None):
        """Choose next candidate solution with Bayesian optimization. The 
        suggested configurations depend on the argument ``acq_optimizer`` to
        the ``SMBO`` class.

        Parameters
        ----------
        X : (N, D) numpy array
            Each row contains a configuration and one set of
            instance features.
        Y : (N, O) numpy array
            The function values for each configuration instance pair.
        incumbent_value: float
            Cost value of incumbent configuration
            (required for acquisition function);
            if not given, it will be inferred from runhistory;
            if not given and runhistory is empty,
            it will raise a ValueError

        Returns
        -------
        Iterable
        """
        if X.shape[0] == 0:
            # Only return a single point to avoid an overly high number of
            # random search iterations
            return self._random_search.maximize(
                runhistory=self.runhistory, stats=self.stats, num_points=1
            )

        self.model.train(X, Y)

        if incumbent_value is None:
            if self.runhistory.empty():
                raise ValueError("Runhistory is empty and the cost value of "
                                 "the incumbent is unknown.")
            incumbent_value = self.runhistory.get_cost(self.incumbent)

        self.acquisition_func.update(model=self.model, eta=incumbent_value)

        challengers = self.acq_optimizer.maximize(
            self.runhistory, self.stats, 5000
        )
        return challengers

This happens in the SMAC package. Apparently, this function has as a precondition that an incumbent is set; however, it isn't. Which part of the program is responsible for setting the incumbent? Surely not the resampling strategy. Where exactly is the incumbent set?

mfeurer commented 5 years ago

The training error is computed for logging reasons.

The error you are facing is because of the first run failing (which is a bug of the SMAC version used in Auto-sklearn). Sorry for this, I did not see this coming.

I suggest you run the code via invoking the unit tests in test.test_evaluation, potentially in test_train_evaluator.py, then you can focus on the evaluator.

janvanrijn commented 5 years ago

Hi @mfeurer, today @wangcan04 and I had another look at this, and there are still problems. Would you mind clarifying your answer a bit more?

The training error is computed for logging reasons.

Cool, thanks. I can safely disable this I guess.

The error you are facing is because of the first run failing (which is a bug of the SMAC version used in Auto-sklearn). Sorry for this, I did not see this coming.

Any chance this behavior will change in the near future? Will the SMAC version be updated? Can we update to a newer version ourselves (locally), or will that open up other problems? We can always try, of course. Which SMAC version fixed this bug?

I suggest you run the code via invoking the unit tests in test.test_evaluation, potentially in test_train_evaluator.py, then you can focus on the evaluator.

Sorry, I do not quite follow. Do you suggest adding this as a unit test? In what way would this allow us to focus on the evaluator?

mfeurer commented 5 years ago

I can safely disable this I guess.

Yes, you can safely disable it.

Any chance this behavior will change in the near future?

No.

Will the SMAC version be updated?

Yes, but this will take time. Updating scikit-learn will come first.

Can we update to a newer version ourselves (locally), or will that open up other problems?

That'll take some effort on your side (and you'd have to fix the error message in SMAC, too).

Do you suggest adding this as a unit test? In what way would this allow us to focus on the evaluator?

No. I suggest using the existing unit tests to test how your function is invoked, without running the full Auto-sklearn & SMAC machinery.

janvanrijn commented 5 years ago

Thanks for your quick reply.

As you might have guessed, today we picked up the problem again. Oddly enough, the behaviour of auto-sklearn has changed (for the worse): the error now pops up at an earlier stage. Running the code of @vincentmele now results in this error:

[ERROR] [2019-04-01 19:37:46,007:AutoML(1):digits] Error creating dummy predictions: {'error': 'Result queue is empty', 'configuration_origin': 'DUMMY'} 
Traceback (most recent call last):
  File "/home/janvanrijn/PycharmProjects/auto-sklearn/test-can.py", line 45, in <module>
    main()
  File "/home/janvanrijn/PycharmProjects/auto-sklearn/test-can.py", line 32, in main
    automl.fit(X_train.copy(), y_train.copy(), dataset_name='digits')
  File "/home/janvanrijn/projects/auto-sklearn/autosklearn/estimators.py", line 664, in fit
    dataset_name=dataset_name,
  File "/home/janvanrijn/projects/auto-sklearn/autosklearn/estimators.py", line 337, in fit
    self._automl[0].fit(**kwargs)
  File "/home/janvanrijn/projects/auto-sklearn/autosklearn/automl.py", line 996, in fit
    load_models=load_models,
  File "/home/janvanrijn/projects/auto-sklearn/autosklearn/automl.py", line 208, in fit
    only_return_configuration_space=only_return_configuration_space,
  File "/home/janvanrijn/projects/auto-sklearn/autosklearn/automl.py", line 384, in _fit
    num_run = self._do_dummy_prediction(datamanager, num_run)
  File "/home/janvanrijn/projects/auto-sklearn/autosklearn/automl.py", line 313, in _do_dummy_prediction
    raise ValueError("Dummy prediction failed: %s " % str(additional_info))
ValueError: Dummy prediction failed: {'error': 'Result queue is empty', 'configuration_origin': 'DUMMY'} 

I find this hard to debug, as the line where the error pops up is clearly not the culprit. In fact, it seems that the function ExecuteTaFuncWithQueue.run(...) returned the status code CRASHED, which in turn is because autosklearn.evaluation.__init__.py makes a call to the Pynisher on lines 210/211 and the result queue comes back empty. I find the Pynisher a bit hard to debug. Is there any way I can increase the log level for the Pynisher?

mfeurer commented 5 years ago

Thanks for your feedback. As per #479 we adopted a fail-as-early-as-possible strategy. I agree that this is suboptimal to debug. @ahn1340, could you please check whether it is possible to provide some more information here?

janvanrijn commented 5 years ago

I just got this suggestion from @mfeurer:

In this file we can disable the Pynisher on line 90. That should make debugging easier.

stefanandonov commented 4 years ago

Hi. I was trying to create models for time series using the TimeSeriesSplit class as the resampling_strategy, and I got the same error as described above. Does anyone have a solution to this problem? It would help me a lot. Thanks!

AnwarIbrahim9 commented 4 years ago

Hi, I was trying the time series setup and got the same problem as above, so I spent some time debugging the code. I found that the cause of the error is that Y_train_target & Y_train_pred_full in train_evaluator are initialized to np.nan.

However, unlike cross-validation, which fills in values for every index, the time-series split leaves the last fold as NaN, which causes the error (it stops at the assertion).
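
A minimal sketch checking this, assuming Y_train_pred_full is indexed by training-set row (exact fold sizes may depend on the scikit-learn version, but here they match the assertion above):

import numpy as np
import sklearn.model_selection

# Same training-set size as in the (1123, 1347) assertion reported above.
n_samples = 1347
splitter = sklearn.model_selection.TimeSeriesSplit(n_splits=5)
seen_in_train = np.zeros(n_samples, dtype=bool)
for train_idx, test_idx in splitter.split(np.zeros((n_samples, 1))):
    seen_in_train[train_idx] = True
# The final test fold never appears in any training set, so those rows of
# Y_train_pred_full keep their np.nan initialization.
print(seen_in_train.sum(), n_samples)  # prints 1123 1347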

I hope my findings help.

AnwarIbrahim9 commented 4 years ago

Hi @mfeurer, @ahn1340. I tried replacing the NaN values in evaluation/train_evaluator.py to get the run working. I checked, and it seems like it won't affect anything. I hope you can review it for me in case I missed anything.

train_evaluator fit_predict_and_loss.txt
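
The attachment above contains the actual change; as a rough, untested sketch of the idea (the helper name and masking logic here are illustrative, not the attached patch):

import numpy as np

def mask_unpredicted_rows(Y_targets, Y_pred):
    # Illustrative only: drop samples whose prediction rows are still np.nan
    # (for TimeSeriesSplit, the final fold; see the comment above) so that
    # the shapes compared in calculate_auxiliary_losses agree again.
    keep = ~np.isnan(Y_pred).reshape(len(Y_pred), -1).any(axis=1)
    return Y_targets[keep], Y_pred[keep]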

basgalupp commented 4 years ago

Hi guys.

I'm running into the same situation here. Did anyone find a solution? I tried the function provided by @AnwarIbrahim9, but it didn't work.

Best, Márcio

svenstehle commented 4 years ago

Hey there,

I am currently working on a problem that requires me to do walk-forward evaluation on many different time series created by very similar processes. To my knowledge, sklearn does not include walk-forward evaluation "splits", even for a single time series.

Since I have to write a working function for that anyway, I was wondering if I could contribute, and whether you folks in sklearn even want such a thing there? It could be implemented as a split generator like TimeSeriesSplit, and could also return the whole list of "SplitSets", if so desired.

If you use it for a set of time series like I do, they should be created by similar processes. Functionality/options could include training on already finished AND current time series and predicting on the current time series, OR using only the current time series for training and prediction when creating the train/test sets.

Additionally, you could specify a minimum number of consecutive training samples to use for training; series that don't meet this minimum requirement would be excluded. A frequency for this minimum number would also need to be specified to allow for correct selection of samples.

I also think an optional sliding window (i.e. how far back in the past we use training samples) for the creation of the sets would be a great feature and would offer much-needed flexibility in testing when working with time series.

If this already sounds too bloated for basic split functionality, I am open to going with something simple: a basic walk-forward split function for just one time series, with a minimal training size and a specified frequency (a rough sketch follows below). However, more and more real-life prediction problems comprise many time series created by similar processes and need to be addressed accordingly.
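
As a starting point, a rough sketch of that simple single-series variant (the function name and signature are illustrative only, not an existing API):

import numpy as np

def walk_forward_splits(n_samples, min_train_size, test_size=1, max_train_size=None):
    # Illustrative walk-forward splitter for one time series: train on
    # everything seen so far (optionally capped by a sliding window of
    # max_train_size), then test on the next test_size observations.
    indices = np.arange(n_samples)
    start = min_train_size
    while start + test_size <= n_samples:
        train = indices[:start]
        if max_train_size is not None:
            train = train[-max_train_size:]
        yield train, indices[start:start + test_size]
        start += test_size

# Example: at least 5 training samples, predict one step ahead each time.
for train_idx, test_idx in walk_forward_splits(10, min_train_size=5):
    print("train:", train_idx, "test:", test_idx)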

Would love to give something back and contribute, get some feedback, and gain experience in open source, which I haven't done before.

If this is the wrong place for this, can you point me in a better direction?

Best, Sven

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs for the next 7 days. Thank you for your contributions.