aeon-toolkit / aeon

A toolkit for machine learning from time series
https://aeon-toolkit.org/
BSD 3-Clause "New" or "Revised" License

[ENH] Revise test_all_forecasters #574

Closed: TonyBagnall closed this issue 4 months ago

TonyBagnall commented 1 year ago

Describe the feature or idea you want to propose

Currently test_all_forecasters takes 45 minutes on my machine (down from over an hour after https://github.com/aeon-toolkit/aeon/pull/568). I think two further things to look at are:

  1. Splitting tests that test all combinations into single tests, and
  2. Combining tests to reduce the number of calls to fit (as described in #418 for all estimators); see the sketch after this list for the kind of combination meant.
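
A minimal sketch of the kind of combining meant in point 2, with illustrative test and fixture names (not aeon's actual test suite): one call to fit, followed by several cheap assertions that currently live in separate tests.

    import numpy as np

    def test_fit_then_basic_properties(estimator_instance, y_train, fh):
        # a single fit, shared by all of the checks below
        estimator_instance.fit(y_train, fh=fh)

        # check 1: fitted parameters come back as a dict
        assert isinstance(estimator_instance.get_fitted_params(), dict)

        # check 2: predict returns one value per step of the horizon
        y_pred = estimator_instance.predict()
        assert len(y_pred) == len(fh)

        # check 3: the fh attribute is stored unchanged
        np.testing.assert_array_equal(estimator_instance.fh, fh)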

list of tests in test_all_forecasters

Describe your proposed solution

Tests in test_all_forecasters. There are a lot of TODOs, which have probably been there a long time.

  1. test_get_fitted_params: fits all estimators and tests whether get_fitted_params returns a dict. This should be done in test_all_estimators and combined with another test.

  2. test_raises_not_fitted_error: calls the forecasting-specific update and update_predict. In principle fine, but it also creates a cv sliding window:

            cv = SlidingWindowSplitter(fh=1, window_length=1, start_with_window=False)
            estimator_instance.update_predict(y_test, cv=cv)

    not sure if this is required; a splitter-free version of the check is sketched below
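
A splitter-free sketch of the not-fitted check; the NotFittedError import path is an assumption and the test name is illustrative:

    import pytest

    from aeon.exceptions import NotFittedError  # import path is an assumption

    def test_raises_not_fitted_error_minimal(estimator_instance, y_test):
        # calling update before fit should raise NotFittedError; no
        # SlidingWindowSplitter is needed for this part of the check
        with pytest.raises(NotFittedError):
            estimator_instance.update(y_test, update_params=False)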

  3. test_y_multivariate_raises_error seems fine, but I am not sure why it has this branch tagged on at the end, after fit has been called; it seems pointless to me, sitting directly before the end of the function:

        if estimator_instance.get_tag("scitype:y") in ["univariate", "both"]:
            # this should pass since "both" allows any number of variables
            # and "univariate" automatically vectorizes, behaves multivariate
            pass
  4. test_y_invalid_type_raises_error: tests that invalid y input types raise an error. It is tested on two invalid types, which seems excessive:

    INVALID_y_INPUT_TYPES = [list("bar"), tuple()]
  5. test_X_invalid_type_raises_error: tests that invalid X input types raise an error. Again tested on two invalid types, which seems excessive; one representative type would arguably do (see the sketch below):

    INVALID_X_INPUT_TYPES = [list("foo"), tuple()]
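
A sketch of what a single-type version could look like; the exact exception type raised by the input checks is an assumption, so both candidates are allowed here:

    import pytest

    def test_y_invalid_type_raises_error_single(estimator_instance, fh):
        # one representative invalid y should be enough to exercise the
        # input validation; the exception type is an assumption
        with pytest.raises((TypeError, ValueError)):
            estimator_instance.fit(list("bar"), fh=fh)
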
  6. test_predict_residuals: Check that predict_residuals method works as expected.

    @pytest.mark.parametrize(
        "index_fh_comb", VALID_INDEX_FH_COMBINATIONS, ids=index_fh_comb_names
    )
    @pytest.mark.parametrize("fh_int", TEST_FHS, ids=[f"fh={fh}" for fh in TEST_FHS])

    this is a slow one I think, tested on a lot of combinations (I can't quite work out how many; see the counting sketch below). It seems to test that the correct indexes are returned via _assert_correct_pred_time_index, and it calls fit, predict and predict_residuals.
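
For reference, the number of cases pytest generates from stacked parametrize decorators is the product of the lengths of the parameter lists, times however many estimator/scenario fixtures the framework supplies; a small helper makes the arithmetic explicit:

    def n_parametrized_cases(*param_lists):
        # pytest builds the Cartesian product of stacked parametrize lists
        n = 1
        for params in param_lists:
            n *= len(params)
        return n

    # e.g. n_parametrized_cases(VALID_INDEX_FH_COMBINATIONS, TEST_FHS)
    # gives 9 * len(TEST_FHS) cases per estimator/scenario fixture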

  7. test_predict_time_index_with_X: checks that the predicted time index matches the forecasting horizon. Called 27 times per forecaster (the 9 valid index FH combinations times 3 other parameters), approximately 700 times in total.

Another big one, using the nine VALID_INDEX_FH_COMBINATIONS in combination with other parameters:

    @pytest.mark.parametrize(
        "index_fh_comb", VALID_INDEX_FH_COMBINATIONS, ids=index_fh_comb_names
    )
    @pytest.mark.parametrize(
        "fh_int_oos", TEST_OOS_FHS, ids=[f"fh={fh}" for fh in TEST_OOS_FHS]
    )

It checks _assert_correct_pred_time_index and _assert_correct_columns.

  8. test_predict_time_index_in_sample_full: "Check that predicted time index equals fh for full in-sample predictions". Called a whopping 54 times per estimator (the 9 valid index FH combinations times 6 other parameters), 1650 times in total:

    @pytest.mark.parametrize(
        "index_fh_comb", VALID_INDEX_FH_COMBINATIONS, ids=index_fh_comb_names
    )
    so it could in principle be subsumed into another test, for example as in the sketch below
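
One way it could be subsumed, sketched in the same fragment style as the decorators quoted above (the combined test name and the extra flag are illustrative, not existing aeon code): fold the full in-sample case into the out-of-sample time-index test via one boolean parameter, so both share a single fit per combination.

    @pytest.mark.parametrize(
        "index_fh_comb", VALID_INDEX_FH_COMBINATIONS, ids=index_fh_comb_names
    )
    @pytest.mark.parametrize("in_sample_full", [False, True], ids=["oos", "ins_full"])
    def test_predict_time_index_combined(
        self, estimator_instance, n_columns, index_fh_comb, in_sample_full
    ):
        # fit once, then pick an out-of-sample or full in-sample horizon
        # depending on the flag and run the shared index assertions
        ...
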
  9. test_predict_interval: "Check prediction intervals returned by predict". Another slow test, 24 calls per estimator:

    @pytest.mark.parametrize(
        "coverage", TEST_ALPHAS, ids=[f"alpha={a}" for a in TEST_ALPHAS]
    )
    @pytest.mark.parametrize(
        "fh_int_oos", TEST_OOS_FHS, ids=[f"fh={fh}" for fh in TEST_OOS_FHS]
    )

    seems quite a weird test: the assertion only checks that the returned object conforms to the pred_interval mtype via check_is_mtype, not that the interval values themselves are sensible

            pred_ints = estimator_instance.predict_interval(
                fh_int_oos, coverage=coverage
            )
            valid, msg, _ = check_is_mtype(
                pred_ints, mtype="pred_interval", scitype="Proba", return_metadata=True
            )  # type: ignore
            assert valid, msg
  10. test_predict_quantiles: "Check prediction quantiles returned by predict". Called 27 times per forecaster (the 9 valid index FH combinations times 3 other parameters), approximately 700 times in total:

    
    @pytest.mark.parametrize(
        "alpha", TEST_ALPHAS, ids=[f"alpha={a}" for a in TEST_ALPHAS]
    )
    @pytest.mark.parametrize(
        "fh_int_oos", TEST_OOS_FHS, ids=[f"fh={fh}" for fh in TEST_OOS_FHS]
    )

    it calls predict_quantiles and compares the output using the rather confusing _check_predict_quantiles; a simpler set of checks is sketched below
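
A simpler set of checks that would cover much of the intent, assuming predict_quantiles returns a DataFrame with one column per quantile level in increasing order of alpha (that column layout is an assumption, not something verified here):

    def _simple_quantile_checks(pred_quantiles, expected_index):
        # quantile forecasts should share the point-forecast time index
        assert pred_quantiles.index.equals(expected_index)
        # quantile values should be non-decreasing across quantile levels
        values = pred_quantiles.to_numpy()
        assert (values[:, :-1] <= values[:, 1:]).all()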

  11. test_pred_int_tag: checks whether the capability:pred_int tag is correctly set. No parameters; it seems to test whether the method has been implemented.

  12. test_score: checks the score method:

    @pytest.mark.parametrize(
        "fh_int_oos", TEST_OOS_FHS, ids=[f"fh={fh}" for fh in TEST_OOS_FHS]
    )

    is it necessary to test so many combinations? The same file also has test_update_predict_single:

    @pytest.mark.parametrize(
        "fh_int_oos", TEST_OOS_FHS, ids=[f"fh={fh}" for fh in TEST_OOS_FHS]
    )

    can that be combined with 12 (test_score)? A sketch of the combination is below.
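
If they were combined, the score check is one extra assertion after a fit/predict that the test has to do anyway; an illustrative sketch with made-up helper and fixture names:

    import numpy as np

    def _fit_predict_and_score(estimator_instance, y_train, y_test, fh_int_oos):
        estimator_instance.fit(y_train, fh=fh_int_oos)
        y_pred = estimator_instance.predict()
        # score against the slice of the held-out series the forecast covers
        score = estimator_instance.score(y_test.loc[y_pred.index], fh=fh_int_oos)
        assert np.isfinite(score)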

  13. test_update_predict_predicted_index: checks the predicted index in update_predict. Called a massive 72 times per estimator, over 2000 times in total:

    @pytest.mark.parametrize(
        "fh_int_oos", TEST_OOS_FHS, ids=[f"fh={fh}" for fh in TEST_OOS_FHS]
    )

these functions have multiple arguments, e.g.

    def test_update_predict_predicted_index(
        self,
        estimator_instance,
        n_columns,
        fh_int_oos,
        step_length,
        initial_window,
        update_params,
    ):

not sure where these are set
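
For illustration only, not a claim about where aeon actually sets them: in pytest, arguments like these can be supplied either by fixtures or by parametrize decorators stacked at function or class level, and the generated case count multiplies accordingly.

    import pytest

    # made-up values, purely to illustrate the mechanism
    @pytest.mark.parametrize("update_params", [True, False])
    @pytest.mark.parametrize("initial_window", [10])
    @pytest.mark.parametrize("step_length", [1, 5])
    def test_example(step_length, initial_window, update_params):
        # 2 * 1 * 2 = 4 cases are generated from the stacked decorators
        assert initial_window > 0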

  14. test__y_and_cutoff: "Check cutoff and _y" (not the most informative comment):

    def test__y_and_cutoff(self, estimator_instance, n_columns):

    where is n_columns set?

  15. test__y_when_refitting: tests that _y is updated when the forecaster is refitted.

combine with 14 (test__y_and_cutoff)? A combined version is sketched below.
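
A sketch of what a combined version of 14 and 15 could look like (test name and fixtures are illustrative):

    def test__y_and_refitting_combined(estimator_instance, y_train, fh):
        # after fit, _y should end at the last training time point
        estimator_instance.fit(y_train, fh=fh)
        assert estimator_instance._y.index[-1] == y_train.index[-1]

        # refitting on a truncated series should replace _y
        y_short = y_train.iloc[:-5]
        estimator_instance.fit(y_short, fh=fh)
        assert estimator_instance._y.index[-1] == y_short.index[-1]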

  16. test_fh_attribute:

        f.fit(y_train, fh=FH0)
        np.testing.assert_array_equal(f.fh, FH0)
        f.predict()
        np.testing.assert_array_equal(f.fh, FH0)
        f.predict(FH0)
        np.testing.assert_array_equal(f.fh, FH0)

surely one of the two calls to predict is redundant?

  17. test_fh_not_passed_error_handling: checks that not passing fh in fit/predict raises the correct error. Seems fine.

  18. test_different_fh_in_fit_and_predict_error_handling: checks that an fh which differs between fit and predict raises the correct error. Combine with 17 (test_fh_not_passed_error_handling)?

  19. test_hierarchical_with_exogeneous: checks that hierarchical forecasting works, also see bug #3961. I wonder what bug #3961 is?

Describe alternatives you've considered, if relevant

No response

Additional context

No response

TonyBagnall commented 4 months ago

Closing this now, since testing has been transformed significantly and it's no longer an issue.