aeon-toolkit / aeon

A toolkit for machine learning from time series
https://aeon-toolkit.org/
BSD 3-Clause "New" or "Revised" License

[ENH] Revise test_all_forecasters #574

Closed: TonyBagnall closed this issue 4 months ago

TonyBagnall commented 1 year ago

Describe the feature or idea you want to propose

Currently test_all_forecasters takes 45 minutes on my machine (down from over an hour after https://github.com/aeon-toolkit/aeon/pull/568). I think two further things to look at are:

  1. Splitting tests that test all combinations into single tests, and
  2. Combining tests to reduce the number of calls to fit (as described in #418 for all estimators); see the sketch after this list for the kind of combination meant.
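
A minimal sketch of the kind of combining meant in point 2, with illustrative test and fixture names (not aeon's actual test suite): one call to fit, followed by several cheap assertions that currently live in separate tests.

    import numpy as np

    def test_fit_then_basic_properties(estimator_instance, y_train, fh):
        # a single fit, shared by all of the checks below
        estimator_instance.fit(y_train, fh=fh)

        # check 1: fitted parameters come back as a dict
        assert isinstance(estimator_instance.get_fitted_params(), dict)

        # check 2: predict returns one value per step of the horizon
        y_pred = estimator_instance.predict()
        assert len(y_pred) == len(fh)

        # check 3: the fh attribute is stored unchanged
        np.testing.assert_array_equal(estimator_instance.fh, fh)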

list of tests in test_all_forecasters

Describe your proposed solution

Tests in test_all_forecasters. There are a lot of TODOs, which have probably been there a long time.

  1. test_get_fitted_params: fits all estimators and tests whether get_fitted_params returns a dict. This should be done in test_all_estimators and combined with another test.

  2. test_raises_not_fitted_error: calls the forecasting-specific update and update_predict. In principle fine, but it also creates a cv sliding window:

            cv = SlidingWindowSplitter(fh=1, window_length=1, start_with_window=False)
            estimator_instance.update_predict(y_test, cv=cv)

    not sure if this is required; a splitter-free version of the check is sketched below
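
A splitter-free sketch of the not-fitted check; the NotFittedError import path is an assumption and the test name is illustrative:

    import pytest

    from aeon.exceptions import NotFittedError  # import path is an assumption

    def test_raises_not_fitted_error_minimal(estimator_instance, y_test):
        # calling update before fit should raise NotFittedError; no
        # SlidingWindowSplitter is needed for this part of the check
        with pytest.raises(NotFittedError):
            estimator_instance.update(y_test, update_params=False)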

  3. test_y_multivariate_raises_error seems fine, but I am not sure why it has this branch tagged on at the end, after fit has been called; it seems pointless to me, sitting directly before the end of the function:

        if estimator_instance.get_tag("scitype:y") in ["univariate", "both"]:
            # this should pass since "both" allows any number of variables
            # and "univariate" automatically vectorizes, behaves multivariate
            pass
  4. test_y_invalid_type_raises_error: tests that invalid y input types raise an error. It is tested on two invalid types, which seems excessive:

    INVALID_y_INPUT_TYPES = [list("bar"), tuple()]
  5. test_X_invalid_type_raises_error: tests that invalid X input types raise an error. Again tested on two invalid types, which seems excessive; one representative type would arguably do (see the sketch below):

    INVALID_X_INPUT_TYPES = [list("foo"), tuple()]
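
A sketch of what a single-type version could look like; the exact exception type raised by the input checks is an assumption, so both candidates are allowed here:

    import pytest

    def test_y_invalid_type_raises_error_single(estimator_instance, fh):
        # one representative invalid y should be enough to exercise the
        # input validation; the exception type is an assumption
        with pytest.raises((TypeError, ValueError)):
            estimator_instance.fit(list("bar"), fh=fh)
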
  6. test_predict_residuals: Check that predict_residuals method works as expected.

    @pytest.mark.parametrize(
        "index_fh_comb", VALID_INDEX_FH_COMBINATIONS, ids=index_fh_comb_names
    )
    @pytest.mark.parametrize("fh_int", TEST_FHS, ids=[f"fh={fh}" for fh in TEST_FHS])

    this is a slow one I think, tested on a lot of combinations (I can't quite work out how many; see the counting sketch below). It seems to test that the correct indexes are returned via _assert_correct_pred_time_index, and it calls fit, predict and predict_residuals.
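
For reference, the number of cases pytest generates from stacked parametrize decorators is the product of the lengths of the parameter lists, times however many estimator/scenario fixtures the framework supplies; a small helper makes the arithmetic explicit:

    def n_parametrized_cases(*param_lists):
        # pytest builds the Cartesian product of stacked parametrize lists
        n = 1
        for params in param_lists:
            n *= len(params)
        return n

    # e.g. n_parametrized_cases(VALID_INDEX_FH_COMBINATIONS, TEST_FHS)
    # gives 9 * len(TEST_FHS) cases per estimator/scenario fixture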

  7. test_predict_time_index_with_X: checks that the predicted time index matches the forecasting horizon. Called 27 times per forecaster (the 9 valid index FH combinations times 3 other parameters), approximately 700 times in total.

Another big one, using the nine VALID_INDEX_FH_COMBINATIONS in combination with other parameters:

    @pytest.mark.parametrize(
        "index_fh_comb", VALID_INDEX_FH_COMBINATIONS, ids=index_fh_comb_names
    )
    @pytest.mark.parametrize(
        "fh_int_oos", TEST_OOS_FHS, ids=[f"fh={fh}" for fh in TEST_OOS_FHS]
    )

It checks _assert_correct_pred_time_index and _assert_correct_columns.

  8. test_predict_time_index_in_sample_full: "Check that predicted time index equals fh for full in-sample predictions". Called a whopping 54 times per estimator (the 9 valid index FH combinations times 6 other parameters), 1650 times in total:

    @pytest.mark.parametrize(
        "index_fh_comb", VALID_INDEX_FH_COMBINATIONS, ids=index_fh_comb_names
    )
    so it could in principle be subsumed into another test, for example as in the sketch below
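
One way it could be subsumed, sketched in the same fragment style as the decorators quoted above (the combined test name and the extra flag are illustrative, not existing aeon code): fold the full in-sample case into the out-of-sample time-index test via one boolean parameter, so both share a single fit per combination.

    @pytest.mark.parametrize(
        "index_fh_comb", VALID_INDEX_FH_COMBINATIONS, ids=index_fh_comb_names
    )
    @pytest.mark.parametrize("in_sample_full", [False, True], ids=["oos", "ins_full"])
    def test_predict_time_index_combined(
        self, estimator_instance, n_columns, index_fh_comb, in_sample_full
    ):
        # fit once, then pick an out-of-sample or full in-sample horizon
        # depending on the flag and run the shared index assertions
        ...
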
  9. test_predict_interval: "Check prediction intervals returned by predict". Another slow test, 24 calls per estimator:

    @pytest.mark.parametrize(
        "coverage", TEST_ALPHAS, ids=[f"alpha={a}" for a in TEST_ALPHAS]
    )
    @pytest.mark.parametrize(
        "fh_int_oos", TEST_OOS_FHS, ids=[f"fh={fh}" for fh in TEST_OOS_FHS]
    )

    seems quite a weird test: the assertion only checks that the returned object conforms to the pred_interval mtype via check_is_mtype, not that the interval values themselves are sensible

            pred_ints = estimator_instance.predict_interval(
                fh_int_oos, coverage=coverage
            )
            valid, msg, _ = check_is_mtype(
                pred_ints, mtype="pred_interval", scitype="Proba", return_metadata=True
            )  # type: ignore
            assert valid, msg
  10. test_predict_quantiles: "Check prediction quantiles returned by predict". Called 27 times per forecaster (the 9 valid index FH combinations times 3 other parameters), approximately 700 times in total:

    
    @pytest.mark.parametrize(
        "alpha", TEST_ALPHAS, ids=[f"alpha={a}" for a in TEST_ALPHAS]
    )
    @pytest.mark.parametrize(
        "fh_int_oos", TEST_OOS_FHS, ids=[f"fh={fh}" for fh in TEST_OOS_FHS]
    )

    it calls predict_quantiles and compares the output using the rather confusing _check_predict_quantiles; a simpler set of checks is sketched below
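
A simpler set of checks that would cover much of the intent, assuming predict_quantiles returns a DataFrame with one column per quantile level in increasing order of alpha (that column layout is an assumption, not something verified here):

    def _simple_quantile_checks(pred_quantiles, expected_index):
        # quantile forecasts should share the point-forecast time index
        assert pred_quantiles.index.equals(expected_index)
        # quantile values should be non-decreasing across quantile levels
        values = pred_quantiles.to_numpy()
        assert (values[:, :-1] <= values[:, 1:]).all()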

  11. test_pred_int_tag: checks whether the capability:pred_int tag is correctly set. No parameters; it seems to test whether the method has been implemented.

  12. test_score: checks the score method:

    @pytest.mark.parametrize(
        "fh_int_oos", TEST_OOS_FHS, ids=[f"fh={fh}" for fh in TEST_OOS_FHS]
    )

    is it necessary to test so many combinations? The same file also has test_update_predict_single:

    @pytest.mark.parametrize(
        "fh_int_oos", TEST_OOS_FHS, ids=[f"fh={fh}" for fh in TEST_OOS_FHS]
    )

    can that be combined with 12 (test_score)? A sketch of the combination is below.
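
If they were combined, the score check is one extra assertion after a fit/predict that the test has to do anyway; an illustrative sketch with made-up helper and fixture names:

    import numpy as np

    def _fit_predict_and_score(estimator_instance, y_train, y_test, fh_int_oos):
        estimator_instance.fit(y_train, fh=fh_int_oos)
        y_pred = estimator_instance.predict()
        # score against the slice of the held-out series the forecast covers
        score = estimator_instance.score(y_test.loc[y_pred.index], fh=fh_int_oos)
        assert np.isfinite(score)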

  13. test_update_predict_predicted_index: checks the predicted index in update_predict. Called a massive 72 times per estimator, over 2000 times in total:

    @pytest.mark.parametrize(
        "fh_int_oos", TEST_OOS_FHS, ids=[f"fh={fh}" for fh in TEST_OOS_FHS]
    )

these functions have multiple arguments, e.g.

    def test_update_predict_predicted_index(
        self,
        estimator_instance,
        n_columns,
        fh_int_oos,
        step_length,
        initial_window,
        update_params,
    ):

not sure where these are set
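
For illustration only, not a claim about where aeon actually sets them: in pytest, arguments like these can be supplied either by fixtures or by parametrize decorators stacked at function or class level, and the generated case count multiplies accordingly.

    import pytest

    # made-up values, purely to illustrate the mechanism
    @pytest.mark.parametrize("update_params", [True, False])
    @pytest.mark.parametrize("initial_window", [10])
    @pytest.mark.parametrize("step_length", [1, 5])
    def test_example(step_length, initial_window, update_params):
        # 2 * 1 * 2 = 4 cases are generated from the stacked decorators
        assert initial_window > 0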

  14. test__y_and_cutoff: "Check cutoff and _y" (not the most informative comment):

    def test__y_and_cutoff(self, estimator_instance, n_columns):

    where is n_columns set?

  15. test__y_when_refitting: tests that _y is updated when the forecaster is refitted.

combine with 14 (test__y_and_cutoff)? A combined version is sketched below.
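
A sketch of what a combined version of 14 and 15 could look like (test name and fixtures are illustrative):

    def test__y_and_refitting_combined(estimator_instance, y_train, fh):
        # after fit, _y should end at the last training time point
        estimator_instance.fit(y_train, fh=fh)
        assert estimator_instance._y.index[-1] == y_train.index[-1]

        # refitting on a truncated series should replace _y
        y_short = y_train.iloc[:-5]
        estimator_instance.fit(y_short, fh=fh)
        assert estimator_instance._y.index[-1] == y_short.index[-1]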

  16. test_fh_attribute:

        f.fit(y_train, fh=FH0)
        np.testing.assert_array_equal(f.fh, FH0)
        f.predict()
        np.testing.assert_array_equal(f.fh, FH0)
        f.predict(FH0)
        np.testing.assert_array_equal(f.fh, FH0)

surely one of the two calls to predict is redundant?

  17. test_fh_not_passed_error_handling: checks that not passing fh in fit/predict raises the correct error. Seems fine.

  18. test_different_fh_in_fit_and_predict_error_handling: checks that an fh which differs between fit and predict raises the correct error. Combine with 17 (test_fh_not_passed_error_handling)?

  19. test_hierarchical_with_exogeneous: checks that hierarchical forecasting works, also see bug #3961. I wonder what bug #3961 is?

Describe alternatives you've considered, if relevant

No response

Additional context

No response

TonyBagnall commented 4 months ago

Closing this now, since testing has been transformed significantly and it's no longer an issue.