Nixtla / hierarchicalforecast

Probabilistic hierarchical forecasting 👑 with statistical and econometric methods.
https://nixtlaverse.nixtla.io/hierarchicalforecast
Apache License 2.0

[FEAT] Efficient Schafer-Strimmer for MinT #280

Closed elephaint closed 2 months ago

elephaint commented 3 months ago

This PR implements an optimized version of the Schafer-Strimmer shrunk empirical covariance algorithm for MinT.

1. Up to 60x faster

The headline result is that MinT reconciliation becomes much faster, as demonstrated in the table below, where mint_legacy is the old version and mint is the new version. The bracketed values denote whether the forecasts contain NaNs (True or False) and the number of bottom-level time series reconciled (20, 200, or 2000). A speedup of up to 60x (on my machine, an i7-12700K) is achieved in the no-NaN, 2000-time-series case.

Note that the NaN version is about 3x slower than the version that doesn't have to deal with NaNs, due to all the masking involved. However, it is still significantly faster than the legacy version (which doesn't handle NaNs properly).
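For context, the Schäfer–Strimmer estimator shrinks the sample correlation matrix toward the identity with an analytically derived intensity. Below is a minimal, unoptimized NumPy sketch of that idea (illustrative only; function and variable names are not those of the PR, and it ignores NaN handling):

```python
import numpy as np

def shrunk_covariance(residuals):
    """Schafer-Strimmer-style shrunk covariance (illustrative sketch).

    residuals: (n_timeseries, n_samples) array of in-sample errors.
    Shrinks the off-diagonal correlations toward zero by an
    analytically estimated intensity lambda.
    """
    n_series, n_samples = residuals.shape
    x = residuals - residuals.mean(axis=1, keepdims=True)
    std = x.std(axis=1, ddof=1)
    xs = x / std[:, None]  # standardised residuals
    # Unbiased sample correlation matrix.
    corr = (xs @ xs.T) / (n_samples - 1)
    # Estimated variance of each correlation coefficient.
    var_corr = (n_samples / (n_samples - 1) ** 3) * (
        (xs**2) @ (xs**2).T - ((xs @ xs.T) ** 2) / n_samples
    )
    # Optimal shrinkage intensity over the off-diagonal entries.
    off = ~np.eye(n_series, dtype=bool)
    lam = var_corr[off].sum() / (corr[off] ** 2).sum()
    lam = min(max(lam, 0.0), 1.0)
    # Shrink correlations toward the identity, rescale to covariance.
    shrunk_corr = (1.0 - lam) * corr
    np.fill_diagonal(shrunk_corr, 1.0)
    return shrunk_corr * np.outer(std, std), lam
```

The optimized version in this PR avoids materializing intermediates like these and fuses the loops, which is where most of the speedup comes from.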

[Screenshot 2024-09-02 113916: timing comparison of mint vs. mint_legacy]

2. Forecasting performance

Forecasting performance is similar:

[Screenshot 2024-09-02 113002: forecasting performance comparison]

3. Improved NaN handling

This PR also improves NaN handling (thanks to @christophertitchen for the suggestions!), extending the set of cases MinT can handle. For example, with 20% NaN values introduced into the Australian Tourism forecasts, the old version fails, whereas the new version gives results nearly identical to the no-NaN case:

[Screenshot 2024-09-02 113218: reconciliation results with 20% NaNs]

Code for introducing NaNs in the Australian Tourism example:

import numpy as np

# Randomly set NaNs in a fraction of the fitted values
nan_fraction = 0.2
Y_fitted_df = Y_fitted_df.reset_index()
Y_fitted_df_sample_idx = Y_fitted_df.sample(frac=nan_fraction).index
Y_fitted_df.loc[Y_fitted_df_sample_idx, ["y", "ETS"]] = np.nan
Y_fitted_df = Y_fitted_df.set_index("unique_id")

elephaint commented 3 months ago

@christophertitchen Thanks again for the remarks. I made a few other micro-optimizations, further reducing computation time by ~30% in total.

elephaint commented 2 months ago

I am happy the comments were helpful!

The only further change I thought of would be to modify the naïve approach when computing covariance and correlation to a pairwise approach for each pair of time series, like what is done in R's stats::cov. This could better describe the relationship between pairs of time series, but honestly, it would slow down the generation of the shrunk W and go against the spirit of what you were trying to achieve here by implementing a fast method, so I think it is perfectly fine as you have done it.

As a quick example, imagine monthly time series representing the units sold of products over the past six months. Take two of those products, $X$ and $Y$, with $Y$ being an NPI (new product introduction) launched only four months prior. The residuals of a naïve forecast would be:

$X = \left[(x_{t-6} - x_{t-7}),\ (x_{t-5} - x_{t-6}),\ (x_{t-4} - x_{t-5}),\ (x_{t-3} - x_{t-4}),\ (x_{t-2} - x_{t-3}),\ (x_{t-1} - x_{t-2})\right]$

$Y = \left[\,-,\ -,\ -,\ (y_{t-3} - y_{t-4}),\ (y_{t-2} - y_{t-3}),\ (y_{t-1} - y_{t-2})\,\right]$

This approach would use an n_samples of $6$, and the mean of $X$ would include all six observations. A pairwise approach, however, would only use pairs of temporally aligned observations, so an n_samples of $3$, and the mean of $X$ would be $\bar{X} = \left((x_{t-3} - x_{t-4}) + (x_{t-2} - x_{t-3}) + (x_{t-1} - x_{t-2})\right)/3$, which would obviously lead to a different $\mathrm{cov}(X, Y)$, $\mathrm{corr}(X, Y)$, $\sigma_x$ for standardisation, and so on.

not_mask = ~np.isnan(residuals)
...
for i in prange(n_timeseries):
    x = residuals[i]
    for j in range(i + 1):
        y = residuals[j]
        # Get the temporally aligned pairs of observations.
        mask = not_mask[i] & not_mask[j]
        n_samples = np.sum(mask)  # count of aligned pairs, not len(mask)
        # Check for insufficient observations for unbiased estimator.
        if n_samples < 2:
            ...
        x = x[mask]
        y = y[mask]
        # Calculate the pairwise sample means.
        mean_x = np.mean(x)
        mean_y = np.mean(y)
        ...
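Filling in the elided parts, a runnable (non-numba) sketch of the pairwise idea above might look like this; `pairwise_cov` is a hypothetical helper name, not anything in the library, and the behaviour mirrors R's stats::cov with use="pairwise.complete.obs":

```python
import numpy as np

def pairwise_cov(residuals):
    """Pairwise-aligned covariance over non-NaN pairs (illustrative sketch)."""
    n_timeseries = residuals.shape[0]
    not_mask = ~np.isnan(residuals)
    cov = np.full((n_timeseries, n_timeseries), np.nan)
    for i in range(n_timeseries):
        x_full = residuals[i]
        for j in range(i + 1):
            y_full = residuals[j]
            # Get the temporally aligned pairs of observations.
            mask = not_mask[i] & not_mask[j]
            n_samples = mask.sum()
            # Insufficient observations for the unbiased estimator.
            if n_samples < 2:
                continue
            x = x_full[mask]
            y = y_full[mask]
            # Pairwise sample means over the aligned observations only.
            c = ((x - x.mean()) * (y - y.mean())).sum() / (n_samples - 1)
            cov[i, j] = cov[j, i] = c
    return cov
```

Note that, as discussed above, each entry of the resulting matrix can be computed from a different set of periods, which is exactly what makes the pairwise approach slower and its output potentially not positive semi-definite.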

Anyway, I am not familiar with any literature on this topic or if there is any evidence to suggest it is significantly worse, and the legacy version also takes the same approach, so as I said, I favour the fast approach you have implemented here. Good job!

Didn't think about this, but it's a good point. I'll have a try and play around a bit with it :)

elephaint commented 2 months ago

@christophertitchen Thanks for the suggestion on the temporal alignment - I included two versions: one that handles NaNs in the way you describe, and a faster one for when there are no NaNs. The NaN version is a bit slower (partly due to the masking involved), but it significantly enlarges the scope of MinT, as it can handle many more cases without producing ill-conditioned W matrices (see the example in the top post). Thanks!

elephaint commented 2 months ago

@christophertitchen Again thanks for the thoughtful comments.

I think I'll be deprecating the legacy version now that I get equivalent results across datasets.

christophertitchen commented 2 months ago

> @christophertitchen Again thanks for the thoughtful comments.
>
> I think I'll be deprecating the legacy version now that I get equivalent results across datasets.

@elephaint great job!

Also, regarding Hyndman's implementation, you were right.

https://github.com/earowang/hts/blob/1408ab2c5c40f1022e6957bcf8438aaefc8464bf/R/MinT.R#L24

  covm <- crossprod(x) / n
  v <- (1/(n * (n - 1))) * (crossprod(xs^2) - 1/n * (crossprod(xs))^2)

It uses the biased estimator, i.e. normalisation by $N$. We could have changed ddof in np.ma.cov to $0$ to copy this approach, but deprecating the legacy approach in favour of your superior one makes sense. 😉
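To illustrate the difference in normalisation, for zero-mean residuals the biased estimator divides by $N$ (matching crossprod(x) / n above) while the unbiased one divides by $N - 1$; NumPy exposes this via the ddof argument (a toy example, not code from the PR):

```python
import numpy as np

# Zero-mean toy residuals for two series over n = 4 periods.
x = np.array([[1.0, -1.0, 2.0, -2.0],
              [0.5, -0.5, 1.0, -1.0]])
n = x.shape[1]

biased = np.cov(x, ddof=0)    # normalises by N, like crossprod(x) / n
unbiased = np.cov(x, ddof=1)  # normalises by N - 1 (NumPy's default)

# With zero-mean rows, the biased estimate is exactly x @ x.T / n,
# and the two estimators differ by the factor N / (N - 1).
assert np.allclose(biased, x @ x.T / n)
assert np.allclose(biased * n / (n - 1), unbiased)
```

The same ddof argument exists on np.ma.cov, which is what the legacy approach used for the masked (NaN) case.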