amazon-science / chronos-forecasting

Chronos: Pretrained (Language) Models for Probabilistic Time Series Forecasting
https://arxiv.org/abs/2403.07815
Apache License 2.0

Predictions change with batch size of context #126

Open anishsaha12 opened 1 week ago

anishsaha12 commented 1 week ago

First of all thanks for this great pretrained model!

I am however facing an issue with respect to the consistency of the model's predictions for inputs of different batch sizes. Since this is essentially a univariate model, I expected it to produce the same predictions for the same input context for each time-series, irrespective of the batch size.

Some context: My team is experimenting with Chronos to predict the yearly demand (365 days) for certain products within our company. I have fine-tuned the model with our internal data to have a context_length of 512 and prediction_length of 365. Now I am evaluating the performance of this model for our use-case.

To reproduce: (Sorry, I cannot share the input data due to company policies, but I am including a scaled visualization of the context and forecast.) I create two batched inputs, one of size 8 ($B_1$) and the other of size 32 ($B_2$). Here, $B_1 \subset B_2$, and I focus on the forecasts for one time-series $T_1$ contained in both batches.

To produce the forecast I use the following, replacing context (shape: [batch_size, 512]) with $B_1$ and $B_2$:

import numpy as np
import torch
import transformers

# `pipeline` is the fine-tuned ChronosPipeline, `context` is a [batch_size, 512]
# tensor, and `seed` is an integer; all are prepared earlier.
with torch.no_grad():
    transformers.set_seed(seed=seed)
    forecast = pipeline.predict(
        context,
        prediction_length=365,
        num_samples=20,
        temperature=2.0,
        top_k=200,
        top_p=1.0,
    )
    low, median, high = np.quantile(forecast.numpy(), [0.1, 0.5, 0.9], axis=1)
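For concreteness, a minimal sketch of how the two batches relate (the tensor below is random stand-in data, not my actual series):

import torch

# Illustrative only: B2 is the batch of 32 series and B1 is its first 8 series,
# so the series at index 0 (T_1) is identical in both batches.
full_context = torch.randn(32, 512)   # stand-in for the real [32, 512] context
context_b2 = full_context             # batch B2 (size 32)
context_b1 = full_context[:8]         # batch B1 (size 8), a subset of B2
print(torch.equal(context_b1[0], context_b2[0]))  # True: same input for T_1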

The issue is that the predicted values (median) for $T_1$ from $B_1$ and $B_2$ are different, even though the input for this series is the same in both batches. A visualization of the context and the predicted values for the two batch sizes is shown below. [image: context and median forecasts for the two batch sizes]

Is there anything I am missing here or is this expected? Thanks!

abdulfatir commented 1 week ago

@anishsaha12 batch size should not affect predictions in any way. Could the difference be just due to randomness? Maybe it makes sense to also plot the prediction intervals to make sure there's a real difference.

anishsaha12 commented 1 week ago

Thanks for looking @abdulfatir. This is what the prediction intervals look like for the two cases. I think they look different. [image: prediction intervals for the two batch sizes]

The only thing I am changing between the two runs is the batch size. I use the same seed via transformers.set_seed() to avoid differences in randomness.

abdulfatir commented 1 week ago
  1. transformers.set_seed() will only set the initial seed. However, the random state will not stay the same due to the different batch sizes, so you should not expect the same results for the first 8 elements (see the sketch after this list).
  2. I just noticed that you're also using temperature=2.0, which softens the distribution further, and it is indeed possible to draw samples that look like what you see above. I would remove the random seed and generate a few plots to see what the random generations look like. Also, set temperature to a lower value.
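On point 1, here is a toy illustration (plain PyTorch, not Chronos internals) of why the same seed does not keep batches of different sizes in sync: the larger batch consumes more random numbers per step, so the generator states diverge after the first step.

import torch

# Draw "step 1" samples for batch sizes 8 and 32 from the same seed, then
# compare the "step 2" samples for the first 8 elements.
torch.manual_seed(0)
_ = torch.rand(8)    # step 1 with batch size 8 consumes 8 numbers
a = torch.rand(8)    # step 2

torch.manual_seed(0)
_ = torch.rand(32)   # step 1 with batch size 32 consumes 32 numbers
b = torch.rand(32)   # step 2

print(torch.equal(a, b[:8]))  # False: the generator states diverged after step 1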

If you still see differences, it would be great to have a minimal working example; otherwise, it is difficult to investigate whether there is a real issue. There should not be a difference in performance with different batch sizes.
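On point 2, a toy illustration of how temperature flattens the sampling distribution (standard logits/temperature scaling over hypothetical token bins, not the actual Chronos distribution):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Dividing the logits by a higher temperature flattens the resulting
# categorical distribution, so individual sample paths spread out more.
logits = np.array([2.0, 1.0, 0.5, 0.0, -1.0])
for temperature in (1.0, 2.0):
    print(temperature, np.round(softmax(logits / temperature), 3))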

anishsaha12 commented 1 week ago

Thanks @abdulfatir for your suggestions. I removed the random seed and generated a few plots using my internal data. The generations generally follow the same pattern but are still quite different. I also tried temperature=1.0, which does diminish the difference, but the values for different batch sizes are still quite different. As a side note, I would probably need to use temperature=2.0 for my data, as the softer distribution helps capture the dramatic seasonality in the dataset I am using.

For the sake of reproducibility, I have created an artificial dataset and generated forecasts for different batch sizes using the "amazon/chronos-t5-small" checkpoint. Attaching this minimal working example.

The predictions are different with different batch sizes, even for the first 64 time steps. [images: forecasts for the different batch sizes on the artificial dataset]

The attached example contains more details about the data and forecast evaluation.

Thanks again for taking a look!

abdulfatir commented 1 week ago

Thanks a lot for your effort! Honestly, I don't really see a meaningful difference in the 64-day visualization. The minor differences can be explained by the randomness. Just to clarify: when you set transformers.set_seed() initially, you should expect to see different predictions if you're using a different batch size. Consider this: you start from the same random state, but at each forecast timestep you're drawing a different number of random samples. For batch size 8, you're drawing 8 x num_samples at every step, whereas for 32 you're drawing 32 x num_samples. This means that your random state will already be different after the first forecast step. If you want more stable median predictions, I would recommend using a larger value for num_samples. The differences in long-horizon predictions could be an artifact of how predictions beyond 64 steps are done for the public models: the median prediction is used as the ground truth to unroll further.
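As a rough illustration of the num_samples suggestion (using a toy Gaussian, not Chronos samples): the sample median is itself random, and its spread shrinks roughly as 1/sqrt(num_samples).

import numpy as np

rng = np.random.default_rng(0)
for num_samples in (20, 100, 500):
    # Standard deviation of the median across 10,000 repeated draws.
    medians = np.median(rng.normal(size=(10_000, num_samples)), axis=1)
    print(num_samples, round(medians.std(), 3))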

anishsaha12 commented 1 week ago

Thanks for your explanation; it makes sense to me why we observe the minor differences for different batch sizes. Just to clarify: yes, the public model uses its own prediction to unroll beyond the 64-step horizon. However, I am using a fine-tuned model which has been trained with a 365-step horizon. Anyway, I understand this doesn't have much to do with the observed problem itself (just wanted to point it out).

Coming back to the original issue, I see that this is an unavoidable artifact. I'd like to go a step back and provide context on why I observed this. I fine-tuned the model with a batch size of 8 and used it for inference on hundreds of thousands of time-series. To speed up the process I used a batch size of 32, and observed this issue: a huge under-forecast of the seasonal peak (at around the 200th time step in the future).

Since this is a random process, I guess the mismatch is to be expected. Your suggestion regarding increasing num_samples makes sense, but I am limited by a GPU memory bottleneck. I guess this is a trade-off I need to make. Do you have any more suggestions on best practices for such practical scenarios, now that you have a little more context about my problem setup?

Regardless, this is a great paper and can do wonders on fine-tuning! Thanks!

abdulfatir commented 1 week ago

> I'd like to go a step back and provide context on why I observed this. I fine-tuned the model with a batch size of 8 and used it for inference on hundreds of thousands of time-series. To speed up the process I used a batch size of 32, and observed this issue: a huge under-forecast of the seasonal peak (at around the 200th time step in the future).

Inference batch size has no relation to the training batch size and should have no effect. Looking at the series again, it does look like there is high uncertainty at the seasonal peak. Does that align with your understanding of the series? In this case, maybe it makes more sense to rely on the prediction intervals for downstream use rather than just the median?

I would also try reducing the temperature while increasing top_k.
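For example, something along these lines (the values are just a starting point, assuming the same pipeline and context objects as in your snippet):

import numpy as np
import torch

# `pipeline` and `context` as in the original snippet above.
with torch.no_grad():
    forecast = pipeline.predict(
        context,
        prediction_length=365,
        num_samples=20,          # or larger, if memory allows
        temperature=1.0,         # lower temperature than 2.0
        top_k=500,               # larger top_k than 200
        top_p=1.0,
    )
    # Use the full interval, not only the median, for downstream decisions.
    low, median, high = np.quantile(forecast.numpy(), [0.1, 0.5, 0.9], axis=1)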

> Regardless, this is a great paper and can do wonders on fine-tuning! Thanks!

Thank you for your nice words. Happy to know that it works well. :)

anishsaha12 commented 1 week ago

> Does that align with your understanding of the series?

The model predicts with high uncertainty at the seasonal peak. I don't think it necessarily aligns with the past observations of the series. This series in particular is actually very predictable and it has had the same trend and seasonality for the past 4 years.

> rely on the prediction intervals for downstream use rather than just the median?

I am not so confident about using these prediction intervals. The forecasting model we currently use is just an ensemble of a few regression-based models, and it achieves less than 2% APE. We were exploring Chronos to see if the existing predictions can be improved (not just for this series but for the other thousands of series). My intuition is that, since Chronos does not use any date features (or other exogenous / multivariate features), it is uncertain about the precise placement of the peak.

> I would also try reducing the temperature while increasing top_k.

I tried fixing the temperature at a lower value and checked the performance for many different values of top_k. I did not notice any strong correlation that could help me decide on a good value for these hyperparameters.