Fix output transform, add test to enforce tokenizer consistency

amazon-science / chronos-forecasting

Chronos: Pretrained (Language) Models for Probabilistic Time Series Forecasting

https://arxiv.org/abs/2403.07815

Apache License 2.0

2.02k stars, 238 forks

Fix output transform, add test to enforce tokenizer consistency #73

Closed HugoSenetaire closed 1 month ago

HugoSenetaire commented 1 month ago

Description of changes:

The bin indices were shifted by one between the input transform and the output transform. Subtracting 1 from the sampled tokens in the output transform leads to the correct reconstruction of the signal.

Add a test to ensure the consistency of the Chronos Tokenizer.
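To illustrate the shift described above, here is a minimal round-trip sketch in plain Python. It is not the Chronos implementation: the vocabulary size, number of special tokens, and bin limits are hypothetical values chosen for readability, `bisect_right` stands in for a right-sided bucketize, the scaling step is left out, and the `shift` parameter exists only to compare the behaviour with and without the subtraction.

```python
from bisect import bisect_right

# Hypothetical settings, chosen only to keep the numbers readable.
N_TOKENS = 10
N_SPECIAL = 2
LOW, HIGH = -3.0, 3.0


def linspace(lo, hi, n):
    """Return n evenly spaced points from lo to hi (just lo when n == 1)."""
    if n == 1:
        return [lo]
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


# Same counting as discussed in this thread: one fewer center than
# there are non-special tokens.
centers = linspace(LOW, HIGH, N_TOKENS - N_SPECIAL - 1)
midpoints = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
boundaries = [-1e20] + midpoints + [1e20]


def input_transform(x):
    # Because of the -1e20 sentinel, the bucket index is at least 1
    # for any realistic value, so tokens start at N_SPECIAL + 1.
    return bisect_right(boundaries, x) + N_SPECIAL


def output_transform(token, shift=1):
    # shift=1 applies the subtraction from this PR; shift=0 omits it.
    idx = token - N_SPECIAL - shift
    idx = max(0, min(idx, len(centers) - 1))  # keep the index in range
    return centers[idx]


def round_trip(x, shift=1):
    return output_transform(input_transform(x), shift)


print(round_trip(0.0))           # 0.0
print(round_trip(0.0, shift=0))  # 1.0, one bin too high
```

A consistency test in this spirit would check that every bin center maps back to itself after tokenizing and de-tokenizing.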

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Co-authored-by: Lorenzo Stella <stellalo@amazon.com> and Abdul Fatir Ansari <ansarnd@amazon.com>

abdulfatir commented 1 month ago

Thank you for finding this issue @HugoSenetaire! 🚀

yoadsn commented 2 weeks ago

Sorry to bring this back from the dead, but isn't this fix compensating for the original error, which is in how "centers" is defined in __init__? The linspace call uses config.n_tokens - config.n_special_tokens - 1 as the number of points, which seems to be missing one center. For example, with n_tokens = 3 and n_special_tokens = 1 you get 1 center, but I guess 2 tokens need 2 centers? Otherwise the boundaries on the next line would just be [-1e20, 1e20] regardless of low_limit/high_limit.

I am probably misunderstanding this, but I thought it was worth asking.

Otherwise, instead of fixing downstream, probably best to fix upstream?
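The counting in the example above can be checked with a small sketch. The helper `build_bins` and the limit values are hypothetical, and the plain-Python `linspace` is only a stand-in for the tensor version; it is assumed here to return just the lower limit when asked for a single point.

```python
def linspace(lo, hi, n):
    """Return n evenly spaced points from lo to hi (just lo when n == 1)."""
    if n == 1:
        return [lo]
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def build_bins(low_limit, high_limit, n_tokens, n_special_tokens):
    # Hypothetical helper mirroring the counting discussed above.
    n_centers = n_tokens - n_special_tokens - 1
    centers = linspace(low_limit, high_limit, n_centers)
    midpoints = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
    boundaries = [-1e20] + midpoints + [1e20]
    return centers, boundaries


# The case from the comment: 3 tokens, 1 of them special.
centers, boundaries = build_bins(-15.0, 15.0, n_tokens=3, n_special_tokens=1)
print(len(centers))  # 1
print(boundaries)    # only the two sentinel values remain
```

With a single center there are no midpoints, so the limits no longer influence the boundaries at all.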

lostella commented 2 weeks ago

@yoadsn indeed you're right, one bin is wasted in the current code. However, we realized that this is the bin configuration that was used during pretraining of the models we published on HuggingFace (trained using a different code base), so we had to stick with it to get the intended behaviour at prediction time. So it was really the output transformation that ended up being off-by-one, and one bin is unused.
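As a rough illustration of the unused bin, the sketch below sweeps a range of values through a right-sided bucketize (emulated with `bisect_right`) and records which token ids ever appear. The vocabulary size, special-token count, and bin centers are hypothetical, and this is not the published code.

```python
from bisect import bisect_right

# Hypothetical settings: 10 tokens, 2 of them special.
N_TOKENS = 10
N_SPECIAL = 2

# N_TOKENS - N_SPECIAL - 1 = 7 centers, evenly spaced from -3 to 3.
centers = [float(c) for c in range(-3, 4)]
midpoints = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
boundaries = [-1e20] + midpoints + [1e20]

# Tokenize a sweep of values from -100 to 100 in steps of 0.1.
values = [v / 10 for v in range(-1000, 1001)]
produced = {bisect_right(boundaries, v) + N_SPECIAL for v in values}

non_special = set(range(N_SPECIAL, N_TOKENS))
unused = non_special - produced

print(sorted(produced))  # [3, 4, 5, 6, 7, 8, 9]
print(sorted(unused))    # [2]
```

The first non-special id never appears, which is why the output side has to subtract one extra position when indexing the centers.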

There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.

(source)

yoadsn commented 2 weeks ago

I appreciate the clarification.