Pyroluk opened this issue 2 years ago
If I understand correctly, you have a problem with encoding the dataset properly to work with TFT. This is not a bug in the TSPP nor in the TFT. TFT by design works with standardized data. Your question is more about how to make your dataset interpretable by a deep learning model. You have to ask yourself a couple of questions about your data before you start working with deep learning models.
As for AMP, yes, it diverges more often than FP32 training. This often happens because of incorrectly chosen hyperparameters, a faulty model, or incorrectly prepared data. TFT has been extensively tested for stability with AMP, and the results show that it is stable with the hyperparameters used in the original paper. Moreover, TSPP uses an even more conservative casting schema than our standalone implementation of TFT. No deep learning model (at least none known to me) needs the FP64 dynamic range. Either way, performing computation in FP64 is prohibitively slow.
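For reference, a generic AMP training step in PyTorch looks roughly like the sketch below. This is only an illustration of the mechanism, not TSPP's actual casting schema; `model`, `loss_fn`, `optimizer`, and `loader` are placeholders for your own objects.

```python
import torch

# Illustrative sketch of a mixed-precision training step (NOT the TSPP code path).
scaler = torch.cuda.amp.GradScaler()

for batch, target in loader:
    optimizer.zero_grad()
    # The forward pass runs in mixed precision; numerically sensitive parts
    # (here, the loss) are kept in FP32 explicitly.
    with torch.cuda.amp.autocast():
        output = model(batch)
    loss = loss_fn(output.float(), target.float())
    # GradScaler rescales the loss so that small FP16 gradients do not underflow.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```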
Thank you for taking time out of your day to have a look at my problem.
I can happily report that I solved it by fixing my externalized standard-scaling code. A bug there caused the averaged-looking output. Using the built-in scale_per_id feature of TSPP took way too long (days) just for preprocessing. I could improve the speed quite a bit, but due to Python there is a hard limit on what is possible without rewriting everything scaler-related and moving it to C/C++/Pandas or similar. For my use case, I moved the scaling over to SQL, reducing the time from days of preprocessing to ~5 minutes (single threaded!).
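For anyone hitting the same issue, a minimal sketch of the per-id standardization logic in vectorized pandas is shown below. The column names `id` and `target` are assumptions; this mirrors the scaling computation only, it is neither the TSPP scale_per_id implementation nor my SQL version.

```python
import pandas as pd

def scale_per_id(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize the target column separately for every time-series id."""
    grouped = df.groupby("id")["target"]
    mean = grouped.transform("mean")
    std = grouped.transform("std").replace(0, 1.0)  # guard against constant series
    out = df.copy()
    out["target_scaled"] = (out["target"] - mean) / std
    return out
```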
The predictions look very good, even though half a million time series are predicted at once. Would you consider adding support for scaling this many time series in TSPP? If yes, I can open a feature request for it, in case that helps in some way.
Regarding AMP, I still have the problem of early divergence. I will try the hyperparameters from the paper next.
We are aware of the problem with preprocessing speed and we track it on our side. Some improvements will come in the next release.
If your gradients explode, try using gradient clipping. You can enable it by appending +trainer.config.gradient_norm=<max_norm> to your command.
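Under the hood, gradient-norm clipping amounts to something like the following PyTorch call. This is only an illustration, not the TSPP code path; `model`, `loss`, and `optimizer` are assumed to exist.

```python
import torch

loss.backward()
# Rescale all gradients in place so that their global L2 norm does not exceed max_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```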
Hyperparameters from the paper are good for the datasets featured in the paper. To find the set that suits your case best, use the Optuna plugin from the Hydra package. Here is an example of how to run it.
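A minimal sketch of such a sweep, assuming the hydra-optuna-sweeper plugin is installed; the launch script name and the parameter paths are placeholders you need to adapt to your configuration:

```
python <your_tspp_launch_script> --multirun \
    hydra/sweeper=optuna \
    hydra.sweeper.n_trials=20 \
    'model.config.hidden_size=choice(160,240,320)' \
    'model.config.dropout=interval(0.0,0.3)'
```

Each trial runs training with one sampled combination, and Optuna picks the next combination based on the returned metric.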
Related to Temporal Fusion Transformer (TFT) in Time-Series Prediction Platform (TSPP)
Use case: Predict time series of rankings of items within categories. For example, item 1 occupies place 10 of the most viewed items in category A. Item 2 occupies place 100,000 of the most viewed items in category A and place 500 in category B. The places shift freely with each day. Item 2 could occupy place 10 of category A the next week.
Describe the bug
When training over half a million concurrent time series, which have vastly different dynamic ranges, only time series with comparably large values are learned effectively. For example, time series with values from 10,000 to 500,000 are trained perfectly well with very good predictions, but for time series with values between 0 and 500 the predictions are far off, look alike, and do not seem to be specific to any of the time series.
What I tried:
All this leads me to believe that the dynamic range of FP32 might be too small in some parts of the network to represent the large dynamic ranges of my use case. If that is the case, which parts of the network would need to use FP64 instead? I used a network with n_head: 10, hidden_size: 320, dropout: 0.1, and attn_dropout: 0.01. Is this too small for my use case?
To Reproduce
Expected behavior
Adequate predictions of values for all time series, not just time series with comparably large values.
Environment