Pyroluk opened this issue 2 years ago
If I understand correctly, you have a problem with encoding the dataset properly to work with TFT. This is not a bug in the TSPP nor in the TFT. TFT by design works with standardized data. Your question is more about how to make your dataset interpretable by a deep learning model. You have to ask yourself a couple of questions about your data before you start working with deep learning models.
As for AMP, yes, it diverges more often than FP32 training. This often happens because of incorrectly chosen hyperparameters, a faulty model, or incorrectly prepared data. TFT has been extensively tested for stability with AMP, and the results show that it is stable with the hyperparameters used in the original paper. Moreover, TSPP uses an even more conservative casting schema than our standalone implementation of TFT. No deep learning model (at least none known to me) needs the FP64 dynamic range. Either way, performing computation in FP64 is prohibitively slow.
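For reference, a generic AMP training step in PyTorch looks roughly like the sketch below. This is only an illustration of the mechanism, not TSPP's actual casting schema; `model`, `loss_fn`, `optimizer`, and `loader` are placeholders for your own objects.

```python
import torch

# Illustrative sketch of a mixed-precision training step (NOT the TSPP code path).
scaler = torch.cuda.amp.GradScaler()

for batch, target in loader:
    optimizer.zero_grad()
    # The forward pass runs in mixed precision; numerically sensitive parts
    # (here, the loss) are kept in FP32 explicitly.
    with torch.cuda.amp.autocast():
        output = model(batch)
    loss = loss_fn(output.float(), target.float())
    # GradScaler rescales the loss so that small FP16 gradients do not underflow.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```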
Thank you for taking time out of your day to have a look at my problem.
I can happily report that I solved it by fixing my externalized standard-scaling code. A bug there caused the averaged-looking output. Using the built-in scale_per_id feature of TSPP took way too long (days) just for preprocessing. I could improve the speed quite a bit, but due to Python there is a hard limit on what is possible without rewriting everything scaler-related and moving it to C/C++/Pandas or similar. For my use case, I moved the scaling over to SQL, reducing the time from days of preprocessing to ~5 minutes (single threaded!).
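For anyone hitting the same issue, a minimal sketch of the per-id standardization logic in vectorized pandas is shown below. The column names `id` and `target` are assumptions; this mirrors the scaling computation only, it is neither the TSPP scale_per_id implementation nor my SQL version.

```python
import pandas as pd

def scale_per_id(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize the target column separately for every time-series id."""
    grouped = df.groupby("id")["target"]
    mean = grouped.transform("mean")
    std = grouped.transform("std").replace(0, 1.0)  # guard against constant series
    out = df.copy()
    out["target_scaled"] = (out["target"] - mean) / std
    return out
```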
The predictions look very good, even though half a million time series are predicted at once. Would you consider adding support for scaling this many time series in TSPP? If yes, I can open a feature request for it, in case that helps in some way.
Regarding AMP, I still have the problem of early divergence. I will try the hyperparameters from the paper next.
We are aware of the problem with preprocessing speed and we track it on our side. Some improvements will come in the next release.
If your gradients explode, try using gradient clipping. You can enable it by appending +trainer.config.gradient_norm=<max_norm> to your command.
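Under the hood, gradient-norm clipping amounts to something like the following PyTorch call. This is only an illustration, not the TSPP code path; `model`, `loss`, and `optimizer` are assumed to exist.

```python
import torch

loss.backward()
# Rescale all gradients in place so that their global L2 norm does not exceed max_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```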
Hyperparameters from the paper are good for the datasets featured in the paper. To find the set that suits your case best, use the Optuna plugin from the Hydra package. Here is an example of how to run it.
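A minimal sketch of such a sweep, assuming the hydra-optuna-sweeper plugin is installed; the launch script name and the parameter paths are placeholders you need to adapt to your configuration:

```
python <your_tspp_launch_script> --multirun \
    hydra/sweeper=optuna \
    hydra.sweeper.n_trials=20 \
    'model.config.hidden_size=choice(160,240,320)' \
    'model.config.dropout=interval(0.0,0.3)'
```

Each trial runs training with one sampled combination, and Optuna picks the next combination based on the returned metric.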
Related to Temporal Fusion Transformer (TFT) in Time-Series Prediction Platform (TSPP)
Use case: Predict time series of rankings of items within categories. For example, item 1 occupies place 10 of the most viewed items in category A. Item 2 occupies place 100,000 of the most viewed items in category A and place 500 in category B. The places shift freely with each day. Item 2 could occupy place 10 of category A the next week.
Describe the bug
When training over half a million concurrent time series, which have vastly different dynamic ranges, only time series with comparably large values are learned effectively. For example, time series with values from 10,000 to 500,000 are trained perfectly well with very good predictions, but for time series with values between 0 and 500 the predictions are far off, look alike, and do not seem to be specific to any of the time series.
What I tried:
All this leads me to believe that the dynamic range of FP32 might be too small in some parts of the network to represent the large dynamic ranges of my use case. If that is the case, which parts of the network would need to use FP64 instead? I used a network with n_head: 10, hidden_size: 320, dropout: 0.1, and attn_dropout: 0.01. Is this too small for my use case?
To Reproduce
Expected behavior
Adequate predictions of values for all time series, not just time series with comparably large values.
Environment