LarsBentsen / FFTransformer

Multi-Step Spatio-Temporal Forecasting: https://authors.elsevier.com/sd/article/S0306-2619(22)01822-0

A little question about the vanilla Transformer's output #5

Closed sunxiaoyao-git closed 1 year ago

sunxiaoyao-git commented 1 year ago

Hello, I have a question about the output of the Transformer in your code. As I understand it, the decoder's input in the Transformer needs the previous-step output of the decoder, so at each step the model is auto-regressive,
such as in the greedy decoding loop from http://nlp.seas.harvard.edu/annotated-transformer/#training-the-system (screenshot of that code attached).
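For concreteness, something like this loop (a rough sketch, not the exact code from the linked page; `encode`, `decode`, `embed` and `out_proj` are assumed methods of a hypothetical forecasting model):

```python
import torch

def greedy_forecast(model, src, src_mask, pred_len, start_token):
    """Auto-regressive multi-step forecasting: the value predicted at each
    step is embedded and fed back as the next decoder input, so inference
    requires pred_len decoder passes."""
    memory = model.encode(src, src_mask)                 # run the encoder once
    dec_in = model.embed(start_token)                    # (batch, 1, d_model)
    preds = []
    for _ in range(pred_len):
        dec_out = model.decode(dec_in, memory, src_mask)     # (batch, t, d_model)
        next_val = model.out_proj(dec_out[:, -1:, :])        # (batch, 1, c_out)
        preds.append(next_val)
        dec_in = torch.cat([dec_in, model.embed(next_val)], dim=1)  # feed back
    return torch.cat(preds, dim=1)                       # (batch, pred_len, c_out)
```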

But in your code, I see that the vanilla Transformer model just applies a linear layer to the full decoder output, rather than decoding auto-regressively. I wonder if you used the Informer's output method for the Transformer, because in the Informer paper the authors say: "We propose generative style decoder to acquire long sequence output with only one forward step needed, simultaneously avoiding cumulative error spreading during the inference phase."

What's your opinion?

LarsBentsen commented 1 year ago

Hi! So there are a few options for making multi-step predictions with an encoder-decoder Transformer. Some make predictions autoregressively, as you say, while others use placeholders for the forecast locations together with some different masking. In our implementation, we use placeholders for the forecast locations (i.e. 1, 6 or 24 steps) and apply a linear transform independently to each of the decoder outputs to obtain the correct output dimension (similar to the FFN modules, just with a single linear layer instead). I believe this is the same method as the authors used in the original paper (Attention is All You Need), or at least the same as the Informer implementation (https://github.com/zhouhaoyi/Informer2020/tree/main). Depending on the application, one might choose different methods. I don't know if this helped your understanding, but please let me know if you have any specific questions on how you would implement the different methods, or need other clarifications. :)
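For illustration, a minimal sketch of this placeholder/one-forward-pass approach (not the repository's exact code; it assumes inputs are already embedded to `d_model` and uses PyTorch's built-in `nn.Transformer` rather than our own layers):

```python
import torch
import torch.nn as nn

class OneShotForecaster(nn.Module):
    """Encoder-decoder Transformer that predicts all horizon steps in a
    single forward pass: the decoder input is a known label section plus
    zero placeholders for the forecast locations, and a final linear layer
    maps d_model -> output dimension independently for every position."""
    def __init__(self, d_model=64, n_heads=4, c_out=1, pred_len=24):
        super().__init__()
        self.pred_len = pred_len
        self.transformer = nn.Transformer(d_model=d_model, nhead=n_heads,
                                          batch_first=True)
        self.out_proj = nn.Linear(d_model, c_out)   # replaces the softmax head

    def forward(self, enc_in, dec_label):
        # enc_in: (batch, src_len, d_model), dec_label: (batch, label_len, d_model)
        placeholders = torch.zeros(dec_label.size(0), self.pred_len,
                                   dec_label.size(2), device=dec_label.device)
        dec_in = torch.cat([dec_label, placeholders], dim=1)
        dec_out = self.transformer(enc_in, dec_in)
        return self.out_proj(dec_out)[:, -self.pred_len:, :]  # forecast steps only
```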

sunxiaoyao-git commented 1 year ago

If I can understand it that way: when a Transformer is used for time-series prediction, whether one-step or multi-step, the final softmax layer is discarded, and it just does point forecasting (screenshot of the Transformer architecture attached). Another time-series application is probabilistic forecasting, which uses the Transformer in a different way.

LarsBentsen commented 1 year ago

Yes, you are correct that the softmax layer is not used for forecasting (or regression) applications. A linear layer is used instead to produce the correct output dimensionality. Some applications also only use the encoder, as for the graph-based models in this study. The critical component of the Transformer is the full multi-head attention mechanism with the subsequent FFN layers. Various alterations exist: here we use a similar encoder-decoder architecture to the one in the image above for the models/Transformer.py script, but omit the decoder for the models/GraphTransformer.py model.
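As a rough illustration of the encoder-only variant (again not the repository code; graph layers and input embeddings are abstracted away, and the layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class EncoderOnlyForecaster(nn.Module):
    """Encoder-only variant: the decoder is dropped entirely and the
    multi-step forecast is read off the encoder outputs with a linear head."""
    def __init__(self, d_model=64, n_heads=4, n_layers=2, c_out=1, pred_len=24):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, c_out)
        self.pred_len = pred_len

    def forward(self, x):
        # x: embedded input sequence, (batch, seq_len, d_model)
        h = self.encoder(x)
        return self.out_proj(h[:, -self.pred_len:, :])  # last pred_len positions
```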

sunxiaoyao-git commented 1 year ago

I get it. Thank you very much!