Hi, it's true that the prediction loop is different for the Transformer models vs. the LSTM. The LSTM generates each timestep of the output in sequence, while the Transformers predict every timestep in one forward pass. We set this up by passing the decoder inputs as zeros and turning off decoder masking so that the predictions can share information. The code makes that a little confusing because there are mask options that are disabled. We also pass the true dataset target values into the training step but replace them with zeros before they go into the decoder, so inference does not require ground-truth data.
Because all the decoder tokens are zero, there is no information leak from future timesteps and we can skip the mask. I'm working on some changes to make inference simpler by creating the zero sequence automatically.
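To make that concrete, here is a minimal sketch of the inference path described above, assuming a generic encoder-decoder forecasting model; the names (`model`, `enc_x`, `enc_y`, `use_decoder_mask`) are placeholders, not the actual spacetimeformer API:

```python
import torch

# Minimal sketch (placeholder API): at inference time the decoder input is an
# all-zeros tensor with the target shape, and the causal mask is disabled, so
# every future timestep is predicted in a single forward pass.
def predict(model, enc_x, enc_y, target_len, target_dim):
    batch = enc_y.shape[0]
    # zero placeholder in place of ground-truth targets
    dec_y = torch.zeros(batch, target_len, target_dim, device=enc_y.device)
    with torch.no_grad():
        # no causal mask is applied; this is safe because the zero tokens
        # carry no information about the true future values
        preds = model(enc_x, enc_y, dec_y, use_decoder_mask=False)
    return preds
```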
Hi authors, thanks for sharing the code of this paper. I have a question when trying to implement these models. It seems like multi-step-ahead prediction is handled differently between the models on the PeMS dataset.
For example, in the LSTM model, the output of the decoder at each timestep is used as the decoder input to generate the prediction at the next step, meaning we don't use any ground-truth data at inference time. However, in the spacetimeformer model, we simply use a mask to hide future positions but still use ground-truth data as the decoder input. Essentially, I think this is equivalent to rolling 1-step-ahead prediction, which is different from the multi-step-ahead prediction in the LSTM model.
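For reference, the LSTM-style autoregressive rollout described here looks roughly like the sketch below; `encoder` and `decoder` are hypothetical modules (the decoder is assumed to return a prediction in the same feature space as its input), not code from the repo:

```python
import torch

# Sketch of seq2seq LSTM inference: the decoder's prediction at step t becomes
# its input at step t + 1, so no ground-truth targets are needed at inference.
def lstm_rollout(encoder, decoder, enc_x, target_len):
    with torch.no_grad():
        _, hidden = encoder(enc_x)              # encode the context window
        dec_in = enc_x[:, -1:, :]               # seed with the last observed step
        preds = []
        for _ in range(target_len):
            out, hidden = decoder(dec_in, hidden)  # one-step-ahead prediction
            preds.append(out)
            dec_in = out                           # feed the prediction back in
        return torch.cat(preds, dim=1)
```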