You are correct that linear layers are applied to the last dimension of the input tensor (regardless of how many dimensions the tensor has). Hence, for this type of layer, wrapping them with `TimeDistributed` has no effect, especially when it is called with `return_reshaped=True`.
However, linear layers are not the only modules wrapped with `TimeDistributed` - there are also `InputChannelEmbedding` modules that use it. To keep things uniform, we wrapped the linear layers and the `GLU`s as well, so that it would be clearer.
If I remember correctly, the `view` operations involved in the call to `TimeDistributed` do not have performance implications for the linear layers. If you encounter different findings, please let me know.
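For reference, here is a minimal sketch of what a `TimeDistributed`-style wrapper typically does (illustrative only, not the exact code from this repo; the class name is made up, and `return_reshaped` mirrors the flag mentioned above):

```python
import torch
import torch.nn as nn

class TimeDistributedSketch(nn.Module):
    """Applies `module` to every time step by folding time into the batch dim."""

    def __init__(self, module: nn.Module, return_reshaped: bool = True):
        super().__init__()
        self.module = module
        self.return_reshaped = return_reshaped

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, H, F) -> (N * H, F)
        n, h = x.shape[0], x.shape[1]
        flat = x.contiguous().view(n * h, -1)
        out = self.module(flat)
        if self.return_reshaped:
            # (N * H, out_features) -> (N, H, out_features)
            out = out.view(n, h, -1)
        return out
```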
Does that answer your question?
Ok, but under `InputChannelEmbedding`, couldn't we simply change the `forward` of `NumericInputTransformation` (for example) to work on `x[:, :, [i]]` instead of `x[:, [i]]`? That would, by default, make `TimeDistributed` redundant again.
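To make the suggestion concrete, here is a toy sketch of the two indexing patterns (the names, shapes, and per-variable linear list are illustrative assumptions, not the repo's actual `NumericInputTransformation` code):

```python
import torch
import torch.nn as nn

# Toy per-variable projection, in the spirit of NumericInputTransformation.
num_vars, state_size = 3, 8
layers = nn.ModuleList([nn.Linear(1, state_size) for _ in range(num_vars)])

x = torch.randn(4, 10, num_vars)  # (N, H, num_vars) temporal numeric inputs

# Current pattern: fold time into the batch, then slice per variable.
flat = x.view(-1, num_vars)                               # (N * H, num_vars)
out_wrapped = [layers[i](flat[:, [i]]) for i in range(num_vars)]

# Suggested pattern: slice the 3-D tensor directly; nn.Linear acts on the
# last dimension, so no flattening is needed.
out_direct = [layers[i](x[:, :, [i]]) for i in range(num_vars)]

# Both routes produce the same numbers, up to reshaping.
for a, b in zip(out_wrapped, out_direct):
    assert torch.allclose(a.view(4, 10, state_size), b)
```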
Unless you're saying something along the lines of:
`TimeDistributed` allows for a standardized way to apply any module to sequential data, regardless of whether that module was originally designed to handle 3D tensors. This means you can take any module that expects a 2D input and apply it to your sequential data without modifying the original module's code.
In that case - I'm all for it.
Thank you for answering, by the way.
> Unless you're saying something along the lines of: `TimeDistributed` allows for a standardized way to apply any module to sequential data, regardless of whether that module was originally designed to handle 3D tensors. This means you can take any module that expects a 2D input and apply it to your sequential data without modifying the original module's code.
Exactly!
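As a hypothetical illustration of that point: a module such as `nn.BatchNorm1d` does not broadcast over the time dimension the way `nn.Linear` does, so the time-folding reshape is what makes it usable on sequential data:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 10, 16)   # (N, H, F)
bn = nn.BatchNorm1d(16)      # expects (N, F) or (N, C, L) input

# bn(x) raises an error here: a 3-D input is read as (N, C, L), so the
# channel dim would be H=10, which does not match num_features=16.
# Folding time into the batch makes the same module work uniformly:
n, h, f = x.shape
out = bn(x.view(n * h, f)).view(n, h, f)
print(out.shape)  # torch.Size([4, 10, 16])
```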
The `TimeDistributed` layer seems to be applying a module onto every time slice in the data. For example, for a given tensor `x` with shape `(N, H, F)`, `TimeDistributed(nn.Linear(F, out))` applies a linear layer on every time stamp `t = 1, 2, ..., H`. Having said that, the default behavior of `nn.Linear` is to apply a linear layer on every time slice anyway. In other words, are we getting any benefit aside from parallel computation with batch `N * H` compared to `N`?