You are correct that linear layers are applied to the last dimension of the input tensor (regardless of how many dimensions the tensor has). Hence, for this type of layer, wrapping them with `TimeDistributed` has no effect, especially when it is called with `return_reshaped=True`.
However, linear layers are not the only modules wrapped with `TimeDistributed` - there are also `InputChannelEmbedding` modules that use it. To keep things uniform, we wrapped the linear layers and the `GLU`s as well, so that it would be clearer.
If I remember correctly, the `view` operations involved in the call to `TimeDistributed` do not have performance implications for the linear layers. If you encounter different findings, please let me know.
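For reference, here is a minimal sketch of what a `TimeDistributed`-style wrapper typically does (illustrative only, not the exact code from this repo; the class name is made up, and `return_reshaped` mirrors the flag mentioned above):

```python
import torch
import torch.nn as nn

class TimeDistributedSketch(nn.Module):
    """Applies `module` to every time step by folding time into the batch dim."""

    def __init__(self, module: nn.Module, return_reshaped: bool = True):
        super().__init__()
        self.module = module
        self.return_reshaped = return_reshaped

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, H, F) -> (N * H, F)
        n, h = x.shape[0], x.shape[1]
        flat = x.contiguous().view(n * h, -1)
        out = self.module(flat)
        if self.return_reshaped:
            # (N * H, out_features) -> (N, H, out_features)
            out = out.view(n, h, -1)
        return out
```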
Does that answer your question?
Ok, but under `InputChannelEmbedding`, couldn't we simply change the `forward` of `NumericInputTransformation` (for example) to work on `x[:, :, [i]]` instead of `x[:, [i]]`? That would, by default, make `TimeDistributed` redundant again.
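To make the suggestion concrete, here is a toy sketch of the two indexing patterns (the names, shapes, and per-variable linear list are illustrative assumptions, not the repo's actual `NumericInputTransformation` code):

```python
import torch
import torch.nn as nn

# Toy per-variable projection, in the spirit of NumericInputTransformation.
num_vars, state_size = 3, 8
layers = nn.ModuleList([nn.Linear(1, state_size) for _ in range(num_vars)])

x = torch.randn(4, 10, num_vars)  # (N, H, num_vars) temporal numeric inputs

# Current pattern: fold time into the batch, then slice per variable.
flat = x.view(-1, num_vars)                               # (N * H, num_vars)
out_wrapped = [layers[i](flat[:, [i]]) for i in range(num_vars)]

# Suggested pattern: slice the 3-D tensor directly; nn.Linear acts on the
# last dimension, so no flattening is needed.
out_direct = [layers[i](x[:, :, [i]]) for i in range(num_vars)]

# Both routes produce the same numbers, up to reshaping.
for a, b in zip(out_wrapped, out_direct):
    assert torch.allclose(a.view(4, 10, state_size), b)
```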
Unless you're saying something along the lines of:
`TimeDistributed` allows for a standardized way to apply any module to sequential data, regardless of whether that module was originally designed to handle 3D tensors. This means you can take any module that expects a 2D input and apply it to your sequential data without modifying the original module's code.
In that case - I'm all for it.
Thank you for answering, by the way.
> Unless you're saying something along the lines of: `TimeDistributed` allows for a standardized way to apply any module to sequential data, regardless of whether that module was originally designed to handle 3D tensors. This means you can take any module that expects a 2D input and apply it to your sequential data without modifying the original module's code.
Exactly!
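As a hypothetical illustration of that point: a module such as `nn.BatchNorm1d` does not broadcast over the time dimension the way `nn.Linear` does, so the time-folding reshape is what makes it usable on sequential data:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 10, 16)   # (N, H, F)
bn = nn.BatchNorm1d(16)      # expects (N, F) or (N, C, L) input

# bn(x) raises an error here: a 3-D input is read as (N, C, L), so the
# channel dim would be H=10, which does not match num_features=16.
# Folding time into the batch makes the same module work uniformly:
n, h, f = x.shape
out = bn(x.view(n * h, f)).view(n, h, f)
print(out.shape)  # torch.Size([4, 10, 16])
```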
The `TimeDistributed` layer seems to be applying a module onto every time slice in the data. For example, for a given tensor `x` with shape `(N, H, F)`, `TimeDistributed(nn.Linear(F, out))` applies a linear layer on every time stamp `t = 1, 2, ..., H`. Having said that, the default behavior of `nn.Linear` is to apply a linear layer on every time slice anyway. In other words, are we getting any benefit aside from parallel computation with batch `N * H` compared to `N`?