Having either batch size or sequence length as the first dimension of the input tensor can be beneficial for performance, depending on the attention implementation and some other details. Our choice to stick with sequence length as the first dimension was made so that we can add support for sequence parallelism. That being said, we will soon add a batch_first argument to our APIs that will optionally support input tensors with batch_size as their first dimension.
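Until then, a rough sketch of the workaround with today's API is to transpose at the boundaries of the layer; the sizes and layer configuration below are purely illustrative:

```python
import torch
import transformer_engine.pytorch as te

# Illustrative sizes only.
batch_size, seq_len, hidden_size = 4, 128, 1024

layer = te.TransformerLayer(
    hidden_size=hidden_size,
    ffn_hidden_size=4 * hidden_size,
    num_attention_heads=16,
).cuda()

# Data produced in batch-first layout: [batch_size, seq_len, hidden_size].
x_bsh = torch.randn(batch_size, seq_len, hidden_size, device="cuda")

# The layer currently expects sequence-first input: [seq_len, batch_size, hidden_size],
# so transpose on the way in and transpose back on the way out.
x_sbh = x_bsh.transpose(0, 1).contiguous()
y_sbh = layer(x_sbh)
y_bsh = y_sbh.transpose(0, 1)
```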
that makes sense, thank you!
Any updates on the batch_first addition?
The attention module has logic to handle multiple formats (e.g. SBHD, BSHD). See: https://github.com/NVIDIA/TransformerEngine/blob/666539f36275fa9c0fbc99f9ea50f2d6e29e336f/transformer_engine/pytorch/attention.py#L1821 However, we haven't exposed this in the Transformer layer yet. Pinging @cyanguwa.
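If you need batch-first support today, one option is to call the attention module directly. A minimal sketch, assuming the qkv_format and kv_channels argument names from the current code (please verify against the TransformerEngine version you have installed):

```python
import torch
from transformer_engine.pytorch import DotProductAttention

batch, seq, heads, head_dim = 2, 128, 16, 64

# Assumption: DotProductAttention accepts a qkv_format hint ("sbhd", "bshd", ...);
# double-check the constructor arguments in your installed version.
attn = DotProductAttention(
    num_attention_heads=heads,
    kv_channels=head_dim,
    qkv_format="bshd",
)

# Batch-first query/key/value tensors: [batch, seq, heads, head_dim].
q = torch.randn(batch, seq, heads, head_dim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = attn(q, k, v)  # output is batch-first as well, with heads merged into the hidden dim
```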
@cyanguwa just following up, would appreciate an update!
Why does the library use sequence_length as the first dimension of the input tensor, as opposed to batch_size? Is this just a convention carried over from RNNs, or is the difference performance related?
From the example code, I also see two successive transpose(0, 1) operations; what are they for?
Thanks!
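For context, the pattern I mean looks roughly like this (paraphrased; the variable names and shapes are mine, not the exact example code):

```python
import torch

batch_size, seq_len, hidden_size = 8, 512, 1024

x = torch.randn(batch_size, seq_len, hidden_size)  # [batch, seq, hidden]
x = x.transpose(0, 1)   # first transpose: [seq, batch, hidden] before the layer
# ... the Transformer layer runs on the sequence-first tensor here ...
x = x.transpose(0, 1)   # second transpose: back to [batch, seq, hidden]
```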