NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, providing better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

Why is the first dimension `sequence_length` as opposed to `batch_size` in the input tensor #195

Closed vgoklani closed 1 year ago

vgoklani commented 1 year ago

Why does the library use sequence_length as the first dimension of the input tensor rather than batch_size?

Is this just a convention carried over from RNNs, or is the difference performance-related?

From the example code:

bmm1 = torch.bmm(query.transpose(0, 1), key.transpose(0, 1).transpose(1, 2)) / self.norm_factor
https://github.com/NVIDIA/TransformerEngine/blob/main/docs/examples/quickstart_utils.py#L93
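
Spelling out the shapes as I understand them (the sizes and the (seq, batch * heads, head_dim) view below are my guesses, purely for illustration):

```python
import torch

# Illustrative sizes only (not taken from the example itself)
sq, sk, b, np_, hn = 128, 128, 2, 16, 64

# Assuming query/key are viewed as (seq, batch * heads, head_dim)
query = torch.randn(sq, b * np_, hn)
key = torch.randn(sk, b * np_, hn)

q = query.transpose(0, 1)                # (b*np, sq, hn)
k = key.transpose(0, 1).transpose(1, 2)  # (b*np, hn, sk)

scores = torch.bmm(q, k)                 # (b*np, sq, sk)
print(scores.shape)                      # torch.Size([32, 128, 128])
```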

Why are the two transpose(0, 1) operations needed here?

Thanks!

ksivaman commented 1 year ago

Having either batch size or sequence length as the first dimension of the input tensor can be beneficial for performance, depending on the attention implementation and some other details. We chose sequence length as the first dimension so that we can add support for sequence parallelism. That said, we will soon add a batch_first argument to our APIs that will optionally support input tensors with batch_size as the first dimension.
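
In the meantime, the usual workaround is to transpose at the module boundary. A rough sketch (the hyperparameters are illustrative, and I'm assuming the default (seq, batch, hidden) input layout of TransformerLayer; double-check against your installed version):

```python
import torch
import transformer_engine.pytorch as te

# Illustrative sizes/hyperparameters only
batch, seq, hidden, heads = 4, 512, 1024, 16

layer = te.TransformerLayer(
    hidden_size=hidden,
    ffn_hidden_size=4 * hidden,
    num_attention_heads=heads,
).cuda()

x_bsh = torch.randn(batch, seq, hidden, device="cuda")  # batch-first data

# Transpose into the (seq, batch, hidden) layout the layer currently expects
x_sbh = x_bsh.transpose(0, 1).contiguous()
y_sbh = layer(x_sbh)

# Transpose back to batch-first for the rest of the pipeline
y_bsh = y_sbh.transpose(0, 1)
```

The transpose itself is just a view; the contiguous() call is where the copy happens, so it's worth keeping the conversion at the pipeline boundary.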

vgoklani commented 1 year ago

that makes sense, thank you!

bryangopal commented 10 months ago

> That said, we will soon add a batch_first argument to our APIs that will optionally support input tensors with batch_size as the first dimension.

any updates on the batch_first addition?

timmoon10 commented 10 months ago

The attention module has logic to handle multiple formats (e.g. SBHD, BSHD); see https://github.com/NVIDIA/TransformerEngine/blob/666539f36275fa9c0fbc99f9ea50f2d6e29e336f/transformer_engine/pytorch/attention.py#L1821. However, we haven't exposed this in the Transformer layer yet. Pinging @cyanguwa.
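
For anyone mapping those acronyms to shapes, a quick plain-PyTorch illustration (S = sequence length, B = batch, H = number of heads, D = head dim; sizes made up):

```python
import torch

s, b, h, d = 512, 4, 16, 64  # illustrative sizes

q_sbhd = torch.randn(s, b, h, d)  # SBHD: sequence-first layout
q_bshd = q_sbhd.transpose(0, 1)   # BSHD: batch-first view of the same data

assert q_bshd.shape == (b, s, h, d)
```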

bryangopal commented 9 months ago

@cyanguwa just following up, would appreciate an update!