EleutherAI / megatron-3d


add T5 positional encoding #9

Open StellaAthena opened 3 years ago

sdtblck commented 3 years ago

Actually reopening this because the T5 positional encoding is a learned encoding inside the attention layers, not the same as a sinusoidal positional embedding.
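
For reference, a minimal sketch of what a T5-style relative position bias ("rpe") looks like. This is illustrative code based on the T5 paper, not the draft in this repo, and all names are made up:

```python
import math
import torch
import torch.nn as nn


class T5RelativePositionBias(nn.Module):
    """Minimal sketch of a T5-style relative position bias (illustrative only)."""

    def __init__(self, n_heads, num_buckets=32, max_distance=128, causal=True):
        super().__init__()
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        self.causal = causal
        # the "learned encoding": one scalar per (relative-position bucket, attention head)
        self.relative_attention_bias = nn.Embedding(num_buckets, n_heads)

    def _bucket(self, relative_position):
        # map signed relative positions to a small set of buckets: exact for short
        # distances, logarithmically coarser for long ones (as in the T5 paper)
        num_buckets = self.num_buckets
        ret = 0
        n = -relative_position
        if not self.causal:
            num_buckets //= 2
            ret = (n < 0).long() * num_buckets
            n = torch.abs(n)
        else:
            n = torch.max(n, torch.zeros_like(n))
        max_exact = num_buckets // 2
        is_small = n < max_exact
        val_if_large = max_exact + (
            torch.log(n.float().clamp(min=1) / max_exact)
            / math.log(self.max_distance / max_exact)
            * (num_buckets - max_exact)
        ).long()
        val_if_large = torch.min(val_if_large, torch.full_like(val_if_large, num_buckets - 1))
        return ret + torch.where(is_small, n, val_if_large)

    def forward(self, q_len, k_len):
        # returns a bias of shape (1, n_heads, q_len, k_len) that gets added to the
        # attention scores before the softmax
        context_position = torch.arange(q_len)[:, None]
        memory_position = torch.arange(k_len)[None, :]
        relative_position = memory_position - context_position   # (q_len, k_len)
        buckets = self._bucket(relative_position)                 # (q_len, k_len)
        values = self.relative_attention_bias(buckets)            # (q_len, k_len, n_heads)
        return values.permute(2, 0, 1).unsqueeze(0)
```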

MicPie commented 3 years ago

I pushed a first draft to: https://github.com/EleutherAI/megatron-3d/commit/25be7136821cefdd375fc458f078e0ea48ded7dd

MicPie commented 3 years ago

Just some documentation:

FusedScaleMaskSoftmax: https://github.com/EleutherAI/megatron-3d/blob/main/megatron/model/transformer.py#L211 (original function: https://github.com/EleutherAI/megatron-3d/blob/main/megatron/model/fused_softmax.py#L74)

Just added the rpe before the softmax.
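
Conceptually the dense path now looks like this (simplified sketch; the tensor names and the scale_mask_softmax call are illustrative, not the actual FusedScaleMaskSoftmax signature):

```python
# attention_scores: (batch, heads, q_len, k_len); rpe: (1, heads, q_len, k_len), broadcast over batch
attention_scores = attention_scores + rpe                               # add the learned relative position bias
attention_probs = scale_mask_softmax(attention_scores, attention_mask)  # then scale/mask/softmax as before
```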

SparseSelfAttention: https://github.com/EleutherAI/megatron-3d/blob/main/megatron/model/transformer.py#L206 (original function: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/ops/sparse_attention/sparse_self_attention.py)

This function can already take in the computed rpe.

The rpe is then handed over to https://github.com/microsoft/DeepSpeed/blob/ec8b1cb0a0a5752bba029da4bdc91616c0f5bec7/deepspeed/ops/sparse_attention/softmax.py#L219, passed on to https://github.com/microsoft/DeepSpeed/blob/ec8b1cb0a0a5752bba029da4bdc91616c0f5bec7/deepspeed/ops/sparse_attention/softmax.py#L17, and finally handled at https://github.com/microsoft/DeepSpeed/blob/ec8b1cb0a0a5752bba029da4bdc91616c0f5bec7/deepspeed/ops/sparse_attention/softmax.py#L126.

Tracking down the rpe kwarg in the sparse attention to check whether it really behaves like the T5 rpe: the handover from Python to Triton happens at https://github.com/microsoft/DeepSpeed/blob/ec8b1cb0a0a5752bba029da4bdc91616c0f5bec7/deepspeed/ops/sparse_attention/softmax.py#L99. On the Triton side, the rpe is added at https://github.com/microsoft/DeepSpeed/blob/ec8b1cb0a0a5752bba029da4bdc91616c0f5bec7/deepspeed/ops/sparse_attention/trsrc/softmax_fwd.tr#L117, which is before the softmax at https://github.com/microsoft/DeepSpeed/blob/ec8b1cb0a0a5752bba029da4bdc91616c0f5bec7/deepspeed/ops/sparse_attention/trsrc/softmax_fwd.tr#L128.
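
So on the sparse path the bias only needs to be passed through, roughly like this (sketch; only the rpe kwarg itself is taken from the linked DeepSpeed code, the other names and shapes are assumptions):

```python
# sparse_attn: a deepspeed.ops.sparse_attention.SparseSelfAttention module
# q, k, v: (batch, heads, seq_len, head_dim); rpe: the computed relative position bias
context_layer = sparse_attn(q, k, v, rpe=rpe)
```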

ParallelTransformer: changes so that every layer uses the same rpe module: https://github.com/EleutherAI/megatron-3d/commit/c14647d4844047d37b084c03817ac752403e8da7 and https://github.com/EleutherAI/megatron-3d/commit/ad7f755761130372ecccb8ccc6986aa6a55d1084, following the T5 publication: "For efficiency, we also share the position embedding parameters across all layers in our model, though within a given layer each attention head uses a different learned position embedding."

RelativePositionBias (= the T5 rpe) is instantiated with n_heads=self.num_attention_heads_per_partition.
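
A sketch of the sharing pattern (hypothetical layer class; only the idea of handing one module instance to every layer is taken from the commits above):

```python
import torch
import torch.nn as nn


class LayerWithSharedBias(nn.Module):
    """Illustrative layer that stores a reference to a bias module it did not create."""

    def __init__(self, shared_rpe: nn.Module):
        super().__init__()
        self.rpe = shared_rpe  # a reference, not a copy: parameters are shared across layers

    def forward(self, attention_scores: torch.Tensor) -> torch.Tensor:
        q_len, k_len = attention_scores.shape[-2:]
        return torch.softmax(attention_scores + self.rpe(q_len, k_len), dim=-1)


n_layers, n_heads = 4, 8
shared_rpe = T5RelativePositionBias(n_heads=n_heads)  # the sketch class from above
layers = nn.ModuleList(LayerWithSharedBias(shared_rpe) for _ in range(n_layers))

assert all(layer.rpe is shared_rpe for layer in layers)  # every layer sees the same parameters
```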

MicPie commented 3 years ago

Steps I needed to get the setup running:

For the conda env:

1.) conda create -n megatron-3d python=3.8
2.) conda install pytorch torchvision torchaudio cudatoolkit=10.1 -c pytorch
3.) pip install -r requirements.txt
4.) pip install einops
5.) Install apex based on https://github.com/NVIDIA/apex#linux:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

6.) conda install regex

Optional: you can then prepare the data with python prepare_data.py.

MicPie commented 3 years ago

Updated the rpe function so that it no longer takes the attention weights and just hands over the rpe matrix instead (see https://github.com/EleutherAI/megatron-3d/commit/be7821a23be4bf6f9bc402f5b61a89edc336a357).
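
In other words (sketch; variable names are illustrative), the module now only produces the bias matrix and the attention code decides how to consume it:

```python
rpe_matrix = rpe_module(q_len, k_len)                  # (1, heads, q_len, k_len), no attention weights involved
attention_scores = attention_scores + rpe_matrix       # dense path, right before the (fused) softmax
# or: context_layer = sparse_attn(q, k, v, rpe=rpe_matrix)   # sparse path, via the rpe kwarg
```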