ROCm / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

Enable multihead atten #56

Closed hubertlu-tw closed 2 years ago

hubertlu-tw commented 2 years ago

Installation:

```shell
python setup.py install --cpp_ext --cuda_ext --distributed_adam --xentropy --deprecated_fused_adam --fast_multihead_attn 2>&1 | tee ../apex_build_mha.log
```
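After a build like the one above, it can be useful to check which compiled extension modules are actually importable. This is a minimal sketch; the module names below are assumptions inferred from the build flags (`--fast_multihead_attn`, `--xentropy`, `--distributed_adam`), not verified against `setup.py`:

```python
import importlib.util

# Candidate extension module names (assumed from the build flags above).
ext_modules = ["fast_multihead_attn", "xentropy_cuda", "distributed_adam_cuda"]

report = {}
for name in ext_modules:
    # find_spec returns None when the module is not importable,
    # without actually importing (and thus initializing) it.
    report[name] = importlib.util.find_spec(name) is not None

for name, built in report.items():
    print(f"{name}: {'built' if built else 'not found'}")
```

Running this in the build environment gives a quick sanity check before launching the full unit-test suites.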

Unit tests

```shell
cd tests/L0/ && bash run_rocm.sh 2>&1 | tee ../../apex_unittests.txt
cd tests/distributed/ && bash run_rocm_distributed.sh 2>&1 | tee ../../apex_distributed_unittests.txt
```

Note that I built apex on an MI200 server and confirmed that no new failing unit tests are introduced.

Unit tests for the extension

(https://github.com/ROCmSoftwarePlatform/apex/tree/dev/hubertlu/multihead_attn/apex/contrib/test/multihead_attn)

| Test | CUDA | ROCm |
| -- | -- | -- |
| test_encdec_multihead_attn.py | PASS | PASS |
| test_encdec_multihead_attn_norm_add.py | PASS | PASS |
| test_fast_self_multihead_attn_bias.py | PASS | **FAILED** |
| test_mha_fused_softmax.py | PASS | PASS |
| test_self_multihead_attn.py | PASS | PASS |
| test_self_multihead_attn_norm_add.py | FAILED | FAILED |

Note that the feature exercised by the failing test in test_fast_self_multihead_attn_bias.py on ROCm is not used by the MLPerf team; we will need to root-cause it later. The failing test in test_self_multihead_attn_norm_add.py is due to a missing required positional argument in the upstream (NVIDIA) test script.
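For context on what a test like test_mha_fused_softmax.py checks, the fused kernel's output is compared against a plain softmax reference. A minimal, stdlib-only sketch of such a reference (not the actual test code) looks like:

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating
    # so large logits do not overflow math.exp.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# The fused-kernel test would compare kernel output against a reference
# like this, elementwise, within a floating-point tolerance.
probs = softmax([1.0, 2.0, 3.0])
print([round(p, 4) for p in probs])
```

In the actual unit tests the comparison is done with tolerance-based tensor checks rather than exact equality, since the fused kernel accumulates in a different order.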

Lastly, the current CI checks do not run the unit tests for extensions (such as groupbn, layer_norm, multihead_attn, and test_label_smoothing.py). We will need to add them to our CI checks later.
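As a sketch of what such a CI hook could look like, standard unittest discovery can collect the contrib extension tests the same way the run_rocm.sh scripts drive the L0 suites. The temp-directory test file below is a hypothetical stand-in for apex/contrib/test/multihead_attn, since the real tests need the compiled extensions:

```python
import os
import tempfile
import textwrap
import unittest

# Create a throwaway test file standing in for the contrib test directory.
test_dir = tempfile.mkdtemp()
with open(os.path.join(test_dir, "test_example.py"), "w") as f:
    f.write(textwrap.dedent("""\
        import unittest

        class SmokeTest(unittest.TestCase):
            def test_ok(self):
                self.assertTrue(True)
        """))

# Discover and run everything matching test_*.py under the directory,
# which is the pattern a CI job could apply to apex/contrib/test/.
suite = unittest.defaultTestLoader.discover(test_dir, pattern="test_*.py")
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("tests run:", result.testsRun, "failures:", len(result.failures))
```

A CI step would point the discovery at each extension's test directory and fail the job when `result.wasSuccessful()` is false.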