ROCm / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
BSD 3-Clause "New" or "Revised" License

Enable --fast_layer_norm for ROCm #94

Open hubertlu-tw opened 1 year ago

hubertlu-tw commented 1 year ago

To run the --fast_layer_norm unit tests:

cd apex/contrib/test
pytest layer_norm/

For reference, the following results were obtained on a CUDA system (nvcr.io/nvidia/pytorch:22.08-py3 on an A100 node):

=========================================================================== test session starts ============================================================================
platform linux -- Python 3.8.13, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /apex_development/apex
plugins: cov-3.0.0, pythonpath-0.7.4, hypothesis-4.50.8
collected 4 items

layer_norm/test_fast_layer_norm.py E...                                                                                                                              [100%]

================================================================================== ERRORS ==================================================================================
_________________________________________________________________________ ERROR at setup of test_ __________________________________________________________________________
file /apex_development/apex/apex/contrib/test/layer_norm/test_fast_layer_norm.py, line 128
  def test_(S, B, hidden_size, itype, wtype, ctype=fp32):
E       fixture 'S' not found
>       available fixtures: cache, capfd, capfdbinary, caplog, capsys, capsysbinary, cov, doctest_namespace, monkeypatch, no_cover, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
>       use 'pytest --fixtures [testpath]' for help on them.

/apex_development/apex/apex/contrib/test/layer_norm/test_fast_layer_norm.py:128
========================================================================= short test summary info ==========================================================================
ERROR layer_norm/test_fast_layer_norm.py::test_
======================================================================== 3 passed, 1 error in 3.61s ========================================================================
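The single error in the log above looks like a collection problem rather than a numerical failure: under pytest's default rules, any function whose name starts with `test` is collected, and its positional parameters are then resolved as fixtures, which would explain `fixture 'S' not found`. One possible way to avoid it, sketched below with hypothetical names (`run_case`, `test_fp16_small`, `test_fp32_large`) that are not taken from the actual test file, is to give the shared helper a name pytest does not collect and call it from zero-argument tests:

```python
# Sketch only: all names and shapes here are hypothetical, not the code in
# test_fast_layer_norm.py.

def run_case(S, B, hidden_size, itype, wtype, ctype="fp32"):
    """Shared helper. Its name does not start with 'test', so pytest's
    default collection rules skip it and never look for fixtures S, B, ..."""
    # A real helper would run the forward/backward comparison here;
    # this placeholder only echoes the configuration it was given.
    return {
        "S": S,
        "B": B,
        "hidden_size": hidden_size,
        "itype": itype,
        "wtype": wtype,
        "ctype": ctype,
    }


def test_fp16_small():
    # Zero-argument test: pytest collects it and needs no fixtures.
    case = run_case(S=512, B=32, hidden_size=768, itype="fp16", wtype="fp16")
    assert case["ctype"] == "fp32"


def test_fp32_large():
    case = run_case(S=512, B=32, hidden_size=8192, itype="fp32", wtype="fp32")
    assert case["hidden_size"] == 8192
```

With this layout pytest would report only the zero-argument tests, and the parameterized helper would no longer show up as an error at setup.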

On ROCm, the tests fail the following two checks with NaN outputs:

print(f"dg: relerr={re_dg:.4e} mse={mse_dg:.4e}")
print(f"db: relerr={re_db:.4e} mse={mse_db:.4e}")

in test_fast_layer_norm.py when wtype or ctype is bf16.
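To illustrate why a NaN in the gradients makes these checks fail, here is a minimal sketch of a relative-error / MSE metric in plain Python. The function names `metrics` and `passes`, the sample values, and the exact formula (L2 norm of the difference over the L2 norm of the reference) are assumptions for illustration; the test file may compute these quantities differently.

```python
import math


def metrics(ref, out):
    """Return (relative error, mean squared error) for two equal-length sequences.

    Assumed definition: relerr = ||ref - out||_2 / ||ref||_2.
    """
    diff_sq = sum((r - o) ** 2 for r, o in zip(ref, out))
    ref_sq = sum(r ** 2 for r in ref)
    relerr = math.sqrt(diff_sq) / math.sqrt(ref_sq)
    mse = diff_sq / len(ref)
    return relerr, mse


def passes(relerr, tol):
    # Any comparison with NaN is False, so a NaN relerr can never pass.
    return relerr < tol


ref = [1.0, 2.0, 3.0]
good = [1.0, 2.0, 3.001]          # small numerical difference
bad = [1.0, float("nan"), 3.0]    # one NaN element poisons both metrics

re_good, mse_good = metrics(ref, good)
re_bad, mse_bad = metrics(ref, bad)

print(f"dg: relerr={re_good:.4e} mse={mse_good:.4e}")
print(f"dg: relerr={re_bad:.4e} mse={mse_bad:.4e}")
```

A single NaN element propagates through the sums, so both `relerr` and `mse` become NaN and the `relerr < tol` comparison evaluates to False regardless of the tolerance.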

However, when we skip the tests with a bf16 wtype or ctype, the remaining tests fail starting from hidden_size=8192 because relerr > tol. The unit test results are attached: apex_fastlayernorm_unittest.txt
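The skip described above could be expressed roughly as follows. This is a sketch under stated assumptions: dtypes are represented by plain strings, `cases` is a hypothetical helper, and the hidden sizes other than 8192 are made up for illustration.

```python
from itertools import product

DTYPES = ["fp32", "fp16", "bf16"]   # string labels standing in for torch dtypes
HIDDEN_SIZES = [768, 1024, 8192]    # 8192 is from the report; the others are hypothetical


def cases(skip_bf16=True):
    """Yield (hidden_size, itype, wtype, ctype) tuples, optionally dropping
    every combination whose weight or compute type is bf16."""
    for hidden_size, itype, wtype, ctype in product(HIDDEN_SIZES, DTYPES, DTYPES, DTYPES):
        if skip_bf16 and "bf16" in (wtype, ctype):
            continue
        yield hidden_size, itype, wtype, ctype


all_cases = list(cases(skip_bf16=False))
kept = list(cases(skip_bf16=True))
print(f"{len(kept)} of {len(all_cases)} cases kept")
```

Note that the filter only looks at wtype and ctype, so cases with a bf16 input type still run, matching the wording of the report.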

aspanday commented 1 year ago

Reopening

amathews-amd commented 1 year ago

What is this blocked by? cc: @jeffdaily @hubertlu-tw @sunway513 @dllehr-amd @aspanday

aspanday commented 1 year ago

Hi all,

I've updated the files with the necessary changes, and all fast_layer_norm tests now pass for both fwd and bwd. Please let me know if there are any concerns. Thanks!

Performance uplift can be found on the "fwd + bwd all" sheet [here](https://amdcloud.sharepoint.com/:x:/r/sites/MLSEPerfTeam/Shared%20Documents/Strategic%20Leadership%20Optimizations%20Team%20(SLOT)/Projects/FastLayerNorm/FastLayerNormBwd%20Ops%20breakdown.xlsx?d=w04019b500104408fbb5a8f4f37bdae5f&csf=1&web=1&e=aYfRlX).

Let me know if you need access to the above link.