ROCm / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

test_mlp benchmark got accuracy assert error #134

Open ZJLi2013 opened 2 months ago

ZJLi2013 commented 2 months ago

Hi team,

I tried to run the MLP test as a benchmark with the following setup:

Environment setup

GPU: MI308
rocm: 6.1.2.60102-119~20.04
pytorch: 2.4.0.dev20240501+rocm6.1
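
For anyone comparing against their own machine, the versions above can be collected with a short script. This is only a sketch: `collect_versions` is a hypothetical helper name, and it assumes PyTorch may or may not be installed. `torch.version.hip` is `None` on non-ROCm builds.

```python
import platform


def collect_versions():
    """Gather the interpreter, PyTorch, and HIP versions, plus the GPU name."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is nothing more to report.
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    # Set to the HIP version string on ROCm builds, None otherwise.
    info["hip"] = getattr(torch.version, "hip", None)
    # ROCm devices are exposed through the torch.cuda namespace.
    info["gpu"] = torch.cuda.get_device_name(0) if torch.cuda.is_available() else None
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```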

Steps to reproduce

cd apex/
pip install -r requirements.txt
python3 setup.py install
cd tests/L0/run_mlp
python3 test_mlp.py

Accuracy errors


FAIL: test_no_bias (__main__.TestMLP)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/apex-1.3.0-py3.9-linux-x86_64.egg/apex/testing/common_utils.py", line 32, in wrapper
    fn(*args, **kwargs)
  File "/workspace/apex/tests/L0/run_mlp/test_mlp.py", line 77, in test_no_bias
    np.testing.assert_allclose(
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-05, atol=1e-07

Mismatched elements: 2 / 1024 (0.195%)
Max absolute difference: 1.8835999e-07
Max relative difference: 0.00286722
 x: array([[ 0.027259],
       [-0.054091],
       [-0.003985],...
 y: array([[ 0.027259],
       [-0.054091],
       [-0.003985],...

FAIL: test_no_grad (__main__.TestMLP)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/apex-1.3.0-py3.9-linux-x86_64.egg/apex/testing/common_utils.py", line 32, in wrapper
    fn(*args, **kwargs)
  File "/workspace/apex/tests/L0/run_mlp/test_mlp.py", line 163, in test_no_grad
    np.testing.assert_allclose(
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-05, atol=1e-07

Mismatched elements: 140028 / 491520 (28.5%)
Max absolute difference: 7.2151306e-07
Max relative difference: 951.6179
 x: array([[-2.597046e-05,  4.191594e-06, -6.009603e-05, ...,  2.606537e-04,
         6.171300e-05, -6.382344e-05],
       [ 1.673573e-05, -5.885254e-05, -8.349993e-05, ..., -7.531334e-05,...
 y: array([[-2.654276e-05,  3.729273e-06, -6.010690e-05, ...,  2.608544e-04,
         6.124897e-05, -6.427429e-05],
       [ 1.673585e-05, -5.885251e-05, -8.349984e-05, ..., -7.531326e-05,...

FAIL: test_with_bias (__main__.TestMLP)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/apex-1.3.0-py3.9-linux-x86_64.egg/apex/testing/common_utils.py", line 32, in wrapper
    fn(*args, **kwargs)
  File "/workspace/apex/tests/L0/run_mlp/test_mlp.py", line 116, in test_with_bias
    np.testing.assert_allclose(
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-05, atol=1e-07

Mismatched elements: 3 / 1024 (0.293%)
Max absolute difference: 1.899898e-07
Max relative difference: 0.00063155
 x: array([[-0.128916],
       [-0.052111],
       [ 0.001069],...
 y: array([[-0.128916],
       [-0.052111],
       [ 0.001069],...

----------------------------------------------------------------------
Ran 6 tests in 16.497s

FAILED (failures=3)
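
For context on the failures above, here is a minimal sketch of the element-wise check that `np.testing.assert_allclose` documents, `|actual - desired| <= atol + rtol * |desired|`, using the tolerances from the error message. The sample values are illustrative only and are not taken from the test tensors; `within_tolerance` is a hypothetical helper name.

```python
import numpy as np

RTOL, ATOL = 1e-05, 1e-07  # tolerances reported in the failure message


def within_tolerance(actual, desired, rtol=RTOL, atol=ATOL):
    """Element-wise form of the check documented for assert_allclose."""
    actual = np.asarray(actual, dtype=np.float64)
    desired = np.asarray(desired, dtype=np.float64)
    return np.abs(actual - desired) <= atol + rtol * np.abs(desired)


# Illustrative values: one "large" element and two elements close to zero.
desired = np.array([0.027259, 6.5e-05, 1.0e-09])
actual = desired + 1.9e-07  # same absolute error on every element

ok = within_tolerance(actual, desired)
rel = np.abs(actual - desired) / np.abs(desired)

for d, passed, r in zip(desired, ok, rel):
    print(f"desired={d:.3e}  passes={passed}  relative_diff={r:.3e}")
```

With these sample numbers, the same absolute error of about 1.9e-07 passes for the larger element but fails for the two near-zero ones, and the relative difference grows very large as `desired` approaches zero. That is at least consistent with the pattern in the log (absolute differences below 1e-06 alongside large relative differences), though it does not by itself show whether the mismatch is expected on this hardware.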

Is a specific PyTorch/ROCm version required for this benchmark?

Many thanks,
David