Open · 401qingkong opened this issue 2 years ago
We have found the same error but don't know why yet; it won't affect model training.
On Mon, Aug 22, 2022 at 6:03 PM, 401qingkong wrote:
I installed from source at this commit: https://github.com/bytedance/lightseq/tree/714577d4ae8d5fa6bc6324f47c0641229a914590. When I test the ffn_bias_bwd case (the column_sum_reduce kernel) with the following command:
CUDA_VISIBLE_DEVICES=0 python3 tests/test_ls_kernels.py
it fails randomly with the half (float16) data type. The output is:
root@e55c98b55e40:CUDA_VISIBLE_DEVICES=0 python3 tests/test_ls_kernels.py
Using /root/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/lightseq_kernels/build.ninja...
Building extension module lightseq_kernels...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module lightseq_kernels...
Time to load lightseq_kernels op: 0.21497797966003418 seconds
Using /root/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/adam/build.ninja...
Building extension module adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module adam...
Time to load adam op: 0.19832324981689453 seconds
test_launch_ffn_bias_bwd, ntest [0], dtype [torch.float32]: (rows, cols): (2790, 2816) Run baseline... Run custom... Compare the results of custom and baseline... Test passed. Time of custom/baseline (ms): 0.058 / 0.088, speedup: 1.516
test_launch_ffn_bias_bwd, ntest [0], dtype [torch.float16]: (rows, cols): (6560, 3968) Run baseline... Run custom... Compare the results of custom and baseline... Test passed. Time of custom/baseline (ms): 0.099 / 0.378, speedup: 3.827
test_launch_ffn_bias_bwd, ntest [1], dtype [torch.float32]: (rows, cols): (1400, 4352) Run baseline... Run custom... Compare the results of custom and baseline... Test passed. Time of custom/baseline (ms): 0.045 / 0.083, speedup: 1.857
test_launch_ffn_bias_bwd, ntest [1], dtype [torch.float16]: (rows, cols): (2691, 21888) Run baseline... Run custom... Compare the results of custom and baseline... Test passed. Time of custom/baseline (ms): 0.258 / 0.765, speedup: 2.960
test_launch_ffn_bias_bwd, ntest [2], dtype [torch.float32]: (rows, cols): (3096, 1344) Run baseline... Run custom... Compare the results of custom and baseline... Test passed. Time of custom/baseline (ms): 0.036 / 0.071, speedup: 1.942
test_launch_ffn_bias_bwd, ntest [2], dtype [torch.float16]: (rows, cols): (7567, 7040) Run baseline... Run custom... Compare the results of custom and baseline... Test passed. Time of custom/baseline (ms): 0.206 / 0.699, speedup: 3.399
test_launch_ffn_bias_bwd, ntest [3], dtype [torch.float32]: (rows, cols): (6490, 3264) Run baseline... Run custom... Compare the results of custom and baseline... Test passed. Time of custom/baseline (ms): 0.112 / 0.142, speedup: 1.269
test_launch_ffn_bias_bwd, ntest [3], dtype [torch.float16]: (rows, cols): (8906, 7552) Run baseline... Run custom... Compare the results of custom and baseline... torch.allclose failed, use numpy.allclose to log detail. Unmatches in the 0-th tensor.
Not equal to tolerance rtol=1e-05, atol=1e-05
Mismatched elements: 1 / 7552 (0.0132%)
Max absolute difference: 4.
Max relative difference: 0.000897
x: array([4468., 4476., 4460., ..., 4480., 4452., 4420.], dtype=float16)
y: array([4468., 4476., 4460., ..., 4480., 4452., 4420.], dtype=float16)
Why does this error occur? And will it lead to model training failure?
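For context, the mismatch in the log is consistent with ordinary float16 rounding rather than a wrong result. Around 4468 the spacing between adjacent float16 values (one ulp) is 4.0, so the single reported difference of 4 is exactly one ulp, while the test's tolerance of atol=1e-05 plus rtol=1e-05 only allows about 0.045 at that magnitude. Because the custom CUDA kernel and the PyTorch baseline presumably accumulate the roughly 8906-row column sum in different orders, a one-ulp difference can appear or disappear from run to run, which matches the random failures. The snippet below is a minimal standalone sketch of that arithmetic; it uses only the tolerance values from the log above and is not the lightseq test code itself:

import torch

# One ulp of float16 in [4096, 8192): eps (2**-10) scaled by 4096 gives 4.0.
ulp = torch.finfo(torch.float16).eps * 4096
print(ulp)  # 4.0

x = torch.tensor([4468.0], dtype=torch.float16)
y = torch.tensor([4472.0], dtype=torch.float16)  # one representable step above x

# Allowed error is atol + rtol * |y| ~= 1e-5 + 1e-5 * 4472 ~= 0.045,
# far below one ulp, so a single 1-ulp mismatch fails the check.
print(torch.allclose(x, y, rtol=1e-5, atol=1e-5))  # False

# With a float16-appropriate relative tolerance, the same 1-ulp difference
# (relative error ~9e-4, cf. the 0.000897 in the log) is accepted.
print(torch.allclose(x, y, rtol=1e-3, atol=1e-5))  # True

In other words, the comparison tolerance is tighter than what float16 can represent at these magnitudes, so an occasional one-element, one-ulp mismatch is expected and should be harmless for training. Loosening rtol for the float16 case, or comparing against a float32 reference, would be one way to make the test stable; that is only a suggestion, not how the current test is written.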