Open · 401qingkong opened this issue 2 years ago
We have found the same error but don't know why yet; it won't affect model training.
On Mon, Aug 22, 2022 at 6:03 PM, 401qingkong wrote:
I installed from source at this commit: https://github.com/bytedance/lightseq/tree/714577d4ae8d5fa6bc6324f47c0641229a914590. When I test the ffn_bias_bwd case (the column_sum_reduce kernel) with the following command:
CUDA_VISIBLE_DEVICES=0 python3 tests/test_ls_kernels.py
it fails randomly with the half (float16) data type. The output is:
root@e55c98b55e40:CUDA_VISIBLE_DEVICES=0 python3 tests/test_ls_kernels.py
Using /root/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/lightseq_kernels/build.ninja...
Building extension module lightseq_kernels...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module lightseq_kernels...
Time to load lightseq_kernels op: 0.21497797966003418 seconds
Using /root/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/adam/build.ninja...
Building extension module adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module adam...
Time to load adam op: 0.19832324981689453 seconds
test_launch_ffn_bias_bwd, ntest [0], dtype [torch.float32]: (rows, cols): (2790, 2816) Run baseline... Run custom... Compare the results of custom and baseline... Test passed. Time of custom/baseline (ms): 0.058 / 0.088, speedup: 1.516
test_launch_ffn_bias_bwd, ntest [0], dtype [torch.float16]: (rows, cols): (6560, 3968) Run baseline... Run custom... Compare the results of custom and baseline... Test passed. Time of custom/baseline (ms): 0.099 / 0.378, speedup: 3.827
test_launch_ffn_bias_bwd, ntest [1], dtype [torch.float32]: (rows, cols): (1400, 4352) Run baseline... Run custom... Compare the results of custom and baseline... Test passed. Time of custom/baseline (ms): 0.045 / 0.083, speedup: 1.857
test_launch_ffn_bias_bwd, ntest [1], dtype [torch.float16]: (rows, cols): (2691, 21888) Run baseline... Run custom... Compare the results of custom and baseline... Test passed. Time of custom/baseline (ms): 0.258 / 0.765, speedup: 2.960
test_launch_ffn_bias_bwd, ntest [2], dtype [torch.float32]: (rows, cols): (3096, 1344) Run baseline... Run custom... Compare the results of custom and baseline... Test passed. Time of custom/baseline (ms): 0.036 / 0.071, speedup: 1.942
test_launch_ffn_bias_bwd, ntest [2], dtype [torch.float16]: (rows, cols): (7567, 7040) Run baseline... Run custom... Compare the results of custom and baseline... Test passed. Time of custom/baseline (ms): 0.206 / 0.699, speedup: 3.399
test_launch_ffn_bias_bwd, ntest [3], dtype [torch.float32]: (rows, cols): (6490, 3264) Run baseline... Run custom... Compare the results of custom and baseline... Test passed. Time of custom/baseline (ms): 0.112 / 0.142, speedup: 1.269
test_launch_ffn_bias_bwd, ntest [3], dtype [torch.float16]: (rows, cols): (8906, 7552) Run baseline... Run custom... Compare the results of custom and baseline... torch.allclose failed, use numpy.allclose to log detail. Unmatches in the 0-th tensor.
Not equal to tolerance rtol=1e-05, atol=1e-05
Mismatched elements: 1 / 7552 (0.0132%)
Max absolute difference: 4.
Max relative difference: 0.000897
x: array([4468., 4476., 4460., ..., 4480., 4452., 4420.], dtype=float16)
y: array([4468., 4476., 4460., ..., 4480., 4452., 4420.], dtype=float16)
Why does this error occur? And will it lead to model training failure?
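For context, the mismatch in the log is consistent with ordinary float16 rounding rather than a wrong result. Around 4468 the spacing between adjacent float16 values (one ulp) is 4.0, so the single reported difference of 4 is exactly one ulp, while the test's tolerance of atol=1e-05 plus rtol=1e-05 only allows about 0.045 at that magnitude. Because the custom CUDA kernel and the PyTorch baseline presumably accumulate the roughly 8906-row column sum in different orders, a one-ulp difference can appear or disappear from run to run, which matches the random failures. The snippet below is a minimal standalone sketch of that arithmetic; it uses only the tolerance values from the log above and is not the lightseq test code itself:

import torch

# One ulp of float16 in [4096, 8192): eps (2**-10) scaled by 4096 gives 4.0.
ulp = torch.finfo(torch.float16).eps * 4096
print(ulp)  # 4.0

x = torch.tensor([4468.0], dtype=torch.float16)
y = torch.tensor([4472.0], dtype=torch.float16)  # one representable step above x

# Allowed error is atol + rtol * |y| ~= 1e-5 + 1e-5 * 4472 ~= 0.045,
# far below one ulp, so a single 1-ulp mismatch fails the check.
print(torch.allclose(x, y, rtol=1e-5, atol=1e-5))  # False

# With a float16-appropriate relative tolerance, the same 1-ulp difference
# (relative error ~9e-4, cf. the 0.000897 in the log) is accepted.
print(torch.allclose(x, y, rtol=1e-3, atol=1e-5))  # True

In other words, the comparison tolerance is tighter than what float16 can represent at these magnitudes, so an occasional one-element, one-ulp mismatch is expected and should be harmless for training. Loosening rtol for the float16 case, or comparing against a float32 reference, would be one way to make the test stable; that is only a suggestion, not how the current test is written.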