Closed: heroes999 closed this issue 1 year ago.
Hi @heroes999, we believe the loss_test unit test results are stable and it works well on our side. Could you post the log from your failing job here?
Of course, @kanghui0204.
root@692e9a28f38e:/opt/tritonserver# loss_test
Running main() from /hugectr/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from loss_test
[ RUN ] loss_test.CrossEntropyLoss_2048_row_major
[HCTR][06:16:39.594][INFO][RK0][main]: gpu_tmp(0.000000) - cpu_p(0.716438) >= 1e-5 when i = 0
/hugectr/test/utest/loss/loss_test.cpp:107: Failure
Expected equality of these values:
true
cpu_gpu_cmp(&cpu_loss, d_loss, 1)
Which is: false
CSE Loss calulation failed
[ FAILED ] loss_test.CrossEntropyLoss_2048_row_major (11814 ms)
[ RUN ] loss_test.CrossEntropyLoss_64_row_major
[HCTR][06:16:39.611][INFO][RK0][main]: gpu_tmp(0.000000) - cpu_p(0.667688) >= 1e-5 when i = 0
/hugectr/test/utest/loss/loss_test.cpp:107: Failure
Expected equality of these values:
true
cpu_gpu_cmp(&cpu_loss, d_loss, 1)
Which is: false
CSE Loss calulation failed
[ FAILED ] loss_test.CrossEntropyLoss_64_row_major (15 ms)
[ RUN ] loss_test.BinaryCrossEntropyLoss_2048
[HCTR][06:16:39.632][INFO][RK0][main]: gpu_tmp(0.000000) - cpu_p(0.733209) >= 1e-5 when i = 0
/hugectr/test/utest/loss/loss_test.cpp:181: Failure
Expected equality of these values:
true
cpu_gpu_cmp(&cpu_loss, d_loss, 1)
Which is: false
CSE Loss calulation failed
[ FAILED ] loss_test.BinaryCrossEntropyLoss_2048 (21 ms)
[ RUN ] loss_test.BinaryCrossEntropyLoss_64
[ OK ] loss_test.BinaryCrossEntropyLoss_64 (15 ms)
[ RUN ] loss_test.MultiCrossEntropyLoss11_1024
[HCTR][06:16:39.669][INFO][RK0][main]: gpu_tmp(0.000046) - cpu_p(-0.000043) >= 1e-5 when i = 3617
/hugectr/test/utest/loss/multi_cross_entropy_loss_test.cpp:99: Failure
Expected equality of these values:
true
cpu_gpu_cmp(h_input.get(), d_input, batch_size * label_dim)
Which is: false
CSE Gradient calulation failed
[ FAILED ] loss_test.MultiCrossEntropyLoss11_1024 (27 ms)
[----------] 5 tests from loss_test (11892 ms total)
[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (11892 ms total)
[ PASSED ] 1 test.
[ FAILED ] 4 tests, listed below:
[ FAILED ] loss_test.CrossEntropyLoss_2048_row_major
[ FAILED ] loss_test.CrossEntropyLoss_64_row_major
[ FAILED ] loss_test.BinaryCrossEntropyLoss_2048
[ FAILED ] loss_test.MultiCrossEntropyLoss11_1024
4 FAILED TESTS
[HCTR][06:16:39.728][INFO][RK0][main]: MPI finalization done.
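For context, the failing assertions call cpu_gpu_cmp, which (judging from the messages above) copies the device result back to the host and compares it element-wise against a CPU reference with the 1e-5 tolerance shown in the log. A minimal sketch of that kind of check follows; it is illustrative only, not the actual HugeCTR test code, and only the function name, message format, and tolerance are taken from the log above.

// Sketch of an element-wise CPU/GPU comparison in the spirit of cpu_gpu_cmp.
// Assumes d_gpu points to len floats already computed on the device.
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

bool cpu_gpu_cmp_sketch(const float* cpu_ref, const float* d_gpu, int len,
                        float eps = 1e-5f) {
  std::vector<float> h_gpu(len);
  // Copy the device result back to the host before comparing.
  cudaMemcpy(h_gpu.data(), d_gpu, len * sizeof(float), cudaMemcpyDeviceToHost);
  for (int i = 0; i < len; ++i) {
    if (std::fabs(h_gpu[i] - cpu_ref[i]) >= eps) {
      std::printf("gpu_tmp(%f) - cpu_p(%f) >= %g when i = %d\n",
                  h_gpu[i], cpu_ref[i], eps, i);
      return false;  // a single mismatch fails the whole test
    }
  }
  return true;
}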
Hi @heroes999, sorry, we can't tell how to debug this from the log alone, so I suggest two ways to check whether the error still exists:
1. I have tried 22.11 on our machine and it works well. Here are our machine's driver and CUDA versions; if you can update to them, please check whether the problem still exists:
NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 12.0
2. If you can't easily update the base software, you can use the newest container and the newest code, and check whether the problem still exists.
OK, I will give it a try. Btw, does 22.11 correspond to the HugeCTR v4.2 tag?
does 22.11 correspond to the HugeCTR v4.2 tag?
I think so, @heroes999. You can confirm here.
It turns out that under another A100 environment, loss_test passes.
I downloaded the Docker image nvcr.io/nvidia/merlin/merlin-hugectr:22.11 and ran some unit test cases. I found that loss_test is not stable; 1-3 cases fail randomly. I just want to make sure it is not an issue with my own environment.
My environment: NVIDIA-SMI 510.47.03, Driver Version: 510.47.03, CUDA Version: 11.7, A100 80G, HugeCTR: 22.11
Execution command: ./loss_test