NVIDIA-Merlin / HugeCTR

HugeCTR is a high-efficiency GPU framework designed for Click-Through Rate (CTR) estimation training
Apache License 2.0

[Question]loss_test not stable, sometimes some cases will fail #394

Closed heroes999 closed 1 year ago

heroes999 commented 1 year ago

I downloaded the docker image nvcr.io/nvidia/merlin/merlin-hugectr:22.11 and ran some of the unit test cases. I find that loss_test is not stable; 1-3 cases fail randomly. I just want to make sure it is not an issue with my own environment.

My environment: NVIDIA-SMI 510.47.03, Driver Version: 510.47.03, CUDA Version: 11.7, A100 80G, HugeCTR: 22.11

execution cmd: ./loss_test

kanghui0204 commented 1 year ago

Hi @heroes999, we think the loss_test unit test is stable and works well. Could you post the log from your failing run here?

heroes999 commented 1 year ago

Of course, @kanghui0204:

root@692e9a28f38e:/opt/tritonserver# loss_test
Running main() from /hugectr/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from loss_test
[ RUN      ] loss_test.CrossEntropyLoss_2048_row_major
[HCTR][06:16:39.594][INFO][RK0][main]: gpu_tmp(0.000000) - cpu_p(0.716438) >= 1e-5 when i = 0
/hugectr/test/utest/loss/loss_test.cpp:107: Failure
Expected equality of these values:
  true
  cpu_gpu_cmp(&cpu_loss, d_loss, 1)
    Which is: false
 CSE Loss calulation failed

[  FAILED  ] loss_test.CrossEntropyLoss_2048_row_major (11814 ms)
[ RUN      ] loss_test.CrossEntropyLoss_64_row_major
[HCTR][06:16:39.611][INFO][RK0][main]: gpu_tmp(0.000000) - cpu_p(0.667688) >= 1e-5 when i = 0
/hugectr/test/utest/loss/loss_test.cpp:107: Failure
Expected equality of these values:
  true
  cpu_gpu_cmp(&cpu_loss, d_loss, 1)
    Which is: false
 CSE Loss calulation failed

[  FAILED  ] loss_test.CrossEntropyLoss_64_row_major (15 ms)
[ RUN      ] loss_test.BinaryCrossEntropyLoss_2048
[HCTR][06:16:39.632][INFO][RK0][main]: gpu_tmp(0.000000) - cpu_p(0.733209) >= 1e-5 when i = 0
/hugectr/test/utest/loss/loss_test.cpp:181: Failure
Expected equality of these values:
  true
  cpu_gpu_cmp(&cpu_loss, d_loss, 1)
    Which is: false
 CSE Loss calulation failed

[  FAILED  ] loss_test.BinaryCrossEntropyLoss_2048 (21 ms)
[ RUN      ] loss_test.BinaryCrossEntropyLoss_64
[       OK ] loss_test.BinaryCrossEntropyLoss_64 (15 ms)
[ RUN      ] loss_test.MultiCrossEntropyLoss11_1024
[HCTR][06:16:39.669][INFO][RK0][main]: gpu_tmp(0.000046) - cpu_p(-0.000043) >= 1e-5 when i = 3617
/hugectr/test/utest/loss/multi_cross_entropy_loss_test.cpp:99: Failure
Expected equality of these values:
  true
  cpu_gpu_cmp(h_input.get(), d_input, batch_size * label_dim)
    Which is: false
 CSE Gradient calulation failed

[  FAILED  ] loss_test.MultiCrossEntropyLoss11_1024 (27 ms)
[----------] 5 tests from loss_test (11892 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (11892 ms total)
[  PASSED  ] 1 test.
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] loss_test.CrossEntropyLoss_2048_row_major
[  FAILED  ] loss_test.CrossEntropyLoss_64_row_major
[  FAILED  ] loss_test.BinaryCrossEntropyLoss_2048
[  FAILED  ] loss_test.MultiCrossEntropyLoss11_1024

 4 FAILED TESTS
[HCTR][06:16:39.728][INFO][RK0][main]: MPI finalization done.
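
For context on what the failing assertion checks: cpu_gpu_cmp(&cpu_loss, d_loss, 1) copies the GPU result back to the host and compares it element-wise against a CPU reference with an absolute tolerance of 1e-5, which is why the log prints "gpu_tmp(0.000000) - cpu_p(0.716438) >= 1e-5 when i = 0". Below is a minimal sketch of that kind of comparison helper; the function name, signature, and the 1e-5 default are assumptions made for illustration and are not the actual HugeCTR test code.

// Minimal sketch of a tolerance-based CPU/GPU comparison, similar in spirit
// to the cpu_gpu_cmp() check that fails in the log above. Names and the 1e-5
// threshold mirror the log output; this is NOT the actual HugeCTR helper.
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Copy `count` floats from device memory and compare them against a CPU
// reference with an absolute tolerance. Returns false on the first mismatch.
bool cpu_gpu_cmp_sketch(const float* cpu_ref, const float* d_gpu, int count,
                        float eps = 1e-5f) {
  std::vector<float> gpu_host(count);
  cudaMemcpy(gpu_host.data(), d_gpu, count * sizeof(float),
             cudaMemcpyDeviceToHost);
  for (int i = 0; i < count; ++i) {
    if (std::fabs(gpu_host[i] - cpu_ref[i]) >= eps) {
      std::printf("gpu_tmp(%f) - cpu_p(%f) >= %g when i = %d\n",
                  gpu_host[i], cpu_ref[i], eps, i);
      return false;
    }
  }
  return true;
}

Under a check like this, a GPU value of exactly 0.000000 against a CPU reference of ~0.72 may indicate the kernel output was never produced or read back correctly, rather than a small numerical drift, which would be consistent with an environment or driver problem.
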
kanghui0204 commented 1 year ago

Hi @heroes999, sorry, we don't know how to debug from this log alone, so I suggest two ways to check whether the error still exists:

1. I have tried 22.11 on our machine and it works well. Here are our machine's driver and CUDA versions; if you can update to them, check whether the problem still exists: NVIDIA-SMI 515.65.01, Driver Version: 515.65.01, CUDA Version: 12.0.
2. If you can't update the base software easily, you can use the newest container and the newest code and check whether the problem still exists.

heroes999 commented 1 year ago

OK, I will give it a try. Btw, does 22.11 correspond to the HugeCTR v4.2 tag?

JacoCheung commented 1 year ago

does 22.11 correspond to hugeCTR v4.2 tag?

I think so. @heroes999 You can confirm here.

heroes999 commented 1 year ago

It turns out that under another A100 environment, loss_test passes.