[BUG] Enabling regularization causes CUDNN_STATUS_MAPPING_ERROR for deepfm example

klmentzer commented 8 months ago

Describe the bug

Enabling regularization causes CUDNN_STATUS_MAPPING_ERROR for the DeepFM example (it runs without problems when regularization is off). Also, passing the regularization parameter as the keyword argument lambda causes a syntax error, because lambda is a reserved word in Python (this can be avoided by passing **{"lambda": 1e-3} as an argument instead).
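For reference, the syntax error can be reproduced without HugeCTR at all, since it comes from the Python parser. The sketch below uses a hypothetical stand-in function, `loss_layer`, which is not part of any library and only echoes its keyword arguments:

```python
def loss_layer(**kwargs):
    """Hypothetical stand-in for a layer constructor; it only returns its keyword arguments."""
    return kwargs


# Writing the argument as lambda=1e-3 is rejected by the parser before any
# library code runs, because `lambda` is a reserved word in Python.
try:
    compile("loss_layer(lambda=1e-3)", "<example>", "eval")
    direct_call_is_valid = True
except SyntaxError:
    direct_call_is_valid = False

# Unpacking a dict avoids the problem: the name is only a string key here,
# so the parser never sees `lambda` as an identifier.
params = loss_layer(**{"lambda": 1e-3})

print(direct_call_is_valid)
print(params)
```

The same dict-unpacking pattern is what the `**{"lambda": 1e-3}` workaround relies on.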

To Reproduce

Steps to reproduce the behavior:

1. Follow the instructions for the DeepFM sample here
2. Add the keyword argument use_regularization=True to the hugectr.Layer_t.BinaryCrossEntropyLoss layer and run the code to generate CUDNN_STATUS_MAPPING_ERROR.
3. (just for syntax error) Specify the lambda regularization parameter and attempt to rerun.

Expected behavior

The model should train with regularization enabled, and the lambda keyword argument should not cause a syntax error.

Screenshots

=====================================================Model Fit=====================================================
[HCTR][00:16:49.881][INFO][RK0][main]: Use non-epoch mode with number of iterations: 2300
[HCTR][00:16:49.881][INFO][RK0][main]: Training batchsize: 16384, evaluation batchsize: 16384
[HCTR][00:16:49.881][INFO][RK0][main]: Evaluation interval: 1000, snapshot interval: 1000000
[HCTR][00:16:49.881][INFO][RK0][main]: Dense network trainable: True
[HCTR][00:16:49.881][INFO][RK0][main]: Sparse embedding sparse_embedding1 trainable: True
[HCTR][00:16:49.881][INFO][RK0][main]: Use mixed precision: False, scaler: 1.000000, use cuda graph: True
[HCTR][00:16:49.881][INFO][RK0][main]: lr: 0.001000, warmup_steps: 1, end_lr: 0.000000
[HCTR][00:16:49.881][INFO][RK0][main]: decay_start: 0, decay_steps: 1, decay_power: 2.000000
[HCTR][00:16:49.881][INFO][RK0][main]: Training source file: ./criteo_data/train/_file_list.txt
[HCTR][00:16:49.881][INFO][RK0][main]: Evaluation source file: ./criteo_data/val/_file_list.txt
terminate called after throwing an instance of 'HugeCTR::core23::RuntimeError'
  what():  Runtime error: CUDNN_STATUS_MAPPING_ERROR
        cudnnSetStream(cudnn_handle_, current_stream) (set_stream @ /hugectr/HugeCTR/include/gpu_resource.hpp:80)
[bf8877f31c66:585273] *** Process received signal ***
[bf8877f31c66:585273] Signal: Aborted (6)
[bf8877f31c66:585273] Signal code:  (-6)
[bf8877f31c66:585273] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f54ea3c5520]
[bf8877f31c66:585273] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f54ea4199fc]
[bf8877f31c66:585273] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f54ea3c5476]
[bf8877f31c66:585273] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f54ea3ab7f3]
[bf8877f31c66:585273] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f54e3257b9e]
[bf8877f31c66:585273] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f54e326320c]
[bf8877f31c66:585273] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7f54e32621e9]
[bf8877f31c66:585273] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7f54e3262959]
[bf8877f31c66:585273] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7f54e4225884]
[bf8877f31c66:585273] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311)[0x7f54e4225f41]
[bf8877f31c66:585273] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x3b)[0x7f54e32634cb]
[bf8877f31c66:585273] [11] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR11GPUResource10set_streamERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEi+0x345)[0x7f54e4dc33f5]
[bf8877f31c66:585273] [12] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR13StreamContextD1Ev+0x1b)[0x7f54e4dc367b]
[bf8877f31c66:585273] [13] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0x2736f3)[0x7f54e45986f3]
[bf8877f31c66:585273] [14] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0xa9bf69)[0x7f54e4dc0f69]
[bf8877f31c66:585273] [15] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR12GraphWrapper7captureESt8functionIFvP11CUstream_stEES3_+0x7b)[0x7f54e4a4452b]
[bf8877f31c66:585273] [16] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR17GraphScheduleable3runESt10shared_ptrINS_11GPUResourceEEb+0x1cc)[0x7f54e4dc0a5c]
[bf8877f31c66:585273] [17] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR8Pipeline9run_graphEv+0x10e)[0x7f54e4dc11ae]
[bf8877f31c66:585273] [18] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0xb025a8)[0x7f54e4e275a8]
[bf8877f31c66:585273] [19] /usr/lib/x86_64-linux-gnu/libgomp.so.1(GOMP_parallel+0x46)[0x7f54aadcaa16]
[bf8877f31c66:585273] [20] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR5Model5trainEv+0x14c)[0x7f54e4e2695c]
[bf8877f31c66:585273] [21] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR5Model3fitEiiiiiNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xb97)[0x7f54e4e3ce87]
[bf8877f31c66:585273] [22] /usr/local/hugectr/lib/hugectr.so(+0xdd164)[0x7f54e9f2d164]
[bf8877f31c66:585273] [23] /usr/local/hugectr/lib/hugectr.so(+0xa3644)[0x7f54e9ef3644]
[bf8877f31c66:585273] [24] python(+0x15a10e)[0x56453d33b10e]
[bf8877f31c66:585273] [25] python(_PyObject_MakeTpCall+0x25b)[0x56453d331a7b]
[bf8877f31c66:585273] [26] python(+0x168acb)[0x56453d349acb]
[bf8877f31c66:585273] [27] python(_PyEval_EvalFrameDefault+0x198c)[0x56453d32553c]
[bf8877f31c66:585273] [28] python(+0x13f9c6)[0x56453d3209c6]
[bf8877f31c66:585273] [29] python(PyEval_EvalCode+0x86)[0x56453d416256]
[bf8877f31c66:585273] *** End of error message ***
Aborted (core dumped)

Environment (please complete the following information):

OS: Ubuntu 22.04.2 LTS
Graphic card: NVIDIA H100 PCIe
CUDA version: 12.2
Docker image - https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-hugectr

Thanks for your help!

JacoCheung commented 6 months ago

Hi @klmentzer, thanks for trying this out. There is a bug when the regularizer is used together with solver.use_cuda_graph=True. We will fix the bug in the upcoming release. Could you please disable CUDA graph as a workaround (WAR) in the meantime?
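A minimal sketch of what the suggested workaround might look like in the training script, assuming the solver is created with `hugectr.CreateSolver` as in the DeepFM sample. The batch sizes and learning rate are copied from the log above; all other solver arguments are left at their defaults, which may not match the sample exactly:

```python
# Solver settings taken from the log in this issue; only use_cuda_graph differs.
solver_kwargs = dict(
    batchsize=16384,
    batchsize_eval=16384,
    lr=0.001,
    use_cuda_graph=False,  # workaround: avoid combining the regularizer with CUDA graph
)

try:
    import hugectr  # normally only available inside the merlin-hugectr container
except ImportError:
    hugectr = None

if hugectr is not None:
    solver = hugectr.CreateSolver(**solver_kwargs)
```

With this change, the "use cuda graph" entry in the Model Fit log should read False instead of True.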

Abatpool commented 3 months ago

Is there any solution to this? I am getting the same error when trying to run the DLRM training v3.1 benchmark on a DGX H100. I have tried versions of Nvidia-Merlin/HugeCTR after v23.08.00, such as v23.09.00, and the latest one too, but the same error persists. Can you please tell me how to fix it? @JacoCheung

JacoCheung commented 3 months ago

Hi @Abatpool , have you tried turning cuda_graph off?

Abatpool commented 3 months ago

> Hi @Abatpool , have you tried turning cuda_graph off?

I did set it to false and used Nvidia-Merlin/HugeCTR v24.04.00 (verified release), but I am still facing the same error, as shown in the screenshot attached below.

NVIDIA-Merlin / HugeCTR

[BUG] Enabling regularization causes CUDNN_STATUS_MAPPING_ERROR for deepfm example #445