NVIDIA-Merlin / HugeCTR

HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training
Apache License 2.0

[Question] Running DCN on a single GPU leads to an illegal memory access #419

Open dusir opened 1 year ago

dusir commented 1 year ago

Environment: virtual machine; single-GPU training; the training code runs inside a container.

Key component versions: kernel: 5.4.119-19.0009.28; hugectr: 23.06; Driver Version: 535.54.03; CUDA Version: 12.2

Symptom:

root@5583dc65ca3a:/home/workspace/gq# python dcn_init.py
MpiInitService: MPI was already initialized by another (non-HugeCTR) mechanism.
[HCTR][13:48:01.955][INFO][RK0][main]: Empty embedding, trained table will be stored in /root/dcn_test/
HugeCTR Version: 23.6
====================================================Model Init=====================================================
[HCTR][13:48:01.956][INFO][RK0][main]: Initialize model: dcn_test
[HCTR][13:48:01.956][INFO][RK0][main]: Global seed is 3137833461
[HCTR][13:48:02.048][INFO][RK0][main]: Device to NUMA mapping: GPU 2 -> node 1
[HCTR][13:48:02.843][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][13:48:02.843][DEBUG][RK0][main]: [device 2] allocating 0.0000 GB, available 78.2526
[HCTR][13:48:02.843][INFO][RK0][main]: Start all2all warmup
[HCTR][13:48:02.845][INFO][RK0][main]: End all2all warmup
[HCTR][13:48:03.363][INFO][RK0][main]: Using All-reduce algorithm: NCCL
set_mempolicy: Operation not permitted
[HCTR][13:48:03.374][INFO][RK0][main]: Device 2: NVIDIA H800
[HCTR][13:48:03.386][INFO][RK0][main]: eval source /root/keyset_dir/eval.txt max_row_group_size 2565543
[HCTR][13:48:03.397][INFO][RK0][main]: train source /root/keyset_dir/0.txt max_row_group_size 2565543
[HCTR][13:48:03.408][INFO][RK0][main]: train source /root/keyset_dir/1.txt max_row_group_size 2565543
[HCTR][13:48:03.420][INFO][RK0][main]: train source /root/keyset_dir/2.txt max_row_group_size 2565543
[HCTR][13:48:03.420][INFO][RK0][main]: num of DataReader workers for train: 1
[HCTR][13:48:03.420][INFO][RK0][main]: num of DataReader workers for eval: 1
set_mempolicy: Operation not permitted
set_mempolicy: Operation not permitted
set_mempolicy: Operation not permitted
set_mempolicy: Operation not permitted

[HCTR][13:48:03.496][DEBUG][RK0][main]: [device 2] allocating 27.4671 GB, available 35.8756
[HCTR][13:48:03.496][INFO][RK0][main]: gpu sync_all_gpus gpu sync_all_gpus resource_manager:12
[HCTR][13:48:03.496][INFO][RK0][main]: gpu sync_all_gpus gpu count 1
[HCTR][13:48:03.496][INFO][RK0][main]: gpu sync_all_gpus local gpu 0x3220dd0
[HCTR][13:48:03.496][INFO][RK0][main]: set device start... 2
[HCTR][13:48:03.496][INFO][RK0][main]: set device done,device id is 2
[HCTR][13:48:03.496][INFO][RK0][main]: set device done.
[HCTR][13:48:03.496][INFO][RK0][main]: set device done,stream ptr: 0x4926810
[HCTR][13:48:03.501][INFO][RK0][main]: synchronize done.
[HCTR][13:48:03.502][INFO][RK0][main]: Graph analysis to resolve tensor dependency
[HCTR][13:48:03.502][INFO][RK0][main]: Add Slice layer for tensor: reshape1, creating 2 copies
[HCTR][13:48:03.502][WARNING][RK0][main]: using multi-cross v1
[HCTR][13:48:03.507][WARNING][RK0][main]: using multi-cross v1
===================================================Model Compile===================================================
[HCTR][13:49:03.167][INFO][RK0][main]: gpu0 start to init embedding
[HCTR][13:49:03.187][INFO][RK0][main]: gpu0 init embedding done
[HCTR][13:49:03.187][INFO][RK0][main]: Enable HMEM-Based Parameter Server
[HCTR][13:49:03.187][INFO][RK0][main]: /root/dcn_test/ not exist, create and train from scratch
[HCTR][13:49:15.625][DEBUG][RK0][main]: [device 2] allocating 1.0864 GB, available 26.7877
[HCTR][13:49:15.629][INFO][RK0][main]: Starting AUC NCCL warm-up
terminate called after throwing an instance of 'HugeCTR::core23::RuntimeError'
what(): Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(stream) at run_finalize_step (/home/HugeCTR/HugeCTR/src/metrics.cu:1814)
[5583dc65ca3a:00137] Process received signal
[5583dc65ca3a:00137] Signal: Aborted (6)
[5583dc65ca3a:00137] Signal code: (-6)
[5583dc65ca3a:00137] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fc6ff3f6090]
[5583dc65ca3a:00137] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fc6ff3f600b]
[5583dc65ca3a:00137] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fc6ff3d5859]
[5583dc65ca3a:00137] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7fc6f9b53911]
[5583dc65ca3a:00137] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7fc6f9b5f38c]
[5583dc65ca3a:00137] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa9369)[0x7fc6f9b5e369]
[5583dc65ca3a:00137] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2a1)[0x7fc6f9b5ed21]
[5583dc65ca3a:00137] [ 7] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10bef)[0x7fc6f9aaabef]
[5583dc65ca3a:00137] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12a)[0x7fc6f9aab5aa]
[5583dc65ca3a:00137] [ 9] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR7metrics3AUCIfE23finalize_metric_per_gpuEi+0x397)[0x7fc6fa947ad7]
[5583dc65ca3a:00137] [10] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0xbd4b12)[0x7fc6fa947b12]
[5583dc65ca3a:00137] [11] /usr/lib/x86_64-linux-gnu/libgomp.so.1(GOMP_parallel+0x46)[0x7fc6938698e6]
[5583dc65ca3a:00137] [12] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR7metrics3AUCIfE15finalize_metricEv+0x9b)[0x7fc6fa8faf4b]
[5583dc65ca3a:00137] [13] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR7metrics3AUCIfE7warm_upEm+0xb6)[0x7fc6fa8ff1b6]
[5583dc65ca3a:00137] [14] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR7metrics3AUCIfEC2EiiiRKSt10shared_ptrINS_15ResourceManagerEEb+0x74a)[0x7fc6fa9326da]
[5583dc65ca3a:00137] [15] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR7metrics6Metric6CreateENS0_4TypeEbiiiRKSt10shared_ptrINS_15ResourceManagerEEb+0x1b1)[0x7fc6fa8fabc1]
[5583dc65ca3a:00137] [16] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR5Model14create_metricsEv+0xc5)[0x7fc6faa513d5]
[5583dc65ca3a:00137] [17] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR5Model7compileEv+0x286)[0x7fc6faa55b36]
[5583dc65ca3a:00137] [18] /usr/local/hugectr/lib/hugectr.so(+0xce702)[0x7fc6fed3a702]
[5583dc65ca3a:00137] [19] /usr/local/hugectr/lib/hugectr.so(+0xd0f94)[0x7fc6fed3cf94]
[5583dc65ca3a:00137] [20] python(PyCFunction_Call+0x59)[0x5f6939]
[5583dc65ca3a:00137] [21] python(_PyObject_MakeTpCall+0x296)[0x5f7506]
[5583dc65ca3a:00137] [22] python[0x50b8d3]
[5583dc65ca3a:00137] [23] python(_PyEval_EvalFrameDefault+0x5796)[0x570556]
[5583dc65ca3a:00137] [24] python(_PyEval_EvalCodeWithName+0x26a)[0x5697da]
[5583dc65ca3a:00137] [25] python(PyEval_EvalCode+0x27)[0x68e547]
[5583dc65ca3a:00137] [26] python[0x67dbf1]
[5583dc65ca3a:00137] [27] python[0x67dc6f]
[5583dc65ca3a:00137] [28] python[0x67dd11]
[5583dc65ca3a:00137] [29] python(PyRun_SimpleFileExFlags+0x197)[0x67fe37]
[5583dc65ca3a:00137] End of error message
Aborted (core dumped)

Could this memory error reported by metrics be a bug?

minseokl commented 1 year ago

Hi @dusir, we can reproduce this issue on A100 as well. It is not related to the H800; it is a bug in our AUC implementation that occurs with a specific batch size. We are working on a fix. Thanks!