NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0

Fix noisy CUDA shutdown #8

Closed: ryantwolf closed this issue 6 months ago

ryantwolf commented 6 months ago

When scripts finish successfully, several errors occasionally appear during shutdown. Some are short, like:

corrupted size vs. prev_size while consolidating
Aborted

Others are larger, like this backtrace, which is printed repeatedly:

==== backtrace (tid: 402204) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000072c54 ucs_topo_cleanup()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/ucs/sys/topo/base/topo.c:604
 2 0x00000000000429ff ucs_cleanup()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/ucs/sys/init.c:128
 3 0x000000000000624e __nptl_change_stack_perm()  ???:0
 4 0x0000000000045495 secure_getenv()  ???:0
 5 0x0000000000045610 exit()  ???:0
 6 0x00000000002755fb Py_Exit()  ???:0
 7 0x0000000000262b6f PyGC_Collect()  ???:0
 8 0x000000000026291d PyErr_PrintEx()  ???:0
 9 0x0000000000252992 PyRun_SimpleStringFlags()  ???:0
10 0x0000000000251b15 Py_RunMain()  ???:0
11 0x000000000022802d Py_BytesMain()  ???:0
12 0x0000000000029d90 __libc_init_first()  ???:0
13 0x0000000000029e40 __libc_start_main()  ???:0
14 0x0000000000227f25 _start()  ???:0
=================================

These started appearing after the GPU deduplication refactor, so I suspect something in there is causing them.
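
For anyone trying to reproduce this, here is a minimal sketch of a checker that runs a script in a subprocess and reports whether its stderr contains either of the messages quoted above. It only uses the Python standard library. The names `SHUTDOWN_NOISE_MARKERS`, `find_shutdown_noise`, and `run_and_check` are hypothetical helpers made up for illustration, not part of NeMo-Curator, and the marker strings are just substrings copied from the output in this issue.

```python
import subprocess

# Substrings copied from the errors quoted in this issue (assumed to be
# stable enough to match on; adjust if the messages differ on your system).
SHUTDOWN_NOISE_MARKERS = (
    "corrupted size vs. prev_size",
    "==== backtrace",
)


def find_shutdown_noise(stderr_text):
    """Return the stderr lines that contain one of the known markers."""
    return [
        line
        for line in stderr_text.splitlines()
        if any(marker in line for marker in SHUTDOWN_NOISE_MARKERS)
    ]


def run_and_check(argv):
    """Run a command and return (return code, noisy stderr lines)."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode, find_shutdown_noise(result.stderr)
```

Calling `run_and_check([sys.executable, "my_script.py"])` (with `my_script.py` standing in for whichever script shows the problem) a few times in a loop should help confirm how often the noise appears, since it is intermittent.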