Open haitwang-cloud opened 5 months ago
@wawa0210 @archlitchi PTAL
You can use the environment variable LIBCUDA_LOG_LEVEL to increase the logging level of HAMi-core and obtain more context.
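For example, a minimal way to raise the log level is to export it before launching the workload; the injection point (shell export vs. pod env) and the training command below are illustrative, not taken from this thread:

# Raise HAMi-core's log verbosity for this process tree
export LIBCUDA_LOG_LEVEL=4
# Capture the run so the log can be grepped afterwards (workload command is illustrative)
python train.py config/train_shakespeare_char.py 2>&1 | tee output.txt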
Appending the log after setting LIBCUDA_LOG_LEVEL to 4:
(base) (⎈|N/A:N/A)➜ cat output.txt | grep -i error
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlErrorString:2
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceClearEccErrorCounts:10
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceGetDetailedEccErrors:38
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceGetMemoryErrorCounter:67
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceGetNvLinkErrorCounter:75
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceGetTotalEccErrors:108
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceResetNvLinkErrorCounters:125
[HAMI-core Debug(492:140563747359616:libvgpu.c:79)]: into dlsym cuGetErrorString
[HAMI-core Debug(492:140563747359616:libvgpu.c:79)]: into dlsym cuGetErrorName
[HAMI-core Debug(492:140563747359616:libvgpu.c:79)]: into dlsym cuGetErrorString
[HAMI-core Debug(492:140563747359616:libvgpu.c:79)]: into dlsym cuGetErrorName
[HAMI-core Info(492:140563747359616:hook.c:343)]: into cuGetProcAddress_v2 symbol=cuGetErrorString:6000
[HAMI-core Info(492:140563747359616:hook.c:343)]: into cuGetProcAddress_v2 symbol=cuGetErrorName:6000
[HAMI-core Info(492:140563747359616:hook.c:343)]: into cuGetProcAddress_v2 symbol=cuGetErrorString:6000
[HAMI-core Info(492:140563747359616:hook.c:343)]: into cuGetProcAddress_v2 symbol=cuGetErrorName:6000
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
RuntimeError: Failed to find C compiler. Please specify via CC environment variable.
torch._dynamo.config.suppress_errors = True
This looks more like a container environment issue: the traceback reports a missing C compiler rather than a failure inside HAMi-core.
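As a quick sanity check, one could verify whether a compiler is visible inside the workload container; the pod name placeholder and the gcc path below are assumptions:

# Check whether torch.compile can find a C compiler inside the container
kubectl exec -it <nanogpt-pod> -- sh -c 'command -v gcc || command -v cc || echo "no C compiler found"'
# If a compiler exists at a non-default path, point torch at it explicitly inside the container, e.g.:
export CC=/usr/bin/gcc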
Today, I had an offline debugging session with @archlitchi. Despite setting CUDA_DISABLE_CONTROL to true and removing ld.so.preload from the GPU node, the issue persisted. We suspect this is because HAMi is using the v1.4.0 NVIDIA device plugin, which may be why nanoGPT cannot run. We need to install a clean NVIDIA device plugin v1.4.0 to confirm this. If our suspicion is correct, we may need to upgrade the NVIDIA device plugin in HAMi.
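For reference, a rough sketch of the two overrides tried in that session, assuming the workload runs as a Deployment; the workload name and the exact preload path are assumptions:

# Turn off HAMi-core's resource control for the workload via the env var mentioned above
kubectl set env deployment/<nanogpt-workload> CUDA_DISABLE_CONTROL=true
# On the GPU node, move the preload entry aside so the hook library is no longer injected
sudo mv /etc/ld.so.preload /etc/ld.so.preload.bak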
Confirmed that the issue also occurs with k8s-device-plugin version 0.14.0. We should therefore update k8s-device-plugin to at least version 0.14.5.
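If we go that route, the upgrade itself might look roughly like the following; the release name and namespace are assumptions, only the chart repository URL is the upstream one:

# Add/refresh the official device plugin chart repo
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
# Move the device plugin release to a version >= 0.14.5
helm upgrade --install nvdp nvdp/nvidia-device-plugin --namespace kube-system --version 0.14.5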
1. Issue or feature description
An issue has been identified when trying to run https://github.com/karpathy/nanoGPT with the HAMi framework: the run currently fails. However, when the same code is run using the official https://github.com/NVIDIA/k8s-device-plugin, it works smoothly. This inconsistency may be attributable to HAMi's use of CUDA hijacking (ref #46). A closer examination of HAMi-core's functionality or configuration may be needed to pinpoint the problem.
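For context on the hijacking mechanism: HAMi preloads its hook library into the workload container so CUDA/NVML symbol lookups pass through HAMi-core first, which is where the "into dlsym ..." and "cuGetProcAddress_v2" lines in the log above come from. A quick way to confirm the interception is active (the pod name placeholder and libvgpu.so path are assumptions; only the preload file itself is referenced elsewhere in this thread):

# Inside the workload container, the preload file should list HAMi's hook library
kubectl exec -it <nanogpt-pod> -- cat /etc/ld.so.preload
# The listed libvgpu.so should exist and be readable (exact path varies by install)
kubectl exec -it <nanogpt-pod> -- ls -l /usr/local/vgpu/libvgpu.so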
Related GPU Environments
2. Steps to reproduce the issue
Follow the quick start in https://github.com/karpathy/nanoGPT?tab=readme-ov-file#quick-start (summarized below).
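Roughly, the character-level Shakespeare quick start boils down to the following, paraphrased from the nanoGPT README; the dependency list may drift with upstream:

# Install nanoGPT's dependencies
pip install torch numpy transformers datasets tiktoken wandb tqdm
# Prepare the tiny Shakespeare dataset and start single-GPU training
python data/shakespeare_char/prepare.py
python train.py config/train_shakespeare_char.py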
3. Information to attach (optional if deemed irrelevant)
4. Detailed error
nvidia-smi -a