Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

[Bug] HAMi Framework Fails to Execute nanoGPT with CUDA, While NVIDIA k8s-device-plugin Succeeds #347

Open · haitwang-cloud opened 5 months ago

haitwang-cloud commented 5 months ago

1. Issue or feature description

When running https://github.com/karpathy/nanoGPT under the HAMi framework, training fails. The same code runs without problems under the official https://github.com/NVIDIA/k8s-device-plugin. The inconsistency may be related to HAMi's CUDA hijacking (Ref #46). A closer look at HAMi-core's behavior or configuration is probably needed to pinpoint the problem.

Training log and error output

Initializing a new model from scratch
number of parameters: 10.65M
num decayed parameter tensors: 26, with 10,740,096 parameters
num non-decayed parameter tensors: 13, with 4,992 parameters
using fused AdamW: True
compiling the model... (takes a ~minute)
Traceback (most recent call last):
  File "/home/jovyan/nanoGPT/train.py", line 264, in <module>
    losses = estimate_loss()
             ^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/nanoGPT/train.py", line 224, in estimate_loss
    logits, loss = model(X, Y)
                   ^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 641, in _convert_frame
    result = inner_convert(frame, cache_size, hooks, frame_state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 389, in _convert_frame_assert
    return _compile(
           ^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 569, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 491, in compile_inner
    out_code = transform_code_object(code, transform)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
    transformations(instructions, code_options)
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 458, in transform
    tracer.run()
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 2074, in run
    super().run()
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
    and self.step()
        ^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
    getattr(self, inst.opname)(inst)
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 2162, in RETURN_VALUE
    self.output.compile_subgraph(
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 857, in compile_subgraph
    self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
  File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 957, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 1024, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 1009, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/__init__.py", line 1568, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1150, in compile_fx
    return aot_autograd(
           ^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/backends/common.py", line 55, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 3891, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 3429, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 2212, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 2392, in aot_wrapper_synthetic_base
    return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 1573, in aot_dispatch_base
    compiled_fw = compiler(fw_module, flat_args)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1092, in fw_compiler_base
    return inner_compile(
           ^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/repro/after_aot.py", line 80, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/debug.py", line 228, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 54, in newFunction
    return old_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 341, in compile_fx_inner
    compiled_graph: CompiledFxGraph = fx_codegen_and_compile(
                                      ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 565, in fx_codegen_and_compile
    compiled_fn = graph.compile_to_fn()
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/graph.py", line 970, in compile_to_fn
    return self.compile_to_module().call
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/graph.py", line 941, in compile_to_module
    mod = PyCodeCache.load_by_key_path(key, path, linemap=linemap)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 1139, in load_by_key_path
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_jovyan/w7/cw7ravsc5anhkpigvbwronnhuedvnysdh7qebhz5f5ahxmyxvbhx.py", line 905, in <module>
    async_compile.wait(globals())
  File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 1418, in wait
    scope[key] = result.result()
                 ^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 1277, in result
    self.future.result()
  File "/opt/conda/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Failed to find C compiler. Please specify via CC environment variable.

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
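
For context, the final RuntimeError comes from torch.compile (the inductor backend) needing a host C compiler, not from the GPU stack itself. A quick sanity check inside the training container, shown here only as a sketch (--compile=False is the standard nanoGPT config override):

command -v gcc || command -v cc || echo "no C compiler on PATH"
export CC=/usr/bin/gcc                                            # if a compiler exists but is not picked up
python train.py config/train_shakespeare_char.py --compile=False  # or bypass torch.compile while debugging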

2. Steps to reproduce the issue

Follow the Quick Start in https://github.com/karpathy/nanoGPT?tab=readme-ov-file#quick-start; a sketch of the commands is below.
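
For completeness, the reproduction inside the pod is roughly the nanoGPT quick start (commands paraphrased from the nanoGPT README; the package list and config name may differ per version):

pip install torch numpy transformers datasets tiktoken wandb tqdm
python data/shakespeare_char/prepare.py
python train.py config/train_shakespeare_char.py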

3. Information to attach (optional if deemed irrelevant)

Detailed output of nvidia-smi -a:

(base) jovyan@nanogpt-0:~/nanoGPT$ nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Thu Jun  6 08:39:56 2024
Driver Version                            : 535.86.10
[HAMI-core Msg(571:140025334466368:libvgpu.c:836)]: Initializing.....
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:AF:00.0
    Product Name                          : Tesla V100-PCIE-16GB
    Product Brand                         : Tesla
    Product Architecture                  : Volta
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : N/A
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0320218190518
    GPU UUID                              : GPU-3a6ec8f0-24eb-1905-1f17-7bdb4e850ffa
    Minor Number                          : 1
    VBIOS Version                         : 88.00.1A.00.03
    MultiGPU Board                        : No
    Board ID                              : 0xaf00
    Board Part Number                     : 900-2G500-0100-030
    GPU Part Number                       : 1DB4-893-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G500.0200.00.03
        OEM Object                        : 1.1
        ECC Object                        : 5.0
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0xAF
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1DB410DE
        Bus Id                            : 00000000:AF:00.0
        Sub System Id                     : 0x121410DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
                Device Current            : 3
                Device Max                : 3
                Host Max                  : 3
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 16384 MiB
        Reserved                          : 232 MiB
        Used                              : 0 MiB
        Free                              : 13707 MiB
    BAR1 Memory Usage
        Total                             : 16384 MiB
        Used                              : 10 MiB
        Free                              : 16374 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : N/A
        OFA                               : N/A
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : 0
                Total                     : 0
        Aggregate
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : 0
                Total                     : 0
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 34 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 90 C
        GPU Slowdown Temp                 : 87 C
        GPU Max Operating Temp            : 83 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 32 C
        Memory Max Operating Temp         : 85 C
    GPU Power Readings
        Power Draw                        : 38.95 W
        Current Power Limit               : 250.00 W
        Requested Power Limit             : 250.00 W
        Default Power Limit               : 250.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 250.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 1245 MHz
        SM                                : 1245 MHz
        Memory                            : 877 MHz
        Video                             : 1117 MHz
    Applications Clocks
        Graphics                          : 1245 MHz
        Memory                            : 877 MHz
    Default Applications Clocks
        Graphics                          : 1245 MHz
        Memory                            : 877 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1380 MHz
        SM                                : 1380 MHz
        Memory                            : 877 MHz
        Video                             : 1237 MHz
    Max Customer Boost Clocks
        Graphics                          : 1380 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2216157
            Type                          : C
            Name                          : 
            Used GPU Memory               : 426 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2664335
            Type                          : C
            Name                          : 
            Used GPU Memory               : 686 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 3189308
            Type                          : C
            Name                          : 
            Used GPU Memory               : 1328 MiB

[HAMI-core Msg(571:140025334466368:multiprocess_memory_limit.c:468)]: Calling exit handler 571
haitwang-cloud commented 5 months ago

@wawa0210 @archlitchi PTAL

coldzerofear commented 5 months ago

You can use the environment variable LIBCUDA_LOG_LEVEL to raise the HAMi-core log level and get more context.
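
For example, a minimal sketch (the value 4 matches the level used in the follow-up below): export the variable before re-running, or set it as an env entry on the training pod:

export LIBCUDA_LOG_LEVEL=4
python train.py config/train_shakespeare_char.py 2>&1 | tee output.txt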

haitwang-cloud commented 5 months ago

Appending the log after setting LIBCUDA_LOG_LEVEL to 4:

(base) (⎈|N/A:N/A)➜   cat output.txt | grep -i error
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlErrorString:2
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceClearEccErrorCounts:10
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceGetDetailedEccErrors:38
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceGetMemoryErrorCounter:67
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceGetNvLinkErrorCounter:75
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceGetTotalEccErrors:108
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceResetNvLinkErrorCounters:125
[HAMI-core Debug(492:140563747359616:libvgpu.c:79)]: into dlsym cuGetErrorString
[HAMI-core Debug(492:140563747359616:libvgpu.c:79)]: into dlsym cuGetErrorName
[HAMI-core Debug(492:140563747359616:libvgpu.c:79)]: into dlsym cuGetErrorString
[HAMI-core Debug(492:140563747359616:libvgpu.c:79)]: into dlsym cuGetErrorName
[HAMI-core Info(492:140563747359616:hook.c:343)]: into cuGetProcAddress_v2 symbol=cuGetErrorString:6000
[HAMI-core Info(492:140563747359616:hook.c:343)]: into cuGetProcAddress_v2 symbol=cuGetErrorName:6000
[HAMI-core Info(492:140563747359616:hook.c:343)]: into cuGetProcAddress_v2 symbol=cuGetErrorString:6000
[HAMI-core Info(492:140563747359616:hook.c:343)]: into cuGetProcAddress_v2 symbol=cuGetErrorName:6000
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
RuntimeError: Failed to find C compiler. Please specify via CC environment variable.
    torch._dynamo.config.suppress_errors = True

output.txt
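
As a side note, grep -i error here mostly matches HAMI-core symbol-loading lines rather than real failures; a slightly narrower filter (just a sketch) separates the Python error from the hook noise:

grep -inE "error|failed" output.txt | grep -viE "loading nvml|dlsym|cuGetProcAddress"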

coldzerofear commented 5 months ago

(quoting the HAMI-core log above, captured with LIBCUDA_LOG_LEVEL=4)

This looks more like a container environment issue (the traceback ends in a missing C compiler rather than a CUDA failure).

haitwang-cloud commented 5 months ago

Today I had an offline debugging session with @archlitchi. Despite setting CUDA_DISABLE_CONTROL to true and removing ld.so.preload from the GPU node, the issue persisted. We suspect this is because HAMi is using the v1.4.0 NVIDIA device plugin, which may be why nanoGPT cannot run. We need to install a clean NVIDIA device plugin v1.4.0 to confirm this. If our suspicion is correct, we may need to upgrade the NVIDIA device plugin in HAMi.
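
For anyone reproducing this, a way to confirm whether HAMi-core is actually injected into a given container (the library path below is an assumption; only /etc/ld.so.preload is referenced in this thread):

cat /etc/ld.so.preload 2>/dev/null             # should list libvgpu.so when HAMi-core is active
ls -l /usr/local/vgpu/libvgpu.so 2>/dev/null   # hypothetical mount path for the hook library
env | grep -E "CUDA_DISABLE_CONTROL|LIBCUDA"   # confirm the toggles actually reached the process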

haitwang-cloud commented 4 months ago

Confirmed that the same issue also occurs with version 0.14.0 of k8s-device-plugin, so we should update k8s-device-plugin to at least version 0.14.5.
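
For reference, one way to check which device-plugin image a cluster is actually running (the daemonset name and namespace below are assumptions and vary per install):

kubectl -n kube-system get ds nvidia-device-plugin-daemonset \
  -o jsonpath='{.spec.template.spec.containers[0].image}'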