Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0
874 stars 183 forks source link

Segmentation fault #461

Open hxh71 opened 2 months ago

hxh71 commented 2 months ago

1. Issue or feature description

只要带nvidia.com/gpu相关配置启动pod, 然后进入bash, 输入nvidia-smi ,就会报 Segmentation fault

2. Steps to reproduce the issue

  1. 使用 下面命令安装hami helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.30.2 -n kube-system

  2. 然后创建pod

    apiVersion: v1
    kind: Pod
    metadata:
    name: gpu-pod
    spec:
    containers:
    - name: gpu-pod
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 2 vGPUs
          nvidia.com/gpumem: 3000 # Each vGPU contains 3000m device memory (Optional,Integer)
  3. 进入bash,输入nvidia-smi

# nvidia-smi
Segmentation fault
  1. 但是使用nvidia 提供的 k8s-device-plugin,使用如下命令安装nvidia-device-plugin kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml 然后启动pod, 就能使用nvidia-smi命令查看gpu信息

3. Information to attach (optional if deemed irrelevant)

system:

WSL 版本: 2.2.4.0
内核版本: 5.15.153.1-2
WSLg 版本: 1.0.61
MSRDC 版本: 1.2.5326
Direct3D 版本: 1.611.1-81528511
DXCore 版本: 10.0.26091.1-240325-1447.ge-release
Windows 版本: 10.0.19045.2788

docker version:

Client:
 Version:           27.1.1
 API version:       1.46
 Go version:        go1.21.12
 Git commit:        6312585
 Built:             Tue Jul 23 19:57:57 2024
 OS/Arch:           windows/amd64
 Context:           desktop-linux

Server: Docker Desktop 4.33.1 (161083)
 Engine:
  Version:          27.1.1
  API version:      1.46 (minimum version 1.24)
  Go version:       go1.21.12
  Git commit:       cc13f95
  Built:            Tue Jul 23 19:57:19 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.19
  GitCommit:        2bf793ef6dc9a18e00cb12efb64355c2c9d5eb41
 nvidia:
  Version:          1.7.19
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

kube version:

Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.2

nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.67                 Driver Version: 536.67       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4060      WDDM  | 00000000:01:00.0  On |                  N/A |
|  0%   50C    P8              52W / 115W |    588MiB /  8188MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2192    C+G   ...\Docker\frontend\Docker Desktop.exe    N/A      |
|    0   N/A  N/A      2232    C+G   ...5n1h2txyewy\ShellExperienceHost.exe    N/A      |
|    0   N/A  N/A     10952    C+G   C:\Windows\explorer.exe                   N/A      |
|    0   N/A  N/A     11656    C+G   ...GeForce Experience\NVIDIA Share.exe    N/A      |
|    0   N/A  N/A     12208    C+G   ...2txyewy\StartMenuExperienceHost.exe    N/A      |
|    0   N/A  N/A     12664    C+G   ...CBS_cw5n1h2txyewy\TextInputHost.exe    N/A      |
|    0   N/A  N/A     13256    C+G   ...oogle\Chrome\Application\chrome.exe    N/A      |
|    0   N/A  N/A     14808    C+G   ....Search_cw5n1h2txyewy\SearchApp.exe    N/A      |
+---------------------------------------------------------------------------------------+

nvidia-smi -a:

==============NVSMI LOG==============

Timestamp                                 : Sun Sep  1 02:31:59 2024
Driver Version                            : 536.67
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA GeForce RTX 4060
    Product Brand                         : GeForce
    Product Architecture                  : Ada Lovelace
    Display Mode                          : Enabled
    Display Active                        : Enabled
    Persistence Mode                      : N/A
    Addressing Mode                       : N/A
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : WDDM
        Pending                           : WDDM
    Serial Number                         : N/A
    GPU UUID                              : GPU-9ed524ab-a2c9-5c44-d19d-1ca63d9a8e02
    Minor Number                          : N/A
    VBIOS Version                         : 95.07.31.00.5f
    MultiGPU Board                        : No
    Board ID                              : 0x100
    Board Part Number                     : N/A
    GPU Part Number                       : 2882-400-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G002.0000.00.03
        OEM Object                        : 2.0
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x288210DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x21007377
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 5
            Link Width
                Max                       : 8x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 23000 KB/s
        Rx Throughput                     : 4000 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : 0 %
    Performance State                     : P8
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 8188 MiB
        Reserved                          : 217 MiB
        Used                              : 706 MiB
        Free                              : 7264 MiB
    BAR1 Memory Usage
        Total                             : 8192 MiB
        Used                              : 1 MiB
        Free                              : 8191 MiB
    Conf Compute Protected Memory Usage
        Total                             : N/A
        Used                              : N/A
        Free                              : N/A
    Compute Mode                          : Default
    Utilization
        Gpu                               : 2 %
        Memory                            : 9 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 64 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 54 C
        GPU T.Limit Temp                  : 28 C
        GPU Shutdown T.Limit Temp         : -7 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating T.Limit Temp : N/A
    GPU Power Readings
        Power Draw                        : 52.70 W
        Current Power Limit               : 115.00 W
        Requested Power Limit             : 115.00 W
        Default Power Limit               : 115.00 W
        Min Power Limit                   : 90.00 W
        Max Power Limit                   : 125.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 1185 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 3105 MHz
        SM                                : 3105 MHz
        Memory                            : 8501 MHz
        Video                             : 2415 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 875.000 mV
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 4
            Type                          : C+G
            Name                          :
            Used GPU Memory               : Not available in WDDM driver model
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1384
            Type                          : C+G
            Name                          :
            Used GPU Memory               : Not available in WDDM driver model
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2192
            Type                          : C+G
            Name                          : C:\Program Files\Docker\Docker\frontend\Docker Desktop.exe
            Used GPU Memory               : Not available in WDDM driver model
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2232
            Type                          : C+G
            Name                          : C:\Windows\SystemApps\ShellExperienceHost_cw5n1h2txyewy\ShellExperienceHost.exe
            Used GPU Memory               : Not available in WDDM driver model
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 10952
            Type                          : C+G
            Name                          : C:\Windows\explorer.exe
            Used GPU Memory               : Not available in WDDM driver model
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 11656
            Type                          : C+G
            Name                          : C:\Program Files\NVIDIA Corporation\NVIDIA GeForce Experience\NVIDIA Share.exe
            Used GPU Memory               : Not available in WDDM driver model
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 12208
            Type                          : C+G
            Name                          : C:\Windows\SystemApps\Microsoft.Windows.StartMenuExperienceHost_cw5n1h2txyewy\StartMenuExperienceHost.exe
            Used GPU Memory               : Not available in WDDM driver model
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 12664
            Type                          : C+G
            Name                          : C:\Windows\SystemApps\MicrosoftWindows.Client.CBS_cw5n1h2txyewy\TextInputHost.exe
            Used GPU Memory               : Not available in WDDM driver model
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 12816
            Type                          : C+G
            Name                          : D:\feishu\app\Feishu.exe
            Used GPU Memory               : Not available in WDDM driver model
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 13256
            Type                          : C+G
            Name                          : C:\Program Files\Google\Chrome\Application\chrome.exe
            Used GPU Memory               : Not available in WDDM driver model
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 14808
            Type                          : C+G
            Name                          : C:\Windows\SystemApps\Microsoft.Windows.Search_cw5n1h2txyewy\SearchApp.exe
            Used GPU Memory               : Not available in WDDM driver model

hami containers: 20240901-022741

archlitchi commented 2 months ago

HAMi does not support windows yet