NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Incompatibility issues in AWS H100 #761

Open perezpaznoemi opened 2 weeks ago

perezpaznoemi commented 2 weeks ago

Hi, we are facing an incompatibility issue and have tried different Ubuntu versions. Running hello-world in Docker works, and CUDA, the toolkit, and the drivers seem OK. I checked the libraries and those were fine (libnvidia-ml.so.1), but the OCI runtime step fails. Any idea?

ubuntu@ip-172-31-17-183:~$ docker run --gpus all -d -p 80:80 -e HF_TOKEN=ZXXXX767398115161.dkr.ecr.us-east-1.amazonaws.com/predictionaws3:latest
7ac5d43c43301058d56b098d19ab6f36683d1bd617361e677a4b4acc77be3cf3
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

ubuntu@ip-172-31-17-183:~$ nvidia-smi
Tue Oct 29 05:49:19 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:1E.0 Off |                    0 |
| N/A   20C    P8             10W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

ubuntu@ip-172-31-17-183:~$ docker run hello-world

Hello from Docker! This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:

  1. The Docker client contacted the Docker daemon.
  2. The Docker daemon pulled the "hello-world" image from the Docker Hub.

Docker container

libraries: libnvidia-ml.so, libnvidia-ml.so.1, libnvidia-ml.so.535.183.01, libnvidia-ml.so.550.127.05

FROM docker.io/nvidia/cuda:12.4.0-runtime-ubuntu20.04

# Install Python and pip
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Set the working directory
WORKDIR /data

# Copy input files and scripts
COPY md/1_medical.docx /data/input/
COPY md/1_genetic.csv /data/input/
COPY scripts/aws_md.py /data/scripts/
COPY requirements.txt /data/

# Install required Python packages
RUN pip3 install --no-cache-dir -r requirements.txt

# Set environment variables for input files (if needed)

(amd64)

elezar commented 2 days ago

@perezpaznoemi what is the output of:

nvidia-ctk --version
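
Alongside the version output, it may also help to check whether the dynamic loader on the host can open the library named in the error. The message comes from nvidia-container-cli during container creation, not from code inside the image. The following is only a minimal sketch, assuming Python 3 is available on the host. `can_load` is a hypothetical helper written for this illustration and is not part of the NVIDIA Container Toolkit.

```python
import ctypes


def can_load(soname: str) -> bool:
    """Return True if the dynamic loader can open the shared library `soname`.

    Hypothetical helper for illustration only; it relies on ctypes.CDLL,
    which raises OSError when the library cannot be found or loaded.
    """
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    # libnvidia-ml.so.1 is the library named in the error message above.
    for name in ("libnvidia-ml.so.1",):
        status = "loadable" if can_load(name) else "NOT loadable"
        print(f"{name}: {status}")
```

If the library is reported as not loadable on the host even though nvidia-smi works, that would suggest the process performing the check sees a different library search path than the shell does. This is an assumption to verify, not a diagnosis.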