NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Missing mount of nvoptix.bin from libnvidia-gl-535 #127

Open agirault opened 1 year ago

agirault commented 1 year ago

Enabling OptiX denoise requires the /usr/share/nvidia/nvoptix.bin file, which is installed as part of the libnvidia-gl-<ver> package but is not present in containers started with the nvidia ctk runtime.

Workaround for Holoscan: https://github.com/nvidia-holoscan/holohub/pull/112/files

Content of libnvidia-gl-535

dpkg -L libnvidia-gl-535 | xargs -I % sh -c '[ -f "%" ] && echo "%"'

Files not mounted with nvidia runtime

Run this command to test:

nv_gl_files=$(dpkg -L libnvidia-gl-535 | xargs -I % sh -c '[ -f "%" ] && echo "%"')
docker run -it --rm \
  --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=all --gpus=all \
  -e FILES="$nv_gl_files" \
  ubuntu:22.04 \
  bash -c '
    for file in $FILES; do
      [ ! -f "$file" ] && echo "Missing: $file"
    done
'

Observations

  1. Why are there .dll files on x86_64 (/wine/nvngx.dll)? Interestingly, there is no libnvidia-ngx.so.1 on x86_64 (unlike aarch64).
  2. The missing nvidia-ngx-updater, libnvidia-api.so.1 and libnvidia-vulkan-producer.so.535 only exist on x86_64. Is that expected? Do they need mounting?
  3. libnvidia-egl-gbm.so exists on both x86_64 and aarch64, but is missing only in aarch64 containers.
  4. nvidia_layers.json is in icd.d on aarch64 instead of implicit_layer.d as on x86_64. The former isn't mounted, while the latter is (see the check sketched below this list).
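
To double-check where those Vulkan manifests actually live on a given host, something like the following works (a hedged sketch; /usr/share/vulkan and /etc/vulkan are the standard Vulkan loader search paths, but driver packaging can vary):

# Locate the NVIDIA ICD and implicit-layer manifests in the usual loader directories
find /usr/share/vulkan /etc/vulkan \( -name 'nvidia_icd.json' -o -name 'nvidia_layers.json' \) -print 2>/dev/null
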
agirault commented 1 year ago

cc @AndreasHeumann @jjomier

elezar commented 1 year ago

@agirault thanks for reporting this. Looking at the list of files, I think adding the following is relatively straightforward:

The following (for aarch64) is also not really a problem:

With regards to the libnvidia-egl-gbm.so file: since the file actually included in the driver installation is libnvidia-egl-gbm.so.1.1.0, it would be good to understand which symlinks on the host (in either case) point to this file.

The same is required for libnvidia-api.so.1. Here it's key to know what this points to on the host, since it's expected to be a symbolic link.
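
One way to answer that on the host is something like this (a hedged sketch; the x86_64 multiarch lib directory is assumed and will differ on aarch64):

# Show the versioned files and any symlinks next to them
ls -l /usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so* /usr/lib/x86_64-linux-gnu/libnvidia-api.so* 2>/dev/null
# Find any symlink on the system that resolves to the versioned gbm library
find /usr/lib -lname '*libnvidia-egl-gbm.so.1.1.0' 2>/dev/null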

elezar commented 1 year ago

I have created https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/501 to add the processing of these files. If we can settle on a final list of the missing files that should be included, we can get that into an upcoming release candidate.

elezar commented 11 months ago

We have just released https://github.com/NVIDIA/nvidia-container-toolkit/releases/tag/v1.15.0-rc.1 that includes the injection of the nvoptix.bin file. The packages are available from our public experimental repositories.

Assuming these have been configured, running:

sudo apt-get install -y \
    nvidia-container-toolkit=1.15.0~rc.1-1 \
    nvidia-container-toolkit-base=1.15.0~rc.1-1  \
    libnvidia-container-tools=1.15.0~rc.1-1 \
    libnvidia-container1=1.15.0~rc.1-1

should install the required packages.
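
To verify the injection afterwards, something along these lines should show the file inside a fresh container (a hedged sketch; it assumes nvoptix.bin exists on the host and that the nvidia runtime is selected):

docker run --rm --runtime=nvidia --gpus=all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  ubuntu:22.04 \
  ls -l /usr/share/nvidia/nvoptix.bin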

elezar commented 10 months ago

Note that we have backported these changes to the release-1.14 branch and they are included in the v1.14.4 release.

@agirault if you get a chance to validate what is still missing, that would be great.

turowicz commented 7 months ago

I think the issue is back with the latest version:

apt list --installed | grep nvidia-container-toolkit

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

nvidia-container-toolkit-base/unknown,now 1.15.0-1 amd64 [installed,automatic]
nvidia-container-toolkit/unknown,now 1.15.0-1 amd64 [installed,automatic]

I recently started getting this in the Omniverse docker container for Isaac Sim:

Could not open optix denoiser weights file "/usr/share/nvidia/nvoptix.bin"
elezar commented 7 months ago

@turowicz first, could you confirm that the file exists on your host?

Then, which docker command are you running? Could you confirm that you are using the nvidia runtime and that the image has NVIDIA_DRIVER_CAPABILITIES=all set (alternatively add -e NVIDIA_DRIVER_CAPABILITIES=all to your docker command line).

The nvoptix.bin file is only injected if NVIDIA_DRIVER_CAPABILITIES includes graphics or display.
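
For illustration, a container started with the graphics capability should see the file, while a compute-only container should not (a hedged sketch; ubuntu:22.04 is just a stand-in image and a toolkit version that injects nvoptix.bin is assumed):

# graphics capability requested: nvoptix.bin should be injected
docker run --rm --runtime=nvidia --gpus=all \
  -e NVIDIA_DRIVER_CAPABILITIES=graphics,utility \
  ubuntu:22.04 ls -l /usr/share/nvidia/nvoptix.bin
# compute-only: the file is not expected to be present
docker run --rm --runtime=nvidia --gpus=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  ubuntu:22.04 ls -l /usr/share/nvidia/nvoptix.bin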

turowicz commented 7 months ago

Yes. To fix the error I had to add -v /usr/share/nvidia/nvoptix.bin:/usr/share/nvidia/nvoptix.bin. I am using the nvidia runtime through --gpus all. I don't use -e NVIDIA_DRIVER_CAPABILITIES=all and have never used it; it used to work fine without it.
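
As a plain docker command, the workaround above looks roughly like this (a hedged sketch; the image is the one named later in this thread and the trailing ls is only there to confirm the file is visible):

docker run --rm --gpus all \
  -v /usr/share/nvidia/nvoptix.bin:/usr/share/nvidia/nvoptix.bin \
  nvcr.io/nvidia/isaac-sim:2023.1.1 \
  ls -l /usr/share/nvidia/nvoptix.bin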

turowicz commented 7 months ago

updated my answer above

turowicz commented 7 months ago

Additionally: I am using nvcr.io/nvidia/isaac-sim:2023.1.1 and it used to work fine.

turowicz commented 7 months ago

I confirm the container nvcr.io/nvidia/isaac-sim:2023.1.1 has NVIDIA_DRIVER_CAPABILITIES=all set.
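
(Checked along these lines; a hedged sketch that assumes the image is available locally:)

docker inspect nvcr.io/nvidia/isaac-sim:2023.1.1 \
  --format '{{range .Config.Env}}{{println .}}{{end}}' | grep NVIDIA_DRIVER_CAPABILITIES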

elezar commented 7 months ago

@turowicz could you provide the full docker command you run?

turowicz commented 7 months ago

Here's the .devcontainer file:

// See https://aka.ms/vscode-remote/containers for the
// documentation about the devcontainer.json format
{
    "name": "surveily.omniverse",
    "build": {
        "dockerfile": "dockerfile"
    },
    "runArgs": [
        "--name",
        "surveily.omniverse",
        "-v",
        "${env:HOME}${env:USERPROFILE}/.ssh:/root/.ssh-localhost:ro",
        "-v",
        "/var/run/docker.sock:/var/run/docker.sock",
        "-v",
        "/usr/share/nvidia/nvoptix.bin:/usr/share/nvidia/nvoptix.bin",
        "--network",
        "host",
        "--gpus",
        "all",
        "-e",
        "ACCEPT_EULA=Y",
        "-e",
        "PRIVACY_CONSENT=N"
    ],
    "postCreateCommand": "mkdir -p ~/.ssh && cp -r ~/.ssh-localhost/* ~/.ssh && chmod 700 ~/.ssh && chmod 600 ~/.ssh/*",
    "appPort": [
        "5003:5003"
    ],
    "extensions": [
        "kosunix.guid",
        "redhat.vscode-yaml",
        "rogalmic.bash-debug",
        "mikeburgh.xml-format",
        "donjayamanne.githistory",
        "ms-azuretools.vscode-docker",
        "ms-azure-devops.azure-pipelines",
    ],
    "settings": {
        "extensions.autoUpdate": false,
        "files.exclude": {
            "**/CVS": true,
            "**/bin": true,
            "**/obj": true,
            "**/.hg": true,
            "**/.svn": true,
            "**/.git": true,
            "**/.DS_Store": true,
            "**/BenchmarkDotNet.Artifacts": true
        }
    },
    "shutdownAction": "stopContainer",
}

and the dockerfile:

FROM nvcr.io/nvidia/isaac-sim:2023.1.1

# Install tools
RUN apt update && apt install git vim -y

# Remove ROS/2 Bridge
RUN sed -i 's/ros_bridge_extension = "omni.isaac.ros2_bridge"/ros_bridge_extension = ""/g' /isaac-sim/apps/omni.isaac.sim.base.kit

# Toggle Grid Off
RUN sed -i '17i import omni.kit.viewport' /isaac-sim/extscache/omni.replicator.replicator_yaml-2.0.4+lx64/omni/replicator/replicator_yaml/scripts/replicator_yaml_extension.py
RUN sed -i '100i \ \ \ \ \ \ \ \ omni.kit.viewport.actions.actions.toggle_global_visibility(visible=False)' /isaac-sim/extscache/omni.replicator.replicator_yaml-2.0.4+lx64/omni/replicator/replicator_yaml/scripts/replicator_yaml_extension.py
turowicz commented 6 months ago

My workaround works, but you may want to fix the underlying problem.