NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

nvidia-container-cli: mount error when running any nvidia-docker image; the bus ID in the mount path ("/proc/driver/nvidia/gpus/cfdd:00:00.0") is different from the available bus ID (03f8cfdd:00:00.0). #1316

Closed preethamgali closed 3 years ago

preethamgali commented 4 years ago

1. Issue or feature description

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: stat failed: /proc/driver/nvidia/gpus/cfdd:00:00.0: no such file or directory\\n\\"\"": unknown.

A similar issue was raised for an older version: https://github.com/NVIDIA/nvidia-docker/issues/816
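For anyone comparing this against their own setup, here is a minimal Python sketch that pulls the failing path and its bus ID out of the daemon error text. The function names `failing_path` and `bus_id_from_path` are made up for illustration and are not part of nvidia-docker or nvidia-container-cli; the regular expression only assumes the "stat failed: <path>: no such file or directory" wording seen in the error above.

```python
import re

# Hypothetical helper: extract the path that nvidia-container-cli failed to stat.
# Assumes the "stat failed: <path>: no such file or directory" wording from the error above.
def failing_path(stderr_text):
    match = re.search(r"stat failed: (\S+): no such file or directory", stderr_text)
    return match.group(1) if match else None

# Hypothetical helper: the bus ID is the last component of the /proc path.
def bus_id_from_path(path):
    return path.rsplit("/", 1)[-1]

stderr_text = (
    "nvidia-container-cli: mount error: stat failed: "
    "/proc/driver/nvidia/gpus/cfdd:00:00.0: no such file or directory"
)
path = failing_path(stderr_text)
print(path)
print(bus_id_from_path(path))
```

The extracted bus ID can then be compared by eye with the "Bus Location" values that `nvidia-container-cli` reports for each device.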

2. Steps to reproduce the issue

docker run --gpus all nvidia/cuda:10.0-base nvidia-smi

3. Information to attach (optional if deemed irrelevant)

I0617 13:09:03.865610 6184 nvc.c:281] initializing library context (version=1.1.1, build=e5d6156aba457559979597c8e3d22c5d8d0622db)
I0617 13:09:03.865778 6184 nvc.c:255] using root /
I0617 13:09:03.865783 6184 nvc.c:256] using ldcache /etc/ld.so.cache
I0617 13:09:03.865787 6184 nvc.c:257] using unprivileged user 1002:1002
W0617 13:09:03.868731 6184 nvc.c:171] failed to detect NVIDIA devices
W0617 13:09:03.869247 6185 nvc.c:186] failed to set inheritable capabilities
W0617 13:09:03.869294 6185 nvc.c:187] skipping kernel modules load due to failure
I0617 13:09:03.869879 6186 driver.c:101] starting driver service
I0617 13:09:05.245517 6184 nvc_info.c:541] requesting driver information with ''
I0617 13:09:05.247331 6184 nvc_info.c:155] selecting /usr/lib64/libnvoptix.so.440.64.00
I0617 13:09:05.247388 6184 nvc_info.c:155] selecting /usr/lib64/libnvidia-tls.so.440.64.00
I0617 13:09:05.247420 6184 nvc_info.c:155] selecting /usr/lib64/libnvidia-rtcore.so.440.64.00
I0617 13:09:05.247450 6184 nvc_info.c:155] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.440.64.00
I0617 13:09:05.247488 6184 nvc_info.c:155] selecting /usr/lib64/libnvidia-opticalflow.so.440.64.00
I0617 13:09:05.247515 6184 nvc_info.c:155] selecting /usr/lib64/libnvidia-opencl.so.440.64.00
I0617 13:09:05.247544 6184 nvc_info.c:155] selecting /usr/lib64/libnvidia-ml.so.440.64.00
I0617 13:09:05.247609 6184 nvc_info.c:155] selecting /usr/lib64/libnvidia-ifr.so.440.64.00
I0617 13:09:05.247661 6184 nvc_info.c:155] selecting /usr/lib64/libnvidia-glvkspirv.so.440.64.00
I0617 13:09:05.247690 6184 nvc_info.c:155] selecting /usr/lib64/libnvidia-glsi.so.440.64.00
I0617 13:09:05.247717 6184 nvc_info.c:155] selecting /usr/lib64/libnvidia-glcore.so.440.64.00
I0617 13:09:05.247744 6184 nvc_info.c:155] selecting /usr/lib64/libnvidia-fbc.so.440.64.00
I0617 13:09:05.247780 6184 nvc_info.c:155] selecting /usr/lib64/libnvidia-fatbinaryloader.so.440.64.00
I0617 13:09:05.247806 6184 nvc_info.c:155] selecting /usr/lib64/libnvidia-encode.so.440.64.00
I0617 13:09:05.247842 6184 nvc_info.c:155] selecting /usr/lib64/libnvidia-eglcore.so.440.64.00
I0617 13:09:05.247870 6184 nvc_info.c:155] selecting /usr/lib64/libnvidia-compiler.so.440.64.00
I0617 13:09:05.247897 6184 nvc_info.c:155] selecting /usr/lib64/libnvidia-cfg.so.440.64.00
I0617 13:09:05.247933 6184 nvc_info.c:155] selecting /usr/lib64/libnvidia-cbl.so.440.64.00
I0617 13:09:05.247960 6184 nvc_info.c:155] selecting /usr/lib64/libnvidia-allocator.so.440.64.00
I0617 13:09:05.247987 6184 nvc_info.c:155] selecting /usr/lib64/libnvcuvid.so.440.64.00
I0617 13:09:05.248138 6184 nvc_info.c:155] selecting /usr/lib64/libcuda.so.440.64.00
I0617 13:09:05.248219 6184 nvc_info.c:155] selecting /usr/lib64/libGLX_nvidia.so.440.64.00
I0617 13:09:05.248247 6184 nvc_info.c:155] selecting /usr/lib64/libGLESv2_nvidia.so.440.64.00
I0617 13:09:05.248275 6184 nvc_info.c:155] selecting /usr/lib64/libGLESv1_CM_nvidia.so.440.64.00
I0617 13:09:05.248303 6184 nvc_info.c:155] selecting /usr/lib64/libEGL_nvidia.so.440.64.00
W0617 13:09:05.248335 6184 nvc_info.c:306] missing library libvdpau_nvidia.so
W0617 13:09:05.248341 6184 nvc_info.c:310] missing compat32 library libnvidia-ml.so
W0617 13:09:05.248345 6184 nvc_info.c:310] missing compat32 library libnvidia-cfg.so
W0617 13:09:05.248349 6184 nvc_info.c:310] missing compat32 library libcuda.so
W0617 13:09:05.248353 6184 nvc_info.c:310] missing compat32 library libnvidia-opencl.so
W0617 13:09:05.248357 6184 nvc_info.c:310] missing compat32 library libnvidia-ptxjitcompiler.so
W0617 13:09:05.248361 6184 nvc_info.c:310] missing compat32 library libnvidia-fatbinaryloader.so
W0617 13:09:05.248365 6184 nvc_info.c:310] missing compat32 library libnvidia-allocator.so
W0617 13:09:05.248369 6184 nvc_info.c:310] missing compat32 library libnvidia-compiler.so
W0617 13:09:05.248373 6184 nvc_info.c:310] missing compat32 library libvdpau_nvidia.so
W0617 13:09:05.248377 6184 nvc_info.c:310] missing compat32 library libnvidia-encode.so
W0617 13:09:05.248381 6184 nvc_info.c:310] missing compat32 library libnvidia-opticalflow.so
W0617 13:09:05.248385 6184 nvc_info.c:310] missing compat32 library libnvcuvid.so
W0617 13:09:05.248389 6184 nvc_info.c:310] missing compat32 library libnvidia-eglcore.so
W0617 13:09:05.248393 6184 nvc_info.c:310] missing compat32 library libnvidia-glcore.so
W0617 13:09:05.248398 6184 nvc_info.c:310] missing compat32 library libnvidia-tls.so
W0617 13:09:05.248402 6184 nvc_info.c:310] missing compat32 library libnvidia-glsi.so
W0617 13:09:05.248406 6184 nvc_info.c:310] missing compat32 library libnvidia-fbc.so
W0617 13:09:05.248410 6184 nvc_info.c:310] missing compat32 library libnvidia-ifr.so
W0617 13:09:05.248414 6184 nvc_info.c:310] missing compat32 library libnvidia-rtcore.so
W0617 13:09:05.248418 6184 nvc_info.c:310] missing compat32 library libnvoptix.so
W0617 13:09:05.248422 6184 nvc_info.c:310] missing compat32 library libGLX_nvidia.so
W0617 13:09:05.248426 6184 nvc_info.c:310] missing compat32 library libEGL_nvidia.so
W0617 13:09:05.248430 6184 nvc_info.c:310] missing compat32 library libGLESv2_nvidia.so
W0617 13:09:05.248434 6184 nvc_info.c:310] missing compat32 library libGLESv1_CM_nvidia.so
W0617 13:09:05.248438 6184 nvc_info.c:310] missing compat32 library libnvidia-glvkspirv.so
W0617 13:09:05.248442 6184 nvc_info.c:310] missing compat32 library libnvidia-cbl.so
I0617 13:09:05.249007 6184 nvc_info.c:236] selecting /usr/bin/nvidia-smi
I0617 13:09:05.249030 6184 nvc_info.c:236] selecting /usr/bin/nvidia-debugdump
I0617 13:09:05.249046 6184 nvc_info.c:236] selecting /usr/bin/nvidia-persistenced
I0617 13:09:05.249061 6184 nvc_info.c:236] selecting /usr/bin/nvidia-cuda-mps-control
I0617 13:09:05.249076 6184 nvc_info.c:236] selecting /usr/bin/nvidia-cuda-mps-server
I0617 13:09:05.249098 6184 nvc_info.c:373] listing device /dev/nvidiactl
I0617 13:09:05.249102 6184 nvc_info.c:373] listing device /dev/nvidia-uvm
I0617 13:09:05.249106 6184 nvc_info.c:373] listing device /dev/nvidia-uvm-tools
I0617 13:09:05.249110 6184 nvc_info.c:373] listing device /dev/nvidia-modeset
W0617 13:09:05.249586 6184 nvc_info.c:281] missing ipc /var/run/nvidia-persistenced/socket
W0617 13:09:05.249866 6184 nvc_info.c:281] missing ipc /tmp/nvidia-mps
I0617 13:09:05.249884 6184 nvc_info.c:598] requesting device information with ''
I0617 13:09:05.255858 6184 nvc_info.c:637] listing device /dev/nvidia0 (GPU-c9be514f-ce21-66b1-3e93-81bd2147376d at 03f8cfdd:00:00.0)
I0617 13:09:05.261703 6184 nvc_info.c:637] listing device /dev/nvidia1 (GPU-cb2fd7b9-a1d8-b47e-b331-2abe34b99a02 at 03f8f31a:00:00.0)
NVRM version: 440.64.00
CUDA version: 10.2

Device Index: 0
Device Minor: 0
Model: Tesla P100-PCIE-16GB
Brand: Tesla
GPU UUID: GPU-c9be514f-ce21-66b1-3e93-81bd2147376d
Bus Location: 03f8cfdd:00:00.0
Architecture: 6.0

Device Index: 1
Device Minor: 1
Model: Tesla P100-PCIE-16GB
Brand: Tesla
GPU UUID: GPU-cb2fd7b9-a1d8-b47e-b331-2abe34b99a02
Bus Location: 03f8f31a:00:00.0
Architecture: 6.0
I0617 13:09:05.261747 6184 nvc.c:318] shutting down library context
I0617 13:09:05.798934 6186 driver.c:156] terminating driver service
I0617 13:09:05.799328 6184 driver.c:196] driver service terminated successfully
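To illustrate the mismatch between the reported bus location (03f8cfdd:00:00.0) and the path in the error (cfdd:00:00.0), here is a small Python sketch. It rests on an assumption that this report does not confirm: that the path is built from a PCI domain truncated to 16 bits. The helpers `parse_bus_id` and `proc_gpu_path` are hypothetical and only mimic the path layout seen in the error message; they are not the libnvidia-container implementation.

```python
import re

# domain:bus:device.function, all fields hexadecimal
BUS_ID_RE = re.compile(r"^([0-9a-fA-F]+):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$")

def parse_bus_id(bus_id):
    """Split a PCI bus ID string into integer (domain, bus, device, function)."""
    match = BUS_ID_RE.match(bus_id)
    if match is None:
        raise ValueError(f"not a PCI bus ID: {bus_id!r}")
    return tuple(int(group, 16) for group in match.groups())

def proc_gpu_path(bus_id, domain_bits=32):
    """Build a /proc path for the GPU, keeping only `domain_bits` bits of the domain.

    Assumption for illustration: a 16-bit domain field reproduces the failing path.
    """
    domain, bus, device, function = parse_bus_id(bus_id)
    domain &= (1 << domain_bits) - 1   # drop the upper bits of the domain
    width = domain_bits // 4           # hex digits needed for the domain
    return f"/proc/driver/nvidia/gpus/{domain:0{width}x}:{bus:02x}:{device:02x}.{function:x}"

reported = "03f8cfdd:00:00.0"              # Bus Location from the log above
print(proc_gpu_path(reported, 32))         # path with the full 32-bit domain
print(proc_gpu_path(reported, 16))         # path with the domain cut to 16 bits
```

With a 16-bit domain the sketch yields the exact path from the error message, which is consistent with (but does not prove) a truncated PCI domain being the cause.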


 - [ ] Kernel version from `uname -a`

Linux AZICT00001 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux


 - [ ] Any relevant kernel output lines from `dmesg`

[ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Initializing cgroup subsys cpuacct [ 0.000000] Linux version 3.10.0-693.21.1.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Wed Mar 7 19:03:37 UTC 2018 [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.21.1.el7.x86_64 root=UUID=9d77033b-4696-4873-b883-a3124b94db64 ro console=tty1 console=ttyS0,115200n8 earlyprintk=ttyS0,115200 rootdelay=300 net.ifnames=0 nouveau.modeset=0 rd.driver.blacklist=nouveau [ 0.000000] e820: BIOS-provided physical RAM map: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable [ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003ffeffff] usable [ 0.000000] BIOS-e820: [mem 0x000000003fff0000-0x000000003fffefff] ACPI data [ 0.000000] BIOS-e820: [mem 0x000000003ffff000-0x000000003fffffff] ACPI NVS [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x0000000fdfffffff] usable [ 0.000000] BIOS-e820: [mem 0x0000004fe0000000-0x00000078bfffffff] usable [ 0.000000] bootconsole [earlyser0] enabled [ 0.000000] NX (Execute Disable) protection: active [ 0.000000] SMBIOS 2.3 present. 
[ 0.000000] DMI: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017 [ 0.000000] Hypervisor detected: Microsoft HyperV [ 0.000000] HyperV: features 0x2e7f, hints 0x40c2c [ 0.000000] HyperV: LAPIC Timer Frequency: 0x30d40 [ 0.000000] tsc: Marking TSC unstable due to running on Hyper-V [ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved [ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable [ 0.000000] e820: last_pfn = 0x78c0000 max_arch_pfn = 0x400000000 [ 0.000000] MTRR default type: uncachable [ 0.000000] MTRR fixed ranges enabled: [ 0.000000] 00000-9FFFF write-back [ 0.000000] A0000-DFFFF uncachable [ 0.000000] E0000-FFFFF write-back [ 0.000000] MTRR variable ranges enabled: [ 0.000000] 0 base 00000000000 mask FFFC0000000 write-back [ 0.000000] 1 base 00100000000 mask FF000000000 write-back [ 0.000000] 2 base 04FE0000000 mask 80000000000 write-back [ 0.000000] 3 base 80000000000 mask 00000000000 write-back [ 0.000000] 4 disabled [ 0.000000] 5 disabled [ 0.000000] 6 disabled [ 0.000000] 7 disabled [ 0.000000] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106 [ 0.000000] e820: update [mem 0x40000000-0xffffffff] usable ==> reserved [ 0.000000] e820: update [mem 0x1100000000-0x4fdfffffff] usable ==> reserved [ 0.000000] e820: last_pfn = 0x3fff0 max_arch_pfn = 0x400000000 [ 0.000000] found SMP MP-table at [mem 0x000ff780-0x000ff78f] mapped at [ffff8800000ff780] [ 0.000000] Base memory trampoline at [ffff880000099000] 99000 size 24576 [ 0.000000] Using GB pages for direct mapping [ 0.000000] BRK [0x02002000, 0x02002fff] PGTABLE [ 0.000000] BRK [0x02003000, 0x02003fff] PGTABLE [ 0.000000] BRK [0x02004000, 0x02004fff] PGTABLE [ 0.000000] BRK [0x02005000, 0x02005fff] PGTABLE [ 0.000000] BRK [0x02006000, 0x02006fff] PGTABLE [ 0.000000] RAMDISK: [mem 0x35b70000-0x36daffff] [ 0.000000] Early table checksum verification disabled [ 0.000000] ACPI: RSDP 00000000000f5bf0 00014 (v00 ACPIAM) [ 0.000000] ACPI: RSDT 
000000003fff0000 00040 (v01 VRTUAL MICROSFT 06001702 MSFT 00000097) [ 0.000000] ACPI: FACP 000000003fff0200 00081 (v02 VRTUAL MICROSFT 06001702 MSFT 00000097) [ 0.000000] ACPI: DSDT 000000003fff1d24 03CBE (v01 MSFTVM MSFTVM02 00000002 INTL 02002026) [ 0.000000] ACPI: FACS 000000003ffff000 00040 [ 0.000000] ACPI: WAET 000000003fff1a80 00028 (v01 VRTUAL MICROSFT 06001702 MSFT 00000097) [ 0.000000] ACPI: SLIC 000000003fff1ac0 00176 (v01 VRTUAL MICROSFT 06001702 MSFT 00000097) [ 0.000000] ACPI: OEM0 000000003fff1cc0 00064 (v01 VRTUAL MICROSFT 06001702 MSFT 00000097) [ 0.000000] ACPI: SRAT 000000003fff0800 001E0 (v02 VRTUAL MICROSFT 00000001 MSFT 00000001) [ 0.000000] ACPI: APIC 000000003fff0300 00452 (v01 VRTUAL MICROSFT 06001702 MSFT 00000097) [ 0.000000] ACPI: OEMB 000000003ffff040 00064 (v01 VRTUAL MICROSFT 06001702 MSFT 00000097) [ 0.000000] ACPI: Local APIC address 0xfee00000 [ 0.000000] SRAT: PXM 0 -> APIC 0x00 -> Node 0 [ 0.000000] SRAT: PXM 0 -> APIC 0x01 -> Node 0 [ 0.000000] SRAT: PXM 0 -> APIC 0x02 -> Node 0 [ 0.000000] SRAT: PXM 0 -> APIC 0x03 -> Node 0 [ 0.000000] SRAT: PXM 0 -> APIC 0x04 -> Node 0 [ 0.000000] SRAT: PXM 0 -> APIC 0x05 -> Node 0 [ 0.000000] SRAT: PXM 0 -> APIC 0x06 -> Node 0 [ 0.000000] SRAT: PXM 0 -> APIC 0x07 -> Node 0 [ 0.000000] SRAT: PXM 0 -> APIC 0x08 -> Node 0 [ 0.000000] SRAT: PXM 0 -> APIC 0x09 -> Node 0 [ 0.000000] SRAT: PXM 0 -> APIC 0x0a -> Node 0 [ 0.000000] SRAT: PXM 0 -> APIC 0x0b -> Node 0 [ 0.000000] SRAT: Node 0 PXM 0 [mem 0x00000000-0x3fffffff] hotplug [ 0.000000] SRAT: Node 0 PXM 0 [mem 0x100000000-0xfdfffffff] hotplug [ 0.000000] SRAT: Node 0 PXM 0 [mem 0x4fe0000000-0x78bfffffff] hotplug [ 0.000000] SRAT: Node 0 PXM 0 [mem 0x78c0200000-0xffffffffff] hotplug [ 0.000000] SRAT: Node 0 PXM 0 [mem 0x10000200000-0x1ffffffffff] hotplug [ 0.000000] SRAT: Node 0 PXM 0 [mem 0x20000200000-0x3ffffffffff] hotplug [ 0.000000] NUMA: Node 0 [mem 0x00000000-0x3fffffff] + [mem 0x100000000-0xfdfffffff] -> [mem 0x00000000-0xfdfffffff] [ 
0.000000] NUMA: Node 0 [mem 0x00000000-0xfdfffffff] + [mem 0x4fe0000000-0x78bfffffff] -> [mem 0x00000000-0x78bfffffff] [ 0.000000] NODE_DATA(0) allocated [mem 0x78bffd8000-0x78bfffefff] [ 0.000000] Zone ranges: [ 0.000000] DMA [mem 0x00001000-0x00ffffff] [ 0.000000] DMA32 [mem 0x01000000-0xffffffff] [ 0.000000] Normal [mem 0x100000000-0x78bfffffff] [ 0.000000] Movable zone start for each node [ 0.000000] Early memory node ranges [ 0.000000] node 0: [mem 0x00001000-0x0009efff] [ 0.000000] node 0: [mem 0x00100000-0x3ffeffff] [ 0.000000] node 0: [mem 0x100000000-0xfdfffffff] [ 0.000000] node 0: [mem 0x4fe0000000-0x78bfffffff] [ 0.000000] Initmem setup node 0 [mem 0x00001000-0x78bfffffff] [ 0.000000] On node 0 totalpages: 58720142 [ 0.000000] DMA zone: 64 pages used for memmap [ 0.000000] DMA zone: 21 pages reserved [ 0.000000] DMA zone: 3998 pages, LIFO batch:0 [ 0.000000] DMA32 zone: 4032 pages used for memmap [ 0.000000] DMA32 zone: 258032 pages, LIFO batch:31 [ 0.000000] Normal zone: 913408 pages used for memmap [ 0.000000] Normal zone: 58458112 pages, LIFO batch:31 [ 0.000000] ACPI: PM-Timer IO Port: 0x408 [ 0.000000] ACPI: Local APIC address 0xfee00000 [ 0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x04] lapic_id[0x03] enabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x05] lapic_id[0x04] enabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x06] lapic_id[0x05] enabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x07] lapic_id[0x06] enabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x08] lapic_id[0x07] enabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x09] lapic_id[0x08] enabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x09] enabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x0a] enabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x0b] enabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x0d] lapic_id[0x0c] disabled) [ 
0.000000] ACPI: LAPIC (acpi_id[0x0e] lapic_id[0x0d] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x0f] lapic_id[0x0e] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x10] lapic_id[0x0f] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x11] lapic_id[0x10] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x12] lapic_id[0x11] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x13] lapic_id[0x12] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x14] lapic_id[0x13] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x15] lapic_id[0x14] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x16] lapic_id[0x15] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x17] lapic_id[0x16] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x18] lapic_id[0x17] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x19] lapic_id[0x18] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x1a] lapic_id[0x19] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x1b] lapic_id[0x1a] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x1c] lapic_id[0x1b] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x1d] lapic_id[0x1c] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x1e] lapic_id[0x1d] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x1f] lapic_id[0x1e] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x20] lapic_id[0x1f] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x21] lapic_id[0x20] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x22] lapic_id[0x21] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x23] lapic_id[0x22] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x24] lapic_id[0x23] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x25] lapic_id[0x24] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x26] lapic_id[0x25] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x27] lapic_id[0x26] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x28] lapic_id[0x27] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x29] lapic_id[0x28] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x2a] lapic_id[0x29] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x2b] lapic_id[0x2a] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x2c] lapic_id[0x2b] disabled) [ 0.000000] ACPI: 
LAPIC (acpi_id[0x2d] lapic_id[0x2c] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x2e] lapic_id[0x2d] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x2f] lapic_id[0x2e] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x30] lapic_id[0x2f] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x31] lapic_id[0x30] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x32] lapic_id[0x31] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x33] lapic_id[0x32] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x34] lapic_id[0x33] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x35] lapic_id[0x34] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x36] lapic_id[0x35] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x37] lapic_id[0x36] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x38] lapic_id[0x37] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x39] lapic_id[0x38] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x3a] lapic_id[0x39] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x3b] lapic_id[0x3a] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x3c] lapic_id[0x3b] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x3d] lapic_id[0x3c] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x3e] lapic_id[0x3d] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x3f] lapic_id[0x3e] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x40] lapic_id[0x3f] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x41] lapic_id[0x40] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x42] lapic_id[0x41] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x43] lapic_id[0x42] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x44] lapic_id[0x43] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x45] lapic_id[0x44] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x46] lapic_id[0x45] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x47] lapic_id[0x46] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x48] lapic_id[0x47] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x49] lapic_id[0x48] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x4a] lapic_id[0x49] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x4b] lapic_id[0x4a] disabled) [ 0.000000] ACPI: LAPIC 
(acpi_id[0x4c] lapic_id[0x4b] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x4d] lapic_id[0x4c] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x4e] lapic_id[0x4d] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x4f] lapic_id[0x4e] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x50] lapic_id[0x4f] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x51] lapic_id[0x50] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x52] lapic_id[0x51] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x53] lapic_id[0x52] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x54] lapic_id[0x53] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x55] lapic_id[0x54] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x56] lapic_id[0x55] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x57] lapic_id[0x56] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x58] lapic_id[0x57] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x59] lapic_id[0x58] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x5a] lapic_id[0x59] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x5b] lapic_id[0x5a] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x5c] lapic_id[0x5b] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x5d] lapic_id[0x5c] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x5e] lapic_id[0x5d] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x5f] lapic_id[0x5e] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x60] lapic_id[0x5f] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x61] lapic_id[0x60] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x62] lapic_id[0x61] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x63] lapic_id[0x62] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x64] lapic_id[0x63] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x65] lapic_id[0x64] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x66] lapic_id[0x65] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x67] lapic_id[0x66] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x68] lapic_id[0x67] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x69] lapic_id[0x68] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x6a] lapic_id[0x69] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x6b] 
lapic_id[0x6a] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x6c] lapic_id[0x6b] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x6d] lapic_id[0x6c] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x6e] lapic_id[0x6d] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x6f] lapic_id[0x6e] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x70] lapic_id[0x6f] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x71] lapic_id[0x70] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x72] lapic_id[0x71] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x73] lapic_id[0x72] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x74] lapic_id[0x73] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x75] lapic_id[0x74] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x76] lapic_id[0x75] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x77] lapic_id[0x76] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x78] lapic_id[0x77] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x79] lapic_id[0x78] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x7a] lapic_id[0x79] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x7b] lapic_id[0x7a] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x7c] lapic_id[0x7b] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x7d] lapic_id[0x7c] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x7e] lapic_id[0x7d] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x7f] lapic_id[0x7e] disabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x80] lapic_id[0x7f] disabled) [ 0.000000] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1]) [ 0.000000] ACPI: IOAPIC (id[0x00] address[0xfec00000] gsi_base[0]) [ 0.000000] IOAPIC[0]: apic_id 0, version 17, address 0xfec00000, GSI 0-23 [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) [ 0.000000] ACPI: IRQ0 used by override. [ 0.000000] ACPI: IRQ9 used by override. 
[ 0.000000] Using ACPI (MADT) for SMP configuration information [ 0.000000] smpboot: Allowing 128 CPUs, 116 hotplug CPUs [ 0.000000] PM: Registered nosave memory: [mem 0x0009f000-0x0009ffff] [ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000dffff] [ 0.000000] PM: Registered nosave memory: [mem 0x000e0000-0x000fffff] [ 0.000000] PM: Registered nosave memory: [mem 0x3fff0000-0x3fffefff] [ 0.000000] PM: Registered nosave memory: [mem 0x3ffff000-0x3fffffff] [ 0.000000] PM: Registered nosave memory: [mem 0x40000000-0xffffffff] [ 0.000000] PM: Registered nosave memory: [mem 0xfe0000000-0x4fdfffffff] [ 0.000000] e820: [mem 0x40000000-0xffffffff] available for PCI devices [ 0.000000] Booting paravirtualized kernel on bare hardware [ 0.000000] setup_percpu: NR_CPUS:5120 nr_cpumask_bits:128 nr_cpu_ids:128 nr_node_ids:1 [ 0.000000] PERCPU: Embedded 35 pages/cpu @ffff8877dd600000 s104600 r8192 d30568 u262144 [ 0.000000] pcpu-alloc: s104600 r8192 d30568 u262144 alloc=1*2097152 [ 0.000000] pcpu-alloc: [0] 000 001 002 003 004 005 006 007 [ 0.000000] pcpu-alloc: [0] 008 009 010 011 012 013 014 015 [ 0.000000] pcpu-alloc: [0] 016 017 018 019 020 021 022 023 [ 0.000000] pcpu-alloc: [0] 024 025 026 027 028 029 030 031 [ 0.000000] pcpu-alloc: [0] 032 033 034 035 036 037 038 039 [ 0.000000] pcpu-alloc: [0] 040 041 042 043 044 045 046 047 [ 0.000000] pcpu-alloc: [0] 048 049 050 051 052 053 054 055 [ 0.000000] pcpu-alloc: [0] 056 057 058 059 060 061 062 063 [ 0.000000] pcpu-alloc: [0] 064 065 066 067 068 069 070 071 [ 0.000000] pcpu-alloc: [0] 072 073 074 075 076 077 078 079 [ 0.000000] pcpu-alloc: [0] 080 081 082 083 084 085 086 087 [ 0.000000] pcpu-alloc: [0] 088 089 090 091 092 093 094 095 [ 0.000000] pcpu-alloc: [0] 096 097 098 099 100 101 102 103 [ 0.000000] pcpu-alloc: [0] 104 105 106 107 108 109 110 111 [ 0.000000] pcpu-alloc: [0] 112 113 114 115 116 117 118 119 [ 0.000000] pcpu-alloc: [0] 120 121 122 123 124 125 126 127 [ 0.000000] Built 1 zonelists in Zone order, 
mobility grouping on. Total pages: 57802617 [ 0.000000] Policy zone: Normal [ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.21.1.el7.x86_64 root=UUID=9d77033b-4696-4873-b883-a3124b94db64 ro console=tty1 console=ttyS0,115200n8 earlyprintk=ttyS0,115200 rootdelay=300 net.ifnames=0 nouveau.modeset=0 rd.driver.blacklist=nouveau [ 0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)

[ 0.000000] xsave: enabled xstate_bv 0x7, cntxt size 0x340 using standard form [ 0.000000] Memory: 5010572k/506462208k available (6940k kernel code, 271581640k absent, 3789744k reserved, 4560k data, 1792k init) [ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=128, Nodes=1 [ 0.000000] x86/pti: Unmapping kernel while in userspace [ 0.000000] Hierarchical RCU implementation. [ 0.000000] RCU restricting CPUs from NR_CPUS=5120 to nr_cpu_ids=128. [ 0.000000] NR_IRQS:327936 nr_irqs:1448 0 [ 0.000000] Console: colour VGA+ 80x25 [ 0.000000] console [tty1] enabled [ 0.000000] console [ttyS0] enabled, bootconsole disabled [ 0.000000] allocated 939524096 bytes of page_cgroup [ 0.000000] please try 'cgroup_disable=memory' option if you don't want memory cgroups [ 0.000000] tsc: Fast TSC calibration failed [ 0.000000] tsc: Unable to calibrate against PIT [ 0.000000] tsc: using PMTIMER reference calibration [ 0.000000] tsc: Detected 2593.971 MHz processor [ 0.007046] Calibrating delay loop (skipped), value calculated using timer frequency.. 5187.94 BogoMIPS (lpj=2593971) [ 0.009016] pid_max: default: 131072 minimum: 1024 [ 0.010161] Security Framework initialized [ 0.011021] SELinux: Initializing. [ 0.012040] SELinux: Starting in permissive mode [ 0.012041] Yama: becoming mindful. 
[ 0.028313] Dentry cache hash table entries: 33554432 (order: 16, 268435456 bytes) [ 0.073342] Inode-cache hash table entries: 16777216 (order: 15, 134217728 bytes) [ 0.091752] Mount-cache hash table entries: 524288 (order: 10, 4194304 bytes) [ 0.092231] Mountpoint-cache hash table entries: 524288 (order: 10, 4194304 bytes) [ 0.094947] Initializing cgroup subsys memory [ 0.095036] Initializing cgroup subsys devices [ 0.096019] Initializing cgroup subsys freezer [ 0.097016] Initializing cgroup subsys net_cls [ 0.098015] Initializing cgroup subsys blkio [ 0.099015] Initializing cgroup subsys perf_event [ 0.100024] Initializing cgroup subsys hugetlb [ 0.101016] Initializing cgroup subsys pids [ 0.102015] Initializing cgroup subsys net_prio [ 0.103102] CPU: Physical Processor ID: 0 [ 0.104015] CPU: Processor Core ID: 0 [ 0.105743] mce: CPU supports 1 MCE banks [ 0.106037] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0 [ 0.107014] Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0 [ 0.108015] tlb_flushall_shift: 6 [ 0.109014] FEATURE SPEC_CTRL Not Present [ 0.110015] FEATURE IBPB_SUPPORT Not Present [ 0.111145] Spectre V2 : Vulnerable: Retpoline without IBPB [ 0.113547] Freeing SMP alternatives: 24k freed [ 0.118672] ACPI: Core revision 20130517 [ 0.120368] ACPI: All ACPI Tables successfully acquired [ 0.122763] ftrace: allocating 26646 entries in 105 pages [ 0.137352] Switched APIC routing to physical flat. [ 0.156000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1 [ 0.156001] smpboot: CPU0: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz (fam: 06, model: 4f, stepping: 01) [ 0.160140] Performance Events: unsupported p6 CPU model 79 no PMU driver, software events only. 
[ 0.165904] NMI watchdog: disabled (cpu0): hardware events not enabled [ 0.166001] NMI watchdog: Shutting down hard lockup detector on all cpus [ 0.167080] smpboot: Booting Node 0, Processors #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 [ 0.188583] Brought up 12 CPUs [ 0.190001] smpboot: Max logical packages: 11 [ 0.191002] smpboot: Total of 12 processors activated (62255.30 BogoMIPS) [ 1.038583] node 0 initialised, 56519037 pages in 842ms [ 1.042129] devtmpfs: initialized [ 1.055964] EVM: security.selinux [ 1.056001] EVM: security.ima [ 1.057001] EVM: security.capability [ 1.058049] PM: Registering ACPI NVS region [mem 0x3ffff000-0x3fffffff] (4096 bytes) [ 1.061022] atomic64 test passed for x86-64 platform with CX8 and with SSE [ 1.062003] pinctrl core: initialized pinctrl subsystem [ 1.083687] RTC time: 12:02:20, date: 06/17/20 [ 1.084132] NET: Registered protocol family 16 [ 1.085172] ACPI: bus type PCI registered [ 1.086005] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5 [ 1.087335] PCI: Using configuration type 1 for base access [ 1.094098] ACPI: Added _OSI(Module Device) [ 1.095002] ACPI: Added _OSI(Processor Device) [ 1.096001] ACPI: Added _OSI(3.0 _SCP Extensions) [ 1.097001] ACPI: Added _OSI(Processor Aggregator Device) [ 1.100003] ACPI: EC: Look up EC in DSDT [ 1.101636] ACPI: Interpreter enabled [ 1.102008] ACPI: (supports S0 S5) [ 1.103001] ACPI: Using IOAPIC for interrupt routing [ 1.104017] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug [ 1.116514] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff]) [ 1.117005] acpi PNP0A03:00: _OSC: OS supports [ASPM ClockPM Segments MSI] [ 1.118004] acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM [ 1.119027] acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge. 
[ 1.120089] PCI host bridge to bus 0000:00 [ 1.121003] pci_bus 0000:00: root bus resource [bus 00-ff] [ 1.122002] pci_bus 0000:00: root bus resource [mem 0xfe0000000-0x4fdfffffff window] [ 1.123002] pci_bus 0000:00: root bus resource [io 0x0000-0x0cf7 window] [ 1.124003] pci_bus 0000:00: root bus resource [io 0x0d00-0xffff window] [ 1.125002] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window] [ 1.126002] pci_bus 0000:00: root bus resource [mem 0x40000000-0xfffbffff window] [ 1.128003] pci 0000:00:00.0: [8086:7192] type 00 class 0x060000 [ 1.129363] pci 0000:00:07.0: [8086:7110] type 00 class 0x060100 [ 1.131196] pci 0000:00:07.1: [8086:7111] type 00 class 0x010180 [ 1.132497] pci 0000:00:07.1: reg 0x20: [io 0xffa0-0xffaf] [ 1.133086] pci 0000:00:07.1: legacy IDE quirk: reg 0x10: [io 0x01f0-0x01f7] [ 1.134002] pci 0000:00:07.1: legacy IDE quirk: reg 0x14: [io 0x03f6] [ 1.135003] pci 0000:00:07.1: legacy IDE quirk: reg 0x18: [io 0x0170-0x0177] [ 1.136001] pci 0000:00:07.1: legacy IDE quirk: reg 0x1c: [io 0x0376] [ 1.137375] pci 0000:00:07.3: [8086:7113] type 00 class 0x068000 [ 1.137405] * Found PM-Timer Bug on the chipset. Due to workarounds for a bug,

==============NVSMI LOG==============

Timestamp                           : Wed Jun 17 13:18:19 2020
Driver Version                      : 440.64.00
CUDA Version                        : 10.2

Attached GPUs                       : 2
GPU 03F8CFDD:00:00.0
    Product Name                    : Tesla P100-PCIE-16GB
    Product Brand                   : Tesla
    Display Mode                    : Enabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0322617142773
    GPU UUID                        : GPU-c9be514f-ce21-66b1-3e93-81bd2147376d
    Minor Number                    : 0
    VBIOS Version                   : 86.00.41.00.06
    MultiGPU Board                  : No
    Board ID                        : 0xcfdd0000
    GPU Part Number                 : 900-2H400-0000-000
    Inforom Version
        Image Version               : H400.0201.00.08
        OEM Object                  : 1.1
        ECC Object                  : 4.1
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization Mode         : Pass-Through
        Host VGPU Mode              : N/A
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x00
        Device                      : 0x00
        Domain                      : 0x3F8CFDD
        Device Id                   : 0x15F810DE
        Bus Id                      : 03F8CFDD:00:00.0
        Sub System Id               : 0x118F10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : N/A
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 16280 MiB
        Used                        : 0 MiB
        Free                        : 16280 MiB
    BAR1 Memory Usage
        Total                       : 16384 MiB
        Used                        : 2 MiB
        Free                        : 16382 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : Disabled
        Pending                     : Disabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
        Aggregate
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending Page Blacklist      : No
    Temperature
        GPU Current Temp            : 28 C
        GPU Shutdown Temp           : 85 C
        GPU Slowdown Temp           : 82 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 33.77 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 250.00 W
    Clocks
        Graphics                    : 1189 MHz
        SM                          : 1189 MHz
        Memory                      : 715 MHz
        Video                       : 1063 MHz
    Applications Clocks
        Graphics                    : 1189 MHz
        Memory                      : 715 MHz
    Default Applications Clocks
        Graphics                    : 1189 MHz
        Memory                      : 715 MHz
    Max Clocks
        Graphics                    : 1328 MHz
        SM                          : 1328 MHz
        Memory                      : 715 MHz
        Video                       : 1328 MHz
    Max Customer Boost Clocks
        Graphics                    : 1328 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

GPU 03F8F31A:00:00.0
    Product Name                    : Tesla P100-PCIE-16GB
    Product Brand                   : Tesla
    Display Mode                    : Enabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0322617143819
    GPU UUID                        : GPU-cb2fd7b9-a1d8-b47e-b331-2abe34b99a02
    Minor Number                    : 1
    VBIOS Version                   : 86.00.41.00.06
    MultiGPU Board                  : No
    Board ID                        : 0xf31a0000
    GPU Part Number                 : 900-2H400-0000-000
    Inforom Version
        Image Version               : H400.0201.00.08
        OEM Object                  : 1.1
        ECC Object                  : 4.1
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization Mode         : Pass-Through
        Host VGPU Mode              : N/A
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x00
        Device                      : 0x00
        Domain                      : 0x3F8F31A
        Device Id                   : 0x15F810DE
        Bus Id                      : 03F8F31A:00:00.0
        Sub System Id               : 0x118F10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : N/A
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 16280 MiB
        Used                        : 0 MiB
        Free                        : 16280 MiB
    BAR1 Memory Usage
        Total                       : 16384 MiB
        Used                        : 2 MiB
        Free                        : 16382 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 1 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : Disabled
        Pending                     : Disabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
        Aggregate
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending Page Blacklist      : No
    Temperature
        GPU Current Temp            : 26 C
        GPU Shutdown Temp           : 85 C
        GPU Slowdown Temp           : 82 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 31.84 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 250.00 W
    Clocks
        Graphics                    : 1189 MHz
        SM                          : 1189 MHz
        Memory                      : 715 MHz
        Video                       : 1075 MHz
    Applications Clocks
        Graphics                    : 1189 MHz
        Memory                      : 715 MHz
    Default Applications Clocks
        Graphics                    : 1189 MHz
        Memory                      : 715 MHz
    Max Clocks
        Graphics                    : 1328 MHz
        SM                          : 1328 MHz
        Memory                      : 715 MHz
        Video                       : 1328 MHz
    Max Customer Boost Clocks
        Graphics                    : 1328 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

Server: Docker Engine - Community
 Engine:
  Version: 19.03.8
  API version: 1.40 (minimum version 1.12)
  Go version: go1.12.17
  Git commit: afacb8b
  Built: Wed Mar 11 01:25:42 2020
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: 1.2.13
  GitCommit: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version: 1.0.0-rc10
  GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version: 0.18.0
  GitCommit: fec3683


 - [ ] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`

nvidia-xconfig-latest-440.64.00-1.el7.x86_64
nvidia-persistenced-latest-440.64.00-1.el7.x86_64
libnvidia-container-tools-1.1.1-1.x86_64
nvidia-driver-latest-libs-440.64.00-1.el7.x86_64
libnvidia-container1-1.1.1-1.x86_64
nvidia-driver-latest-NVML-440.64.00-1.el7.x86_64
nvidia-driver-latest-cuda-libs-440.64.00-1.el7.x86_64
nvidia-driver-latest-440.64.00-1.el7.x86_64
nvidia-libXNVCtrl-devel-440.64.00-1.el7.x86_64
nvidia-settings-440.64.00-1.el7.x86_64
nvidia-modprobe-latest-440.64.00-1.el7.x86_64
kmod-nvidia-latest-dkms-440.64.00-1.el7.x86_64
nvidia-driver-latest-cuda-440.64.00-1.el7.x86_64
nvidia-container-toolkit-1.1.2-2.x86_64
nvidia-driver-latest-devel-440.64.00-1.el7.x86_64
yum-plugin-nvidia-0.5-1.el7.noarch
nvidia-driver-latest-NvFBCOpenGL-440.64.00-1.el7.x86_64
nvidia-libXNVCtrl-440.64.00-1.el7.x86_64


 - [ ] NVIDIA container library version from `nvidia-container-cli -V`

version: 1.1.1
build date: 2020-05-19T15:16+0000
build revision: e5d6156aba457559979597c8e3d22c5d8d0622db
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-39)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

aelkayesh commented 4 years ago

Hello @preethamgali

I am facing exactly the same issue, could you please tell me if you found a solution for it?

Thank you

klueska commented 4 years ago

This is highly unexpected, as you can clearly see that libnvidia-container is picking up the correct bus id and storing it internally. Something must be clipping off the top 16 bits before attempting the mount.

Glancing at the code, this could happen if your setup tricked the loop below into thinking you had a 16-bit PCI domain rather than a 32-bit one: https://github.com/NVIDIA/libnvidia-container/blob/v1.1.1/src/nvc_mount.c#L303

It's unclear why this is being triggered though, since you clearly have a 32-bit domain.

klueska commented 4 years ago

What do the contents of /proc/driver/nvidia/gpus/ look like on the host?
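One way to gather that information is a small Python sketch like the one below, which lists the directory and reports how wide the PCI domain field of each entry is. The function name `list_gpu_proc_entries` is hypothetical, and inferring the width from the number of hex digits before the first colon is an assumption about how the entries are named.

```python
import os


def list_gpu_proc_entries(base: str = "/proc/driver/nvidia/gpus") -> dict[str, int]:
    """Map each GPU directory name to the bit width implied by its domain field.

    Assumption: entries are named <domain>:<bus>:<device>.<function>, with
    the domain written as 4 hex digits (16-bit) or 8 hex digits (32-bit).
    """
    widths = {}
    for name in sorted(os.listdir(base)):
        domain = name.split(":", 1)[0]
        widths[name] = len(domain) * 4  # each hex digit encodes 4 bits
    return widths
```

Calling `list_gpu_proc_entries()` on the host and comparing the names against the `Bus Id` values from `nvidia-smi -q` should show whether the driver exposes the full 32-bit domain there.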

klueska commented 4 years ago

Did you get a chance to look into what I asked before?

iceisfun commented 3 years ago

I have the same issue.

Also

find /proc/driver/nvidia/gpus/
/proc/driver/nvidia/gpus/
/proc/driver/nvidia/gpus/0000:01:00.0
/proc/driver/nvidia/gpus/0000:01:00.0/power
/proc/driver/nvidia/gpus/0000:01:00.0/registry
/proc/driver/nvidia/gpus/0000:01:00.0/information
/proc/driver/nvidia/gpus/0000:41:00.0
/proc/driver/nvidia/gpus/0000:41:00.0/power
/proc/driver/nvidia/gpus/0000:41:00.0/registry
/proc/driver/nvidia/gpus/0000:41:00.0/information
/proc/driver/nvidia/gpus/0000:81:00.0
/proc/driver/nvidia/gpus/0000:81:00.0/power
/proc/driver/nvidia/gpus/0000:81:00.0/registry
/proc/driver/nvidia/gpus/0000:81:00.0/information
/proc/driver/nvidia/gpus/0000:c1:00.0
/proc/driver/nvidia/gpus/0000:c1:00.0/power
/proc/driver/nvidia/gpus/0000:c1:00.0/registry
/proc/driver/nvidia/gpus/0000:c1:00.0/information

klueska commented 3 years ago

Since this issue is now a year old and I have not gotten a response to my question about: https://github.com/NVIDIA/nvidia-docker/issues/1316#issuecomment-713941554

I am closing this issue as stale. Please reopen / post a new one if you encounter this problem again and can provide me the info I need to debug it.