NVIDIA / nvidia-container-runtime

NVIDIA container runtime

When running a CUDA container with v2 I get an ECC error from the GPU driver #17

Closed omerh closed 6 years ago

omerh commented 6 years ago

I am using Ubuntu 16.04 on AWS; the instance type is p2.xlarge.

NVIDIA driver:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   45C    P0    69W / 149W |  10450MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3701      C   python                                     10436MiB |
+-----------------------------------------------------------------------------+

This is the error I get after running the docker command with the nvidia runtime; the driver reports an uncorrectable ECC error:

[  374.665347] NVRM: Xid (PCI:0000:00:1e): 48, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 0, subpartition 1.
[  374.675104] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000010, engmsk 00000100
[  374.695751] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000010, engmsk 00000100
[  374.716755] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000011, engmsk 00000100
[  374.737608] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000012, engmsk 00000100
[  374.758500] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000013, engmsk 00000100
[  374.779381] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000014, engmsk 00000100
[  374.800005] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000015, engmsk 00000100
[  374.820647] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000016, engmsk 00000100
[  374.838173] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000017, engmsk 00000100
[  374.860733] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000011, engmsk 00000100
[  374.877895] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000012, engmsk 00000100
[  374.953751] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000013, engmsk 00000100
[  374.974460] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000014, engmsk 00000100
[  374.995350] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000015, engmsk 00000100
[  375.017168] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000016, engmsk 00000100
[  375.038242] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000017, engmsk 00000100
[  376.421044] NVRM: Xid (PCI:0000:00:1e): 64, Dynamic Page Retirement: Page is already pending retirement, reboot to retire page (0x00000000002ce857).
[  376.433349] NVRM: Xid (PCI:0000:00:1e): 64, Dynamic Page Retirement: Page is already pending retirement, reboot to retire page (0x00000000002ce857).
[  376.450507] NVRM: Xid (PCI:0000:00:1e): 63, Dynamic Page Retirement: New page retired, reboot to activate (0x00000000002ce857).
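
The Xid 63/64 messages above refer to dynamic page retirement. As a minimal diagnostic sketch (assuming a driver/nvidia-smi version that supports the PAGE_RETIREMENT query), the pending and retired page counts for GPU 0 can be listed with:

nvidia-smi -q -d PAGE_RETIREMENT -i 0
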
flx42 commented 6 years ago

A DBE is mostly a random event. Were you able to reproduce this issue multiple times?

omerh commented 6 years ago

Yes, roughly one in every 10 instances launched on AWS hits this error.

The error occurs after the instance is launched, once the Python code in the CUDA container starts.

flx42 commented 6 years ago

I've been told this is a known (but rare) issue with 390.30. You can try 384.111 until it's fixed:

sudo apt-get install --no-install-recommends nvidia-384 libcuda1-384
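
After the downgrade and a reboot, a quick sanity check (a sketch using nvidia-smi's standard query flags) confirms which driver version is actually loaded:

nvidia-smi --query-gpu=driver_version --format=csv,noheader
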
andrew-hardin commented 6 years ago

@flx42: I've seen the same problem using K80s that reside on in-house servers. You mention that this is a known issue with 390.30 - can you please provide more details or a link to the official bug report?

alanhdu commented 6 years ago

Hi @flx42! If possible, I'd also like to see a link to an official bug report somewhere. We're running into similar problems: a small number of p2.xlarge nodes on AWS fail with CUDA_ERROR_ECC_UNCORRECTABLE when we initialize TensorFlow:

2018-03-29 14:40:22.437384: E tensorflow/core/common_runtime/direct_session.cc:167] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_ECC_UNCORRECTABLE

But I'd like to confirm that this is the same underlying root cause with 390.30.
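
As a hedged sketch (assuming nvidia-smi's ECC query fields are available on these drivers), dumping the volatile ECC counters right after such a failure helps distinguish a real uncorrected ECC event from a driver-side problem:

nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total --format=csv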

daniel-j-h commented 6 years ago

We are running into the same CUDA_ERROR_ECC_UNCORRECTABLE problem with the NVIDIA 390.12 driver when deploying https://github.com/mapbox/robosat onto AWS p2.xlarge instances using the nvidia container runtime.

"I've been told this is a known (but rare) issue with 390.30."

We are hitting this issue multiple times a week; it is not rare, at least not for us.

Any updates on this @flx42? Any links to a ticket, bug report, or a timeline for a fix? Downgrading drivers is a workaround at best and comes with its own problems.

flx42 commented 6 years ago

Since the original bug report, new driver versions have been released; you should use 396.37.
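
The install would mirror the 384 example above; the exact package names are an assumption here and may differ depending on which repository provides the driver:

sudo apt-get install --no-install-recommends nvidia-396 libcuda1-396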

crazyleeyang commented 6 years ago

We are also running into the same problem, with NVIDIA driver 390.46.

omerh commented 6 years ago

As a workaround for this issue, when building the AMI I changed the following:

sudo nvidia-smi -e 0        ## AWS Tesla K80 issues: disable ECC on all GPUs
sudo nvidia-smi -e 0 -i 0   ## AWS Tesla K80 issues: disable ECC on a single GPU (index 0)
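
Note that an ECC mode change only takes effect after a reboot (or GPU reset). As a small sketch using nvidia-smi's standard query fields, the current and pending ECC modes can be verified with:

nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending --format=csv
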
lovejoy commented 5 years ago

nvidia-smi -r -i 0 (where 0 is your device ID) may fix this issue by resetting the GPU.
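
One caveat, as a hedged aside: the GPU reset generally succeeds only when no processes are using the device, so it is worth checking for active compute processes first:

nvidia-smi -i 0 --query-compute-apps=pid,process_name --format=csv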