A DBE (double-bit ECC error) is mostly a random event. Were you able to reproduce this issue multiple times?
Yes, roughly one out of every 10 instances launched on AWS hits this error. The error occurs after the instance is launched and after Python code in a CUDA container starts.
I've been told this is a known (but rare) issue with 390.30. You can try 384.111 until it's fixed:
sudo apt-get install --no-install-recommends nvidia-384 libcuda1-384
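If you do downgrade, it may also be worth holding the packages and double-checking which driver the kernel actually loaded afterwards; a minimal sketch, assuming an Ubuntu host with apt and the package names from the command above:

```sh
# Hold the downgraded packages so unattended upgrades don't pull 390.30 back in
sudo apt-mark hold nvidia-384 libcuda1-384

# Confirm the driver version the loaded kernel module reports (after a reboot)
nvidia-smi --query-gpu=driver_version,name --format=csv
```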
@flx42: I've seen the same problem using K80s that reside on in-house servers. You mention that this is a known issue with 390.30 - can you please provide more details or a link to the official bug report?
Hi @flx42! If it's possible, I'd also like to see a link to an official bug report somewhere. We're running into similar problems: a small number of p2.xlarge nodes on AWS fail with CUDA_ERROR_ECC_UNCORRECTABLE when we initialize TensorFlow:

2018-03-29 14:40:22.437384: E tensorflow/core/common_runtime/direct_session.cc:167] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_ECC_UNCORRECTABLE

But I'd like to confirm that this has the same underlying root cause as the 390.30 issue.
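In case it helps others triage: before TensorFlow touches the device, you can ask the driver directly whether the GPU already carries uncorrectable ECC errors. A minimal sketch, assuming nvidia-smi is available on the host or inside a container started with the nvidia runtime:

```sh
# Check whether the GPU already reports uncorrectable (double-bit) ECC errors
# before TensorFlow initializes it; a non-zero count explains the failure.
nvidia-smi --query-gpu=index,name,ecc.mode.current,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.total --format=csv

# Full per-GPU ECC breakdown by memory location
nvidia-smi -q -d ECC
```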
We are running into the same CUDA_ERROR_ECC_UNCORRECTABLE problem with the NVIDIA 390.12 driver when deploying https://github.com/mapbox/robosat onto AWS p2.xlarge instances using the nvidia container runtime.
> I've been told this is a known (but rare) issue with 390.30.

We are hitting this issue multiple times a week; it's not rare, at least not for us.
Any updates on this @flx42? Any links to a ticket, bug report, or a timeline for a fix? Downgrading drivers is a workaround at best and comes with its own problems.
Since the original bug report, new driver versions have been released; you should use 396.37.
We are also running into the same problem reported above with the NVIDIA 390.12 driver, in our case on driver 390.46.
As a workaround for this issue, when baking the AMI I changed the following:
sudo nvidia-smi -e 0       ## disable ECC on all GPUs (AWS Tesla K80 issue)
sudo nvidia-smi -e 0 -i 0  ## disable ECC on GPU 0 only, for a single-GPU instance
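One caveat with this workaround: as far as I know the ECC mode change is only pending until the GPU is reset or the instance reboots, so it's worth verifying the current vs. pending state when baking the AMI. A quick check, assuming a recent nvidia-smi:

```sh
# Current vs. pending ECC mode; "pending" flips immediately, "current" only after a reset/reboot
nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending --format=csv

# Reboot so the pending ECC mode becomes current before the AMI is snapshotted
sudo reboot
```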
sudo nvidia-smi -r -i 0 (replace 0 with your device ID) may fix this issue
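A reset clears the volatile counters, but if a board keeps throwing DBEs it will also start retiring memory pages, which is a decent signal that the K80 itself is bad and the instance should simply be replaced. A sketch for checking that, assuming nvidia-smi on the host:

```sh
# Pages retired due to double-bit errors; a growing count suggests failing hardware
nvidia-smi -q -d PAGE_RETIREMENT -i 0

# Retired-page and aggregate uncorrectable-error counts for the same GPU
nvidia-smi --query-gpu=retired_pages.double_bit.count,ecc.errors.uncorrected.aggregate.total --format=csv -i 0
```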
I am using Ubuntu 16.04 on AWS; the instance type is p2.xlarge.
NVIDIA driver:
This is the error I get after running the docker command with the nvidia runtime:
I get a driver error on ECC.
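For what it's worth, a minimal way to exercise the nvidia runtime and driver without any framework in the loop (the CUDA image tag here is just an example, not necessarily the one you use):

```sh
# Smoke-test the nvidia runtime: if this fails with an ECC error,
# the problem is in the driver/GPU, not in the application container.
docker run --rm --runtime=nvidia nvidia/cuda:9.0-base nvidia-smi
```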