Error correction and detection on NVIDIA Tesla GPUs

This is what I found out about ECC on NVIDIA GPUs: the latest GPU models have ECC RAM similar to the RAM in a host system, while older GPU architectures emulate ECC with a software-based solution. In either case, errors are handled by the runtime or the hardware itself rather than by the programmer, who is only notified when an uncorrectable error occurs.

- GPUs are protected by a Single-Error Correct Double-Error Detect (SECDED) ECC code
- Pascal architecture (e.g. P100):
  - HBM2 DRAM offers native support for ECC in hardware
  - ECC does not impact performance
- Kepler architecture (e.g. K80):
  - software ECC can be enabled in the driver
  - enabling it reduces usable memory by ~6.25%
  - effective bandwidth is reduced by up to 20%
- ECC can be turned on/off through the NVIDIA driver tools (nvidia-smi)
- single-bit errors are corrected automatically by the hardware / runtime
- uncorrectable errors result in a CUDA runtime error, and the error code is returned by CUDA functions
- the GPU can be re-initialized after an uncorrectable error, but it will lose all memory allocations etc.
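
To make the last points concrete, here is a rough sketch of how host code might notice an uncorrectable ECC error and re-initialize the GPU. It is only an illustration under some assumptions, not code from this repository: device index 0, the `scale` kernel and the `isUncorrectableEcc` helper are made up for the example. The calls themselves (`cudaGetDeviceProperties`, `cudaDeviceSynchronize`, `cudaDeviceReset`) and the error code `cudaErrorECCUncorrectable` come from the CUDA runtime API. Whether a reset is enough to recover in practice probably depends on the driver and the failure, so treat the recovery path as a starting point.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: true if a CUDA call reported an uncorrectable ECC error.
static bool isUncorrectableEcc(cudaError_t err) {
    return err == cudaErrorECCUncorrectable;
}

// Placeholder kernel standing in for the real workload (assumption).
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int device = 0;  // assumption: use the first GPU

    // Report whether ECC is currently enabled on the device.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cannot query device: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }
    std::printf("%s: ECC %s\n", prop.name, prop.ECCEnabled ? "enabled" : "disabled");

    const int n = 1 << 20;
    float* d_data = nullptr;
    err = cudaMalloc(&d_data, n * sizeof(float));
    if (err == cudaSuccess) err = cudaMemset(d_data, 0, n * sizeof(float));
    if (err != cudaSuccess) {
        std::fprintf(stderr, "setup failed: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    err = cudaGetLastError();                               // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors

    if (isUncorrectableEcc(err)) {
        std::fprintf(stderr, "uncorrectable ECC error: %s\n", cudaGetErrorString(err));
        // Resetting destroys all state of this process on the device,
        // so d_data is no longer valid and must not be freed or reused.
        cudaDeviceReset();
        d_data = nullptr;
        // A real application would re-allocate, re-upload and retry here.
        return EXIT_FAILURE;
    }
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(d_data);
        return EXIT_FAILURE;
    }

    cudaFree(d_data);
    return EXIT_SUCCESS;
}
```

Corrected single-bit errors never show up in this code path. Only the uncorrectable case reaches the application, which matches the behaviour described above.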
