Closed by lopez-c 6 years ago
This is what I found out about ECC on NVIDIA GPUs: the latest GPU models have ECC RAM similar to the RAM on a host system. Older GPU architectures emulate ECC with a software-based solution. In either case, errors are handled not by the programmer but by the runtime or the hardware itself. The programmer is only notified when an uncorrectable error occurs.
GPUs are protected by a Single‐Error Correct Double‐Error Detect (SECDED) ECC code
Pascal architecture (e.g. P100):
Kepler architecture (e.g. K80):
ECC can be turned on/off through the NVIDIA driver tools (nvidia-smi)
Single-bit errors are corrected automatically by the hardware / runtime
Uncorrectable errors result in a CUDA runtime error; the error code is returned by the CUDA functions
The GPU can be re-initialized after an uncorrectable error, but it will lose all memory allocations etc.
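To make the last two points concrete, here is a minimal sketch of a retry loop around a GPU computation. It assumes a hypothetical wrapper that raises `UncorrectableEccError` when a CUDA call reports an uncorrectable ECC error (in the CUDA runtime API the corresponding code is, as far as I know, `cudaErrorECCUncorrectable`). The `setup` and `compute` callables are placeholders, not part of our code base. In a real program `setup` would also have to re-initialize the device (for example via `cudaDeviceReset`) before re-creating the allocations.

```python
class UncorrectableEccError(RuntimeError):
    """Hypothetical exception raised by a GPU wrapper on an uncorrectable ECC error."""


def run_with_reinit(setup, compute, max_retries=1):
    """Run compute(state); on an uncorrectable ECC error, rebuild the state and retry.

    setup   -- hypothetical callable that (re-)initializes the GPU and returns
               fresh state (allocations, uploaded input data, ...)
    compute -- hypothetical callable that runs the GPU work on that state
    """
    state = setup()
    for attempt in range(max_retries + 1):
        try:
            return compute(state)
        except UncorrectableEccError:
            if attempt == max_retries:
                # Give up and let the caller decide what to do.
                raise
            # All allocations are lost after re-initialization,
            # so the state has to be rebuilt from scratch.
            state = setup()
```

Whether a retry is acceptable at all, or whether the run should be aborted and flagged instead, is exactly the policy question that still has to be agreed on.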
In our application, single-bit errors are not easy to detect and may lead to gross errors. AFAIK, Tesla GPUs are able to correct single-bit errors and detect double-bit errors, so understanding how often memory errors occur and how the application should deal with them is of primary importance.
As a first stage, reporting the number of errors detected during the run as debug information should be enough to know how often these errors occur. Later on, we should agree on the actions to be taken when such errors are detected.
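One possible way to get these numbers without touching the GPU code is to read the driver's ECC counters before and after the run and report the difference. The sketch below assumes `nvidia-smi` is on the `PATH` and uses its `--query-gpu` fields `ecc.errors.corrected.volatile.total` and `ecc.errors.uncorrected.volatile.total`. The function names are made up for illustration. Fields that are not numeric (for example `[N/A]` when ECC is disabled or unsupported) are reported as `None`.

```python
import subprocess

QUERY_FIELDS = (
    "index",
    "ecc.errors.corrected.volatile.total",
    "ecc.errors.uncorrected.volatile.total",
)


def parse_ecc_counts(csv_text):
    """Parse `nvidia-smi --format=csv,noheader` output.

    Returns {gpu_index: (corrected, uncorrected)}; a count is None when
    the driver reports a non-numeric value such as "[N/A]".
    """
    counts = {}
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue  # skip lines that do not match the expected layout
        index, corrected, uncorrected = fields
        counts[int(index)] = tuple(
            int(v) if v.isdigit() else None for v in (corrected, uncorrected)
        )
    return counts


def read_ecc_counts():
    """Query the current volatile ECC counters of all GPUs via nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_ecc_counts(result.stdout)


def ecc_delta(before, after):
    """Errors counted between two snapshots, per GPU (None if unavailable)."""
    delta = {}
    for gpu, (c1, u1) in after.items():
        c0, u0 = before.get(gpu, (None, None))
        if None in (c0, u0, c1, u1):
            delta[gpu] = None
        else:
            delta[gpu] = (c1 - c0, u1 - u0)
    return delta
```

Calling `read_ecc_counts()` once at start-up and once at shutdown and logging `ecc_delta(before, after)` would give the per-run numbers. The volatile counters are reset when the driver is reloaded, so the snapshots are only comparable within one driver session.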