ComputationalRadiationPhysics / jungfrau-photoncounter

Conversion of Jungfrau pixel detector data to photon count rate
GNU General Public License v3.0

Error correction and detection on NVIDIA Tesla GPUs #36

Closed lopez-c closed 6 years ago

lopez-c commented 6 years ago

In our application, single-bit errors are not easy to detect and may lead to gross errors. AFAIK, Tesla GPUs can correct single-bit errors and detect double-bit errors, so understanding how often memory errors occur and how the application should deal with them is of primary importance.

As a first stage, reporting the number of errors detected during the run as debug information should be enough to know how often such errors occur. Later on, we should agree on the actions to be taken when they are detected.

TheFl0w commented 6 years ago

This is what I found out about ECC on NVIDIA GPUs: The latest GPU models have ECC RAM similar to the RAM on a host system. Older GPU architectures emulate ECC with a software-based solution. In any case, errors are handled not by the programmer but by either the runtime or the hardware itself. The programmer is only notified when an uncorrectable error occurs.
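For the debug output requested above, one option is to read the driver's ECC counters before and after a run and report the difference. The following is only a minimal sketch, not something this project ships: it assumes `nvidia-smi` is on the `PATH`, that the GPU exposes the `ecc.*` query fields, and the helper names (`parse_ecc_csv`, `query_ecc_counts`, `ecc_delta`) are made up for illustration. The volatile counters cover errors since the last driver load, so a before/after difference approximates the errors seen during one run.

```python
import subprocess

# nvidia-smi query fields; the "volatile" counters cover errors
# since the last driver load.
FIELDS = (
    "index",
    "ecc.mode.current",
    "ecc.errors.corrected.volatile.total",
    "ecc.errors.uncorrected.volatile.total",
)


def parse_ecc_csv(text):
    """Parse `nvidia-smi --format=csv,noheader` output into one dict per GPU index."""
    gpus = {}
    for line in text.strip().splitlines():
        cells = [cell.strip() for cell in line.split(",")]
        if len(cells) != len(FIELDS):
            continue  # skip lines that do not match the expected layout
        index, mode, corrected, uncorrected = cells
        gpus[int(index)] = {
            "ecc_enabled": mode == "Enabled",
            # Non-numeric values such as "[N/A]" (ECC off or unsupported) become None.
            "corrected": int(corrected) if corrected.isdigit() else None,
            "uncorrected": int(uncorrected) if uncorrected.isdigit() else None,
        }
    return gpus


def query_ecc_counts():
    """Take a snapshot of the ECC counters of all visible GPUs (hypothetical helper)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_ecc_csv(result.stdout)


def ecc_delta(before, after):
    """Return the errors that appeared between two snapshots, per GPU index."""
    delta = {}
    for index, old in before.items():
        new = after.get(index)
        if new is None:
            continue
        values = (old["corrected"], old["uncorrected"], new["corrected"], new["uncorrected"])
        if None in values:
            continue  # counters unavailable on this GPU
        delta[index] = {
            "corrected": new["corrected"] - old["corrected"],
            "uncorrected": new["uncorrected"] - old["uncorrected"],
        }
    return delta


if __name__ == "__main__":
    before = query_ecc_counts()
    # ... run the photon counting pipeline here ...
    after = query_ecc_counts()
    for index, counts in ecc_delta(before, after).items():
        print(f"GPU {index}: {counts['corrected']} corrected, "
              f"{counts['uncorrected']} uncorrectable ECC errors during the run")
```

Running this as a wrapper around the application keeps the counting out of the kernels entirely. A non-zero `uncorrected` delta would be the case where the results of the run should not be trusted; what to do then is the decision still open in this issue.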