ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

[CI] Testing must fail when GPU is required but not found. #1102

Open atamazov opened 3 years ago

atamazov commented 3 years ago

This issue was introduced by #1081, which was intended to fix https://ontrack-internal.amd.com/browse/SWDEV-297881.


Now we must make sure that our CI fails when a GPU is required by the Jenkins stage but is not found. Why: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1081#issuecomment-902877375

Originally posted by @atamazov in https://github.com/ROCmSoftwarePlatform/MIOpen/issues/1080#issuecomment-902878807

atamazov commented 3 years ago

Note that we often see problems with clinfo (https://github.com/ROCmSoftwarePlatform/frameworks-internal/issues/218). In all such cases tests must fail.

atamazov commented 3 years ago

[Informative] Related review discussion: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1024#discussion_r667070823

atamazov commented 3 years ago

We can modify the Jenkinsfile to resolve the problem. @shurale-nkn, what do you think?

junliume commented 3 years ago

When a GPU is not found but is required by the Jenkins stage, there are two cases:

  1. The GPU is there but ROCm fails to detect it, e.g. gfx90a with ROCm 4.2. In this case, it should fall into the "TESTING IS NOT SUPPORTED FOR THE DETECTED GPU" branch and fail with "FATAL: GPU DETECTION FAILED DURING CMAKE PHASE, CHECK CMAKE WARNINGS". We have observed this in previous CI failures.

  2. The GPU is ... not ... there? Indeed, it will fall into the "ROCk module is NOT loaded, possibly no GPU devices" branch and just continue without failing (which is the problem stated here). But why would this case happen? Why would Jenkins send a job to a GPU-less server when explicitly targeting a labeled node?

Are there more complex cases in category 1 above?

atamazov commented 3 years ago

The current implementation is such that if rocminfo prints "no GPU devices" to the console, then most of the tests are skipped and the CI stage does not fail, regardless of the reason. I don't see a way to enumerate all the problems that may lead to rocminfo printing "no GPU devices".
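A minimal sketch of the stricter behavior requested in this issue, not MIOpen's actual CI code. It assumes the stage has captured the rocminfo output as text. The helper names `gpu_detected` and `stage_exit_code` and the `gpu_required` flag are hypothetical; only the "no GPU devices" marker is taken from this thread.

```python
# Marker quoted in this thread; the exact rocminfo wording is an assumption
# and may differ between ROCm releases.
NO_GPU_MARKER = "no GPU devices"


def gpu_detected(rocminfo_output: str) -> bool:
    """Return False when the output is empty or reports no GPU devices."""
    if not rocminfo_output.strip():
        # No output at all is treated as a detection failure, not as success.
        return False
    return NO_GPU_MARKER not in rocminfo_output


def stage_exit_code(rocminfo_output: str, gpu_required: bool) -> int:
    """Hypothetical gate: non-zero when a GPU is required but was not detected."""
    if gpu_required and not gpu_detected(rocminfo_output):
        return 1
    return 0
```

The point of the sketch is that the reason for the missing GPU does not need to be known: a stage that declares `gpu_required=True` fails whenever detection does not succeed, while stages that do not need a GPU keep the current skip-and-continue behavior.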

I know for sure that GPU failures happened many times in the past (clinfo was unable to detect any GPU). In those cases clinfo prints "Number of devices: 0" and then returns code 143. Unfortunately, I do not remember what rocminfo printed to the console in those cases.
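The clinfo symptom described above could be checked along these lines. This is only a sketch: the function name `clinfo_reports_gpu` is hypothetical, and it assumes the caller has already captured clinfo's text output and return code. The "Number of devices:" label is the one quoted in this comment.

```python
import re

# Label quoted in this thread; whitespace after the colon is allowed to vary.
DEVICE_COUNT_RE = re.compile(r"Number of devices:\s*(\d+)")


def clinfo_reports_gpu(output: str, return_code: int) -> bool:
    """True only if clinfo exited cleanly and reported at least one device."""
    if return_code != 0:
        # Any non-zero return code (such as the 143 mentioned above) is a failure.
        return False
    counts = [int(n) for n in DEVICE_COUNT_RE.findall(output)]
    # Missing or all-zero device counts are both treated as "no GPU".
    return any(count > 0 for count in counts)
```

Requiring positive evidence (a clean exit and a non-zero device count) avoids having to enumerate every way in which detection can go wrong.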

> Why would Jenkins send a job to a GPU-less server when explicitly targeting a labeled node?

The simplest theoretical example is a GPU malfunction that makes the GPU "invisible" to rocminfo and requires power cycling to recover.

> 1. The GPU is there but ROCm fails to detect it, e.g. gfx90a with ROCm 4.2. In this case, it should fall into the "TESTING IS NOT SUPPORTED FOR THE DETECTED GPU" branch and fail with "FATAL: GPU DETECTION FAILED DURING CMAKE PHASE, CHECK CMAKE WARNINGS". We have observed this in previous CI failures.

[Informative] I don't know what rocminfo prints to the console when it sees an unsupported (too new or too old) type of GPU, such as gfx90a with ROCm 4.2. CMake prints "TESTING IS NOT SUPPORTED FOR THE DETECTED GPU" when rocminfo is able to detect the GPU, but the GPU name it prints to the console is not one that our tests support.
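The distinction between "no GPU found" and "GPU found but not supported by the tests" can be sketched as follows. This is an illustration only: the `Detection` enum, the `classify` and `must_fail` helpers, and the contents of `SUPPORTED_TARGETS` are all hypothetical and do not reflect the actual list used by MIOpen's CMake scripts.

```python
from enum import Enum
from typing import Optional


class Detection(Enum):
    SUPPORTED = "supported"      # GPU name obtained and covered by the tests
    UNSUPPORTED = "unsupported"  # GPU name obtained, but tests do not cover it
    NOT_FOUND = "not found"      # no GPU name could be obtained at all


# Hypothetical allow-list, for illustration only.
SUPPORTED_TARGETS = frozenset({"gfx906", "gfx908"})


def classify(gpu_name: Optional[str]) -> Detection:
    """Map the GPU name reported during detection (or None) to an outcome."""
    if not gpu_name:
        return Detection.NOT_FOUND
    if gpu_name in SUPPORTED_TARGETS:
        return Detection.SUPPORTED
    return Detection.UNSUPPORTED


def must_fail(detection: Detection, gpu_required: bool) -> bool:
    """A stage that requires a GPU fails for every outcome except SUPPORTED."""
    return gpu_required and detection is not Detection.SUPPORTED
```

Under this sketch, both the unsupported-GPU case and the missing-GPU case fail a stage that requires a GPU, which is the behavior this issue asks for.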