ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

[TESTS][Navi21] Unstable behavior with CTEST_PARALLEL_LEVEL=4 #1148

Open junliume opened 3 years ago

junliume commented 3 years ago

[Issue] Navi21 nodes have been unstable in the MIOpen CI (ref: #1147). To keep CI efficient, they have been temporarily removed from the Full Test stages as per #1135.

[Investigation]

The problem seems to be caused by running multiple ctests in parallel on Navi21 nodes.
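
For reference, CTest picks the number of concurrently running tests from the CTEST_PARALLEL_LEVEL environment variable (or the equivalent ctest -j option). A minimal sketch, assuming an already-built MIOpen build directory (the path is a placeholder), of comparing the unstable parallel mode against a serial run:

cd /path/to/MIOpen/build   # placeholder: your MIOpen build directory

# Run the suite with 4 tests in flight, i.e. the configuration that is unstable on Navi21.
CTEST_PARALLEL_LEVEL=4 ctest --output-on-failure

# Re-run strictly serially to check whether the failures go away.
ctest -j 1 --output-on-failure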


[Update 09/29] the following list of tasks is outdated.

Tasks with #1135 Workaround In Place:

Even the smoke test is not very stable for gfx1030: http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/ddu%2Fbnorm-fwd-checknumerics/6/pipeline


[Update 03/21] Enforce serial execution of test_conv_ck_igemm_fwd_v6r1_dlops_nchw. The root cause that forces these tests to run serially still needs to be identified.
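
While the root cause is unknown, the workaround can be approximated from the command line by excluding the offending test from the parallel pass and then running it alone; a sketch only (MIOpen's CI may implement the serialization differently, e.g. via CTest's RUN_SERIAL test property):

# Parallel pass for everything except the problematic test (-E excludes by regex).
CTEST_PARALLEL_LEVEL=4 ctest -E test_conv_ck_igemm_fwd_v6r1_dlops_nchw --output-on-failure

# Then run only that test (-R selects by regex), with no concurrency.
ctest -R test_conv_ck_igemm_fwd_v6r1_dlops_nchw -j 1 --output-on-failure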

junliume commented 3 years ago

@okakarpa could you please help with establishing a nightly Jenkins check under the following conditions?

atamazov commented 3 years ago

Heads-up: #1153 renames MIOPENNAVI21 to FULL_TESTS_NAVI21_OPTIONAL.

atamazov commented 3 years ago

#1168 is a workaround.

atamazov commented 3 years ago

I think that, basically, we need to identify the root cause, get rid of it (and remove the W/A), and only close this issue after that. Maybe this is a ROCm problem, so let's wait for the next release, update CI, and see if we can remove the W/A and close.

atamazov commented 2 years ago

test_conv_3d fails when the input tensor exceeds 1.34 GiB. This is similar to https://github.com/ROCmSoftwarePlatform/MIOpen/issues/1334#issuecomment-989278146
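
For a rough sense of scale: a dense NCDHW float32 input occupies element count times 4 bytes, so crossing the ~1.34 GiB mark does not require an unusual 3D shape. A sketch of the arithmetic (the shape below is illustrative, not taken from the failing test):

# Bytes occupied by an N x C x D x H x W float32 tensor (4 bytes per element).
N=16 C=32 D=64 H=128 W=128
echo $(( N * C * D * H * W * 4 ))   # 2147483648 bytes = 2 GiB, above the ~1.34 GiB threshold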

atamazov commented 2 years ago

Regarding returning to CTEST_PARALLEL_LEVEL=4: with #1336 I do not see any issues on my local Drakkar machine. However, the same code fails from time to time when run on CI.

On CI machines, we have

$ uname -r
5.4.0-81-generic
$ cat /sys/module/amdgpu/version
5.11.28
$ ls /opt/ -la
lrwxrwxrwx  1 root root   22 Aug 31 19:46 rocm -> /etc/alternatives/rocm
drwxr-xr-x 17 root root 4096 Aug 31 19:45 rocm-4.5.0-8244

But on my development machine I see

root@Drakkar:/opt/rocm-4.3.1/.info# uname -r
5.11.0-25-generic
root@Drakkar:/dockerx/github/MIOpen# cat /sys/module/amdgpu/version
5.11.14
root@Drakkar:/dockerx/github/MIOpen# ls -l /opt
lrwxrwxrwx  1 root root   22 Sep  2 14:54 rocm -> /etc/alternatives/rocm
drwxr-xr-x 20 root root 4096 Sep  2 14:54 rocm-4.3.1

So the development machine has base ROCm/driver versions that match the ROCm version in the container where I run the tests (4.3.1). But on CI, the base ROCm/driver is newer (4.5.0).

I suspect this is the reason for the CI failures. Let's try to return to 4 threads when our CI is upgraded to 4.5.x.

Another possible reason is that the kernel version is newer on Drakkar, but that is unlikely, I guess.
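
A quick way to check for this kind of mismatch is to compare the host kernel/driver/base install against the ROCm release inside the test container; a sketch using the same files quoted above (the container name is a placeholder, and the .info/version layout may vary between ROCm releases):

# Host side: kernel, amdgpu driver, and base ROCm install.
uname -r
cat /sys/module/amdgpu/version
ls -l /opt | grep rocm

# Container side: the ROCm release the tests are actually built and run against.
docker exec <miopen_ci_container> cat /opt/rocm/.info/version   # placeholder container name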

okakarpa commented 2 years ago

@atamazov The Navi21 nodes work better with version rocm-5.0.0-9234.

root@ixt-sjc2-11:~# cat /sys/module/amdgpu/version
5.13.8
root@ixt-sjc2-11:~# ls -l /opt
total 12
drwxr-xr-x  4 root root 4096 Dec 15 18:17 amdgpu
drwx--x--x  4 root root 4096 Sep  2 20:30 containerd
lrwxrwxrwx  1 root root   22 Dec 15 18:17 rocm -> /etc/alternatives/rocm
drwxr-xr-x 17 root root 4096 Dec 15 18:17 rocm-5.0.0-9234

root@ixt-sjc2-11:~# apt show rocm-dkms
Package: rocm-dkms
Version: 5.0.0.50000-crdnnh.9234

The trial branch has passed all the stages on node ixt-sjc2-11:

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/wip-rocmtest-trial-navi21/12/pipeline/67

Can we proceed the same way with all the other nodes across the CI?

atamazov commented 2 years ago

@okakarpa I recommend updating the nodes to the most recently released ROCm version and the corresponding kernel driver. 5.0 is not released yet, so I would either wait or use 4.5.2.

We must be careful to avoid CI malfunction due to, for example, failures of static checks.

Proposed plan:

/cc @junliume @JehandadKhan @pfultz2

JehandadKhan commented 2 years ago

@jbakhrai for info

atamazov commented 2 years ago

[Off-topic] ixt-sjc2-11 was upgraded to the ROCm 5.0 RC and now shows unstable behavior. Let's monitor it and disable it if it spoils the CI pipeline on a regular basis.

atamazov commented 2 years ago

[Off-topic] ixt-sjc2-11 disabled. See https://github.com/ROCmSoftwarePlatform/frameworks-internal/issues/470.

ppanchad-amd commented 7 months ago

@junliume Are these issues still reproducible with the latest ROCm 6.0.2 (HIP 6.0.32831)? Thanks!

atamazov commented 7 months ago

This issue is still reproducible with the 6.0.0 docker and the 5.2.3 base driver.
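
For reference, a minimal sketch of the kind of setup used for such a check, assuming a stock ROCm development image (the image tag and the build/test steps inside the container are illustrative):

# Start a ROCm 6.0 container with GPU access; the base driver on the host stays at 5.2.3.
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
    rocm/dev-ubuntu-22.04:6.0 bash

# Inside the container (MIOpen build steps omitted), re-run the suite with 4-way parallelism.
CTEST_PARALLEL_LEVEL=4 ctest --output-on-failure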