junliume opened this issue 3 years ago
@okakarpa could you please help with establishing a nightly Jenkins check on the following condition?

- ~~`MIOPENNAVI21`~~ `FULL_TESTS_NAVI21_OPTIONAL`

Heads-up: #1153 renames `MIOPENNAVI21` to `FULL_TESTS_NAVI21_OPTIONAL`.
I think that, basically, we need to identify the root cause, get rid of it (and remove the workaround), and close this issue only after that. Maybe this is a ROCm problem, so let's wait for the next release, update CI, and see whether we can remove the W/A and close.
`test_conv_3d` fails when the input tensor exceeds 1.34 GiB. This is similar to https://github.com/ROCmSoftwarePlatform/MIOpen/issues/1334#issuecomment-989278146.
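For scale, a back-of-the-envelope check with a hypothetical NCDHW fp32 input (the shape is illustrative, not taken from the failing test) shows how easily a 3D input crosses that threshold:

```
# N=6, C=64, D=64, H=128, W=128, fp32 (4 bytes/element):
# 6*64*64*128*128 elements * 4 bytes = 1610612736 bytes = 1.5 GiB > 1.34 GiB
$ echo "6*64*64*128*128*4/1024/1024/1024" | bc -l
1.50000000000000000000
```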
Regarding returning to `CTEST_PARALLEL_LEVEL=4`: with #1336 I do not see any issues on my local Drakkar machine. However, the same code fails from time to time when run on CI.
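For reference, the parallelism in question is controlled by a standard CTest environment variable; a local run looks like this (`build` is a placeholder for your build directory):

```
$ cd build
$ CTEST_PARALLEL_LEVEL=4 ctest --output-on-failure
```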
On CI machines, we have:

```
$ uname -r
5.4.0-81-generic
$ cat /sys/module/amdgpu/version
5.11.28
$ ls /opt/ -la
lrwxrwxrwx  1 root root   22 Aug 31 19:46 rocm -> /etc/alternatives/rocm
drwxr-xr-x 17 root root 4096 Aug 31 19:45 rocm-4.5.0-8244
```
But on my development machine I see:

```
root@Drakkar:/opt/rocm-4.3.1/.info# uname -r
5.11.0-25-generic
root@Drakkar:/dockerx/github/MIOpen# cat /sys/module/amdgpu/version
5.11.14
root@Drakkar:/dockerx/github/MIOpen# ls -l /opt
lrwxrwxrwx  1 root root   22 Sep  2 14:54 rocm -> /etc/alternatives/rocm
drwxr-xr-x 20 root root 4096 Sep  2 14:54 rocm-4.3.1
```
So the development machine's base ROCm/driver version matches the ROCm version in the container where I run the tests (4.3.1), while on CI the base ROCm/driver is newer (4.5.0). I suspect this is the reason for the CI failures. Let's try returning to 4 threads once our CI is upgraded to 4.5.x.

Another possible reason is that the kernel version is newer on Drakkar, but this is unlikely, I guess.
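A quick way to check for this kind of mismatch on any node (the `.info` directory is visible in the Drakkar prompt above; run the first command on the host and the second inside the container, and the versions should line up):

```
$ cat /sys/module/amdgpu/version
$ cat /opt/rocm/.info/version
```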
@atamazov The Navi21 nodes work better with version rocm-5.0.0-9234.
```
root@ixt-sjc2-11:~# cat /sys/module/amdgpu/version
5.13.8
root@ixt-sjc2-11:~# ls -l /opt
total 12
drwxr-xr-x  4 root root 4096 Dec 15 18:17 amdgpu
drwx--x--x  4 root root 4096 Sep  2 20:30 containerd
lrwxrwxrwx  1 root root   22 Dec 15 18:17 rocm -> /etc/alternatives/rocm
drwxr-xr-x 17 root root 4096 Dec 15 18:17 rocm-5.0.0-9234
root@ixt-sjc2-11:~# apt show rocm-dkms
Package: rocm-dkms
Version: 5.0.0.50000-crdnnh.9234
```
The trial branch has passed all the stages on node ixt-sjc2-11. Can we proceed the same way with all the other nodes across the CI?
@okakarpa I recommend updating the nodes to the most recent released ROCm version and the corresponding kernel driver. 5.0 is not released yet; I would either wait or use 4.5.2.

We must be careful to avoid CI malfunction due to, for example, failures of static checks.
Therefore I suggest the following procedure:

1. Upgrade one node and attach it to a separate Jenkins label (`rocmtest-5.0`, for example).
2. Create a trial branch that runs CI on that label (`wip-rocmtest-5.0-upgrade`; I can do that).
3. Run CI for `wip-rocmtest-5.0-upgrade` and make sure that all tests pass.
4. Upgrade the remaining nodes and test them with `wip-rocmtest-5.0-upgrade`.
5. Remove the temporary label (`rocmtest-5.0`).
6. Merge `wip-rocmtest-5.0-upgrade` into `develop` and make sure it passes CI.
7. Merge `develop` into all trial branches and test all nodes (see the sketch below).
8. Ask developers to merge `develop` into their development branches.

/cc @junliume @JehandadKhan @pfultz2
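For steps 6-8, the mechanics are the usual merge-and-push (assuming `origin` as the remote name; shown here for step 7 on the trial branch):

```
$ git checkout wip-rocmtest-5.0-upgrade
$ git merge develop
$ git push origin wip-rocmtest-5.0-upgrade   # CI re-runs all stages on push
```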
@jbakhrai for info
[Off-topic] `ixt-sjc2-11` was upgraded to the ROCm 5.0 RC and now shows unstable behavior. Let's monitor it and disable it if it spoils the CI pipeline on a regular basis.
[Off-topic] `ixt-sjc2-11` disabled. See https://github.com/ROCmSoftwarePlatform/frameworks-internal/issues/470.
@junliume Are these issues still reproducible with the latest ROCm 6.0.2 (HIP 6.0.32831)? Thanks!
This issue is still reproducible with the ROCm 6.0.0 docker and the 5.2.3 base driver.
[Issue] Navi21 nodes have been unstable for MIOpen CI (ref: #1147). In order to promote efficiency, they are temporarily removed from the Full Test stages as per #1135.
[Investigation]
The problem seems to be caused by running multiple ctests in parallel on Navi21 nodes. #1168 has done a few tests, and here are the findings:

- `CTEST_PARALLEL_LEVEL=2` is a balance between stability and the stage walltime limitation.
- The following tests need to be run serially: `test_conv_3d`, `test_conv_group`, `test_conv_extra`, `test_conv_for_implicit_gemm`, `test_soft_max`.
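A sketch for exercising these suspects locally under the reduced concurrency (assumes a configured MIOpen build directory, here called `build`):

```
$ cd build
$ CTEST_PARALLEL_LEVEL=2 ctest --output-on-failure \
    -R 'test_conv_3d|test_conv_group|test_conv_extra|test_conv_for_implicit_gemm|test_soft_max'
```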
[Update 09/29] The following list of tasks is outdated.

Tasks with the #1135 workaround in place:
Even the smoke test is not very stable for gfx1030: http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/ddu%2Fbnorm-fwd-checknumerics/6/pipeline
[Update 03/21] Enforce serial execution of `test_conv_ck_igemm_fwd_v6r1_dlops_nchw`. The root cause of these tests requiring a serial run still needs to be identified.