abouteiller closed 4 months ago
There are 2 different failure modes:
I pushed some commits to PR #642 to handle the lack of a device more gracefully, both in the test and in the runtime system.
However, the test still fails in the CI, since there is no way to fix that at the runtime/test level. The issue is with the CI logic here: serotonin is a machine that has both NVIDIA and ROCM installed at the software stack level, but only a ROCM device. The CMake logic defaults to NVIDIA in that case, and we should pass ROCM.
I don't think we should complicate the CMake logic of the tests more here: they fail because we asked them to run on NVIDIA and there are no NVIDIA cards.
Geri is working on preparing a PR on the CI to fix this issue at the CI logic level.
The CI instances are tagged with their devices. As an example, serotonin is tagged with gpu_amd while guyot is tagged with gpu_nvidia. The CI should use these tags to drive the correct set of tests.
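To make the tag-driven idea concrete, here is a minimal sketch of how a CI helper could map runner tags to a GPU backend. The tag names gpu_amd and gpu_nvidia come from the comment above; the function name `select_gpu_backend` and the returned labels ("cuda", "hip") are hypothetical and are not existing PaRSEC or CI options.

```python
def select_gpu_backend(tags):
    """Pick a GPU backend label from a CI runner's tags.

    Hypothetical helper: the labels "cuda" and "hip" are placeholders,
    not actual PaRSEC CMake options.
    """
    has_nvidia = "gpu_nvidia" in tags
    has_amd = "gpu_amd" in tags
    if has_nvidia and has_amd:
        # Ambiguous runner: refuse to guess rather than default silently.
        raise ValueError("runner is tagged with both gpu_nvidia and gpu_amd")
    if has_nvidia:
        return "cuda"
    if has_amd:
        return "hip"
    # No GPU tag: run the CPU-only test set.
    return None


# Example: serotonin is tagged gpu_amd, guyot is tagged gpu_nvidia.
print(select_gpu_backend({"gpu_amd"}))
print(select_gpu_backend({"gpu_nvidia"}))
```

The point of the sketch is that the choice is driven by what the runner declares it has, not by which software stacks happen to be installed on it.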
There is a third problem:
The JDF of the stress tester is not symmetrical
https://github.com/ICLDisco/parsec/blob/1ababbe248064c5f3deaab2f9b04e56b556a3f02/tests/runtime/cuda/stress.jdf#L106
https://github.com/ICLDisco/parsec/blob/1ababbe248064c5f3deaab2f9b04e56b556a3f02/tests/runtime/cuda/stress.jdf#L128
This occasionally causes the following behavior:
flow GEMM(1,1,0) B <- READ_A(1,0) accesses a data repo entry that has been freed as part of the termination for GEMM(1,0,0)
entry 0x1bda310/READ_A(1, 0) of hash table this_task->data._f_B.source_repo has a usage count of 4/4 and is not retained: freeing it at stress.c:1041 @__data_repo_entry_used_once:120
This happens when the test runs on CPU (devices exist, on b00, but OOM, memory_use=90). Not sure why this has anything to do with the changes in the PR.
We believe that the READ of B should be from READ_A(m, r), without the (m+r)%mt randomization.
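As a rough illustration of what "not symmetrical" means here, the toy check below compares what each producer declares it sends with what each consumer declares it reads. This is a sketch under stated assumptions, not a transcription of stress.jdf: the task shapes, the function names, and the position of the (m+r)%mt term (placed in the first index purely for illustration) are all hypothetical.

```python
from itertools import product


def producer_targets(m, r):
    """Toy output dependency: READ_A(m, r) declares it feeds consumer (m, r)."""
    return {(m, r)}


def consumer_source_randomized(m, r, mt):
    """Toy input dependency with an (m+r)%mt shuffle (assumed placement)."""
    return ((m + r) % mt, r)


def consumer_source_direct(m, r, mt):
    """Toy input dependency without the shuffle: read from READ_A(m, r)."""
    return (m, r)


def asymmetric_edges(consumer_source, mt, rt):
    """List consumers whose declared source does not declare them as a target."""
    bad = []
    for m, r in product(range(mt), range(rt)):
        src = consumer_source(m, r, mt)
        if (m, r) not in producer_targets(*src):
            bad.append(((m, r), src))
    return bad


print(asymmetric_edges(consumer_source_randomized, mt=3, rt=2))
print(asymmetric_edges(consumer_source_direct, mt=3, rt=2))
```

When the two sides disagree, a producer's repo entry can reach its expected usage count and be freed while a consumer that the producer never declared still tries to read it, which is consistent with the message quoted above.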
Describe the bug
The stress:gpu test, when compiled with CUDA support, may crash when run on a system without a CUDA GPU. See
https://github.com/ICLDisco/parsec/actions/runs/8190649752/job/22398165838?pr=515
Note that there appears to be some interaction with Intel ZE libraries that is unexpected here.
To Reproduce
See https://github.com/ICLDisco/parsec/actions/runs/8190649752/job/22398165838?pr=515