ICLDisco / parsec

PaRSEC is a generic framework for architecture-aware scheduling and management of micro-tasks on distributed, GPU-accelerated, many-core heterogeneous architectures. PaRSEC assigns computation threads to the cores and GPU accelerators, overlaps communications and computations, and uses a dynamic, fully-distributed scheduler based on architectural features such as NUMA nodes and algorithmic features such as data reuse.

stress:gpu crashes when compiled with CUDA but run without a device #641

Closed: abouteiller closed this issue 4 months ago

abouteiller commented 7 months ago

Describe the bug

The stress:gpu test, when compiled with CUDA support, may crash when run on a system without a CUDA GPU. See

https://github.com/ICLDisco/parsec/actions/runs/8190649752/job/22398165838?pr=515

Note that there appears to be some unexpected interaction with the Intel Level Zero (ZE) libraries here.

To Reproduce

See https://github.com/ICLDisco/parsec/actions/runs/8190649752/job/22398165838?pr=515

abouteiller commented 6 months ago

There are two different failure modes:

  1. stress tries to allocate 33 GB of memory, which may or may not be possible, especially on low-end CUDA devices or as host memory.
  2. the execution stage tries to run `hook = tc->incarnations[chore_id]`, but disabling the CUDA device nullified the hook for the only valid `chore_id` of 0 (the incarnations array does not have a CPU implementation); a defensive check is sketched after this list.
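
A minimal sketch of the pattern behind failure mode 2, using stand-in types (the structure and function names below are illustrative, modeled on the quoted expression, not the actual PaRSEC scheduler code): once the CUDA device is disabled, the only incarnation slot holds a NULL hook and calling it crashes, whereas a defensive check would turn the crash into a clean error.

```c
/* Illustrative sketch only: stand-in types modeled on the quoted expression
 * hook = tc->incarnations[chore_id]; the real PaRSEC structures differ. */
#include <stdio.h>
#include <stdlib.h>

typedef int (*task_hook_fn)(void *task);

typedef struct {
    task_hook_fn *incarnations;   /* one hook per device class (CUDA, CPU, ...) */
    int           nb_incarnations;
} task_class_t;

/* What the execution stage effectively does: pick the chore's hook and call it. */
static int execute_chore(const task_class_t *tc, int chore_id, void *task)
{
    task_hook_fn hook = tc->incarnations[chore_id];
    if (NULL == hook) {
        /* Disabling the CUDA device left chore_id 0 with a NULL hook, and there
         * is no CPU incarnation to fall back to: report instead of crashing. */
        fprintf(stderr, "no valid incarnation for chore %d: no device can run this task\n",
                chore_id);
        return EXIT_FAILURE;
    }
    return hook(task);
}
```
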
therault commented 6 months ago

I pushed some commits in PR #642 to handle the lack of a device more gracefully, both in the test and in the runtime system.
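
As an illustration of the test-side part of that change, here is a sketch under assumptions (the device query below is a placeholder, not a PaRSEC API, and PR #642 may do this differently): detect the absence of a usable device at startup and report a skip instead of crashing.

```c
/* Sketch of a test-level guard; count_enabled_gpu_devices() is a stub standing
 * in for whatever device query the runtime exposes, not a real PaRSEC call. */
#include <stdio.h>
#include <stdlib.h>

static int count_enabled_gpu_devices(void) { return 0; }  /* placeholder stub */

int main(void)
{
    if (count_enabled_gpu_devices() <= 0) {
        fprintf(stderr, "stress:gpu: no GPU device available, skipping\n");
        /* 77 is the Automake-style "skipped" exit code; CTest honors it only
         * if the test's SKIP_RETURN_CODE property is set accordingly. */
        return 77;
    }
    /* ... set up PaRSEC and run the stress workload ... */
    return EXIT_SUCCESS;
}
```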

However, the test still fails in the CI, since there is no way to fix that at the runtime/test level. The issue is with the CI logic here: serotonin is a machine that has both the NVIDIA and ROCm software stacks installed, but only a ROCm device. The CMake logic defaults to NVIDIA in that case, when we should pass ROCM.

I don't think we should further complicate the CMake logic of the tests here: they fail because we asked them to run on NVIDIA and there are no NVIDIA cards.

Geri is preparing a PR on the CI side to fix this issue at the CI logic level.

bosilca commented 6 months ago

The CI instances are tagged with their devices. For example, serotonin is tagged with `gpu_amd` while guyot is tagged with `gpu_nvidia`. The CI should use these tags to drive the correct set of tests.

abouteiller commented 6 months ago

There is a third problem:

The JDF of the stress tester is not symmetrical:

https://github.com/ICLDisco/parsec/blob/1ababbe248064c5f3deaab2f9b04e56b556a3f02/tests/runtime/cuda/stress.jdf#L106
https://github.com/ICLDisco/parsec/blob/1ababbe248064c5f3deaab2f9b04e56b556a3f02/tests/runtime/cuda/stress.jdf#L128

Buggy behavior

This occasionally causes the following behavior:

flow GEMM(1,1,0) B <- READ_A(1,0) accesses a data repo entry that has been freed as part of the termination for GEMM(1,0,0)
entry 0x1bda310/READ_A(1, 0) of hash table this_task->data._f_B.source_repo has a usage count of 4/4 and is not retained: freeing it at stress.c:1041 @__data_repo_entry_used_once:120

This happens when the test runs on CPU (devices exist, on b00, but OOM, memory_use=90); it is not clear why this has anything to do with the changes in the PR.
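
For context, the warning above comes from usage-count-based freeing of data-repo entries. A generic sketch of that mechanism (not the actual PaRSEC data-repo code) follows: an entry is created expecting a fixed number of consumers and is freed when the last expected consumer has used it, so any consumer resolving to the entry beyond that count touches freed memory.

```c
/* Generic sketch of usage-count based freeing, to illustrate the failure mode
 * in the warning above; not the actual PaRSEC data-repo implementation. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    void *data;
    int   usage_limit;   /* consumers accounted for when the entry was created */
    int   usage_count;   /* consumers that have completed so far */
} repo_entry_t;

/* Called once per consuming task. When the last expected consumer finishes,
 * the entry is freed; a consumer arriving after that reads freed memory. */
static void entry_used_once(repo_entry_t *entry)
{
    entry->usage_count += 1;
    if (entry->usage_count == entry->usage_limit) {
        printf("usage count %d/%d and not retained: freeing entry\n",
               entry->usage_count, entry->usage_limit);
        free(entry->data);
        free(entry);
    }
}

int main(void)
{
    repo_entry_t *e = malloc(sizeof(*e));
    e->data = malloc(32);
    e->usage_limit = 4;   /* matches the 4/4 count in the warning above */
    e->usage_count = 0;

    for (int i = 0; i < 4; i++)
        entry_used_once(e);   /* the 4th call frees the entry */

    /* A further consumer resolving to this entry, as GEMM(1,1,0) does in the
     * log above, would now access freed memory. */
    return EXIT_SUCCESS;
}
```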

Potential fix

We believe that the READ of B should be from `READ_A(m, r)`, without the `(m+r)%mt` randomization.