eth-cscs / DLA-Future

DLA-Future
https://eth-cscs.github.io/DLA-Future/master/
BSD 3-Clause "New" or "Revised" License

Some miniapps fail with HIP 5.6 #1049

Open msimberg opened 11 months ago

msimberg commented 11 months ago

While investigating the slowdowns seen with newer HIP versions, I found that the eigensolver miniapp consistently fails immediately, on the first iteration, with:

terminate called after throwing an instance of 'whip::exception'
  what():  invalid argument
srun: error: nid006104: task 0: Segmentation fault
srun: launch/slurm: _step_signal: Terminating StepId=4970716.56
slurmstepd: error: *** STEP 4970716.56 ON nid006104 CANCELLED AT 2023-11-20T22:16:08 ***
srun: error: nid006104: tasks 1-7: Terminated
srun: Force Terminated StepId=4970716.56

I have not investigated this at all. Some miniapps are clearly fine (miniapp_bt_band_to_tridiag), while others fail either for some configurations or always (miniapp_bt_reduction_to_band, miniapp_band_to_tridiag). The remaining miniapps have not been tested yet. The core dumps contain nothing useful, so this will need a more thorough investigation.
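
For triaging the remaining miniapps, something like the following could pull the exception type and message out of each captured log. This is only a sketch: the helper name `find_uncaught_exception` is hypothetical, and it assumes every failure prints the same two-line terminate message as the output above, which has not been checked across miniapps.

```python
import re

# Matches the two-line message printed when an exception escapes uncaught,
# in the format seen in the log above (assumed to be the same for all miniapps).
TERMINATE_RE = re.compile(
    r"terminate called after throwing an instance of '(?P<type>[^']+)'\s*\n"
    r"\s*what\(\):\s*(?P<what>.*)"
)


def find_uncaught_exception(log_text):
    """Return (exception_type, message) for the first uncaught exception, or None."""
    match = TERMINATE_RE.search(log_text)
    if match is None:
        return None
    return match.group("type"), match.group("what").strip()


# Excerpt of the output reported above.
sample_log = """terminate called after throwing an instance of 'whip::exception'
  what():  invalid argument
srun: error: nid006104: task 0: Segmentation fault
"""

print(find_uncaught_exception(sample_log))
```

Running this over the saved output of each miniapp would at least show whether they all fail with the same `whip::exception` or with different errors.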