JuliaParallel / MPI.jl

MPI wrappers for Julia
https://juliaparallel.org/MPI.jl/
The Unlicense

Fix ROCm CI #844

Closed. luraess closed this 4 months ago.

luraess commented 5 months ago

Addresses #841. Supersedes #839 with respect to AMDGPU compat.

Getting ROCm CI back on track

luraess commented 5 months ago

AMDGPU tests now pass after updating the pipeline per @vchuravy's suggestion in #840 and moving to the latest OpenMPI and UCX:

OPENMPI_VER: "5.0"
OPENMPI_VER_FULL: "5.0.3"
UCX_VER: "1.17.0"
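For context, these pins sit in the pipeline's top-level env block; a minimal sketch in Buildkite YAML (the surrounding step definitions are assumed, not copied from the actual pipeline):

env:
  OPENMPI_VER: "5.0"
  OPENMPI_VER_FULL: "5.0.3"
  UCX_VER: "1.17.0"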

The only failing tests are test_allreduce.jl, test_reduce.jl, and test_scan.jl, which I have excluded for now via the ENV var mechanism, as sketched below.
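For illustration, the exclusion amounts to setting a comma-separated list in the pipeline env; a minimal sketch, assuming the test runner reads a JULIA_MPI_TEST_EXCLUDE variable:

env:
  # Skip the three failing ROCm test files (variable name assumed)
  JULIA_MPI_TEST_EXCLUDE: "test_allreduce.jl,test_reduce.jl,test_scan.jl"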

CUDA tests still fail though.

luraess commented 5 months ago

@vchuravy all CUDA tests are failing

luraess commented 5 months ago

Also, the CUDA Buildkite workflows (builds and compilation during tests) are running close to an order of magnitude slower than the ROCm ones.

vchuravy commented 4 months ago

They are passing on main: https://buildkite.com/julialang/mpi-dot-jl/builds/1451

The slow-down is very weird; it looks like jobs slow down when several of them run in parallel on the same machine.

luraess commented 4 months ago

With concurrency set to 1, shouldn't the tests run serially?
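For reference, Buildkite serializes a step by pairing concurrency with a concurrency_group; a minimal sketch (the label, group name, and command here are assumptions, not taken from this pipeline):

steps:
  - label: "CUDA tests"
    command: "julia --project -e 'using Pkg; Pkg.test()'"
    # At most one build runs this step at a time within the group (group name assumed)
    concurrency: 1
    concurrency_group: "mpi-jl/cuda"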

Why do tests pass on master but not here, given that master was merged into this branch?

luraess commented 4 months ago

> Why do tests pass on master but not here, given that master was merged into this branch?

So: using the latest OpenMPI and UCX in the CUDA Buildkite CI, as done for ROCm, segfaults. Rolling back to the versions used on master fixes it.

Tests now pass (with the exception of test_allreduce.jl, test_reduce.jl, and test_scan.jl for ROCm), and Codecov seems to be complaining about the changes and project coverage.