luraess closed this 4 months ago
AMDGPU tests now pass, having updated the pipeline per @vchuravy's suggestion in #840 and using the latest OpenMPI and UCX:
OPENMPI_VER: "5.0"
OPENMPI_VER_FULL: "5.0.3"
UCX_VER: "1.17.0"
The only failing tests are test_allreduce.jl, test_reduce.jl, and test_scan.jl,
which I have for now excluded using the ENV var mechanism.
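A minimal sketch of how such an ENV-var exclusion mechanism typically works, assuming a comma-separated variable (here called `JULIA_MPI_TEST_EXCLUDE`; name and semantics are assumptions, not confirmed by this thread) that the test runner consults before launching each test file:

```shell
#!/bin/sh
# Assumed mechanism: test files listed (comma-separated) in
# JULIA_MPI_TEST_EXCLUDE are skipped by the test runner.
export JULIA_MPI_TEST_EXCLUDE="test_allreduce.jl,test_reduce.jl,test_scan.jl"

# Illustrative runner loop over two test files: one excluded, one not.
for t in test_allreduce.jl test_sendrecv.jl; do
    # Wrap both sides in commas so matching is exact, not substring-based.
    case ",$JULIA_MPI_TEST_EXCLUDE," in
        *",$t,"*) echo "skipping $t" ;;  # -> skipping test_allreduce.jl
        *)        echo "running $t" ;;   # -> running test_sendrecv.jl
    esac
done
```

The comma-wrapping trick avoids false positives (e.g. `test_reduce.jl` matching inside `test_allreduce.jl`).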
CUDA tests still fail though.
@vchuravy all CUDA tests are failing
Also, the CUDA Buildkite workflows (builds and compilation during test) run close to an order of magnitude slower than the ROCm ones.
They are passing on main https://buildkite.com/julialang/mpi-dot-jl/builds/1451
The slow-down is very weird; it looks like things slow down when jobs run in parallel on the same machine.
With concurrency set to 1, shouldn't the tests run serially?
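For reference, in a Buildkite pipeline serial execution is requested per step via the `concurrency` and `concurrency_group` attributes; a sketch (step command and group name are illustrative, not from this PR):

```yaml
steps:
  - label: "CUDA tests"
    command: "julia --project -e 'using Pkg; Pkg.test()'"
    concurrency: 1                      # at most one job from this group at a time
    concurrency_group: "mpi-jl/cuda"    # hypothetical group name
```

Note that `concurrency` limits jobs within the named group across the whole organization; jobs from other pipelines or groups can still land on the same machine, which could explain contention despite `concurrency: 1`.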
Why do tests pass on master but not here, given that you merged master into this branch?
So, using the latest OpenMPI and UCX (as done for ROCm) segfaults in the CUDA Buildkite CI. Rolling back to the versions used on master fixes it.
Tests now pass (with the exception of test_allreduce.jl, test_reduce.jl, and test_scan.jl for ROCm), and codecov seems to be complaining about changes and project coverage.
Addresses #841. Supersedes #839 w.r.t. AMDGPU compat.
Getting ROCm CI back on track