Hi @dhiepler -- I also saw your trouble ticket at NERSC; let me respond here so that the community can weigh in.
Without having tried it myself (I am currently traveling and have bad internet -- and alas my burner laptop doesn't have Julia), here are some ideas:
- Are a and b ordered in the same way? (Normally it's a largest-to-smallest order, but this should be checked.)
- Is Matrix the right object to use, or do we want DistMatrix?
- Are a and b actually singular values (i.e. check whether they are the absolute values of eigenvalues)?

I will check these myself soon, but maybe @andreasnoack has some ideas in the meantime.
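For the first and third checks, a quick local sanity test (a sketch using only stock LinearAlgebra, independent of Elemental) could look like this:

using LinearAlgebra

M = randn(10, 10)
a = svdvals(M)                      # LAPACK returns singular values in descending order
@assert issorted(a; rev = true)     # confirms the largest-to-smallest ordering

# For a general (non-normal) matrix the singular values are not simply the
# absolute values of the eigenvalues, so comparing against |eigvals| shows
# whether a routine is accidentally returning eigenvalue magnitudes instead.
e = sort(abs.(eigvals(M)); rev = true)
println("svdvals ≈ |eigvals|: ", a ≈ e)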
Cheers, Johannes
I have more information, thanks also to @dhiepler. Here is an expanded test program:
using Distributed
using Elemental
using DistributedArrays
using LinearAlgebra
@everywhere using Random
@everywhere Random.seed!(123)
A = drandn(10,10)
Al = Matrix(A)
println("sum of diffs= ",sum(A-Al))
println("Determinant of singular matrix is 0")
#println("det(A)= ", det(A)) # not implemented for DArray
println("det(Al)= ", det(Al))
a = svdvals(Al)
b = Elemental.svdvals(A)
println("a= ", a)
println("b= ", b)
println("a-b= ", a-b)
println("a./b= ", a./b)
#println("det(a) = ", det(a))
println("sum of ratios after svd = ",sum(a./b))
println("sum of diffs after svd = ",sum(a-b))
println("approx= ", a ≈ b)
What I found was rather startling: it works fine on my laptop with more than one MPI rank, but when I run it using MPI on Cori, I get a constant factor difference in each element:
blaschke@nid12849:~/tmp> srun -n 1 julia test_sdvals_2.jl
sum of diffs= 0.0
Determinant of singular matrix is 0
det(Al)= -1191.023311448886
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
b= [6.351702336978771; 5.880409677742518; 4.799195878542354; 3.3478338822775187; 2.980624165171743; 2.0566695219525486; 1.4953776892773092; 1.1050612757140172; 0.6085328305411005; 0.3219564747545404]
a-b= [4.440892098500626e-15; -2.6645352591003757e-15; 1.7763568394002505e-15; 1.3322676295501878e-15; -1.3322676295501878e-15; 0.0; 2.220446049250313e-16; -8.881784197001252e-16; -3.3306690738754696e-16; 2.220446049250313e-16]
a./b= [1.0000000000000007; 0.9999999999999996; 1.0000000000000004; 1.0000000000000004; 0.9999999999999996; 1.0; 1.0000000000000002; 0.9999999999999992; 0.9999999999999994; 1.0000000000000007]
sum of ratios after svd = 9.999999999999998
sum of diffs after svd = 2.7755575615628914e-15
approx= true
blaschke@nid12849:~/tmp> srun -n 2 julia test_sdvals_2.jl
sum of diffs= 0.0
Determinant of singular matrix is 0
sum of diffs= 0.0
Determinant of singular matrix is 0
det(Al)= -1191.023311448886
det(Al)= -1191.023311448886
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
b= [12.703404673957543; 11.760819355485031; 9.598391757084704; 6.695667764555037; 5.961248330343489; 4.113339043905101; 2.990755378554618; 2.210122551428033; 1.2170656610821995; 0.6439129495090811]
b= [12.703404673957543; 11.760819355485031; 9.598391757084704; 6.695667764555037; 5.961248330343489; 4.113339043905101; 2.990755378554618; 2.210122551428033; 1.2170656610821995; 0.6439129495090811]
a-b= [-6.351702336978768; -5.880409677742516; -4.799195878542348; -3.3478338822775173; -2.980624165171747; -2.056669521952552; -1.4953776892773085; -1.1050612757140168; -0.6085328305410993; -0.3219564747545404]
a-b= [-6.351702336978768; -5.880409677742516; -4.799195878542348; -3.3478338822775173; -2.980624165171747; -2.056669521952552; -1.4953776892773085; -1.1050612757140168; -0.6085328305410993; -0.3219564747545404]
a./b= [0.5000000000000003; 0.5; 0.5000000000000003; 0.5000000000000002; 0.49999999999999956; 0.49999999999999956; 0.5000000000000001; 0.4999999999999999; 0.5000000000000003; 0.5000000000000002]
sum of ratios after svd = 5.0
sum of diffs after svd = -28.947363732952414
a./b= [0.5000000000000003; 0.5; 0.5000000000000003; 0.5000000000000002; 0.49999999999999956; 0.49999999999999956; 0.5000000000000001; 0.4999999999999999; 0.5000000000000003; 0.5000000000000002]
sum of ratios after svd = 5.0
sum of diffs after svd = -28.947363732952414
approx= false
approx= false
blaschke@nid12849:~/tmp> srun -n 4 julia test_sdvals_2.jl
sum of diffs= 0.0
Determinant of singular matrix is 0
sum of diffs= 0.0
sum of diffs= 0.0
Determinant of singular matrix is 0
Determinant of singular matrix is 0
sum of diffs= 0.0
Determinant of singular matrix is 0
det(Al)= -1191.023311448886
det(Al)= -1191.023311448886
det(Al)= -1191.023311448886
det(Al)= -1191.023311448886
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
b= [25.406809347915086; 23.521638710970063; 19.196783514169415; 13.391335529110073; 11.92249666068697; 8.2266780878102; 5.981510757109234; 4.420245102856065; 2.4341313221643994; 1.2878258990181617]
b= [25.406809347915086; 23.521638710970063; 19.196783514169415; 13.391335529110073; 11.92249666068697; 8.2266780878102; 5.981510757109234; 4.420245102856065; 2.4341313221643994; 1.2878258990181617]
b= [25.406809347915086; 23.521638710970063; 19.196783514169415; 13.391335529110073; 11.92249666068697; 8.2266780878102; 5.981510757109234; 4.420245102856065; 2.4341313221643994; 1.2878258990181617]
b= [25.406809347915086; 23.521638710970063; 19.196783514169415; 13.391335529110073; 11.92249666068697; 8.2266780878102; 5.981510757109234; 4.420245102856065; 2.4341313221643994; 1.2878258990181617]
a-b= [-19.05510701093631; -17.641229033227546; -14.397587635627058; -10.043501646832553; -8.941872495515229; -6.170008565857652; -4.486133067831925; -3.3151838271420493; -1.8255984916232992; -0.965869424263621]
a-b= [-19.05510701093631; -17.641229033227546; -14.397587635627058; -10.043501646832553; -8.941872495515229; -6.170008565857652; -4.486133067831925; -3.3151838271420493; -1.8255984916232992; -0.965869424263621]
a-b= [-19.05510701093631; -17.641229033227546; -14.397587635627058; -10.043501646832553; -8.941872495515229; -6.170008565857652; -4.486133067831925; -3.3151838271420493; -1.8255984916232992; -0.965869424263621]
a-b= [-19.05510701093631; -17.641229033227546; -14.397587635627058; -10.043501646832553; -8.941872495515229; -6.170008565857652; -4.486133067831925; -3.3151838271420493; -1.8255984916232992; -0.965869424263621]
a./b= [0.25000000000000017; 0.25; 0.2500000000000001; 0.2500000000000001; 0.24999999999999992; 0.24999999999999983; 0.25000000000000017; 0.25; 0.2500000000000001; 0.25000000000000017]
sum of ratios after svd = 2.5000000000000004
sum of diffs after svd = -86.84209119885725
a./b= [0.25000000000000017; 0.25; 0.2500000000000001; 0.2500000000000001; 0.24999999999999992; 0.24999999999999983; 0.25000000000000017; 0.25; 0.2500000000000001; 0.25000000000000017]
sum of ratios after svd = 2.5000000000000004
sum of diffs after svd = -86.84209119885725
a./b= [0.25000000000000017; 0.25; 0.2500000000000001; 0.2500000000000001; 0.24999999999999992; 0.24999999999999983; 0.25000000000000017; 0.25; 0.2500000000000001; 0.25000000000000017]
sum of ratios after svd = 2.5000000000000004
sum of diffs after svd = -86.84209119885725
a./b= [0.25000000000000017; 0.25; 0.2500000000000001; 0.2500000000000001; 0.24999999999999992; 0.24999999999999983; 0.25000000000000017; 0.25; 0.2500000000000001; 0.25000000000000017]
sum of ratios after svd = 2.5000000000000004
sum of diffs after svd = -86.84209119885725
approx= false
approx= false
approx= false
approx= false
My crude guess would be that there is a factor difference of 1/n (n = number of ranks). I built the version on Cori against Cray MPICH. Any ideas, @andreasnoack?
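One way to test that guess would be to append something like this to the test program above (a sketch; nranks is just whatever was passed to srun -n, hard-coded here for illustration):

nranks = 2    # e.g. for the `srun -n 2` run above
println("a ≈ b ./ nranks: ", a ≈ b ./ nranks)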
I can reproduce the bug on my laptop. My guess is that this is an issue with the conversion between Elemental arrays and DistributedArrays.DArray. If I just use Elemental's DistMatrix directly, then I don't get wrong results, i.e.
using Elemental
using LinearAlgebra
A = Elemental.DistMatrix(Float64)
Elemental.gaussian!(A, 10, 10)
Al = Array(A)
println("sum of diffs= ", sum(A - Al))
a = svdvals(Al)
b = svdvals(A)
println("a = ", a)
println("b = ", b)
println("a - b= ", a - b)
println("a ./ b= ", a ./ b)
@info "DONE!"
Thanks @andreasnoack -- I just verified that this also works on Cori. This leaves an open question, given what @dhiepler is trying to do with the original code: how do we create an Elemental array from an existing array (if not from a DArray)?
using MPIClusterManagers, Distributed
man = MPIManager(np = 4)
addprocs(man) # backend communicates with MPI
@everywhere begin
using Elemental
using LinearAlgebra
end
using DistributedArrays
A = drandn(50,50)
Al = Matrix(A)
a = svdvals(Al)
b = Array(Elemental.svdvals(A))
println("sum of diffs= ",sum(a-b))
@mpi_do man begin
grid = Elemental.DefaultGrid()
@show (Elemental.row(grid), Elemental.column(grid))
A = Elemental.DistMatrix(Float64)
Elemental.zeros!(A, 50,50)
@show Elemental.localHeight(A), Elemental.height(A)
@show Elemental.localWidth(A), Elemental.width(A)
end
Running this on top of https://github.com/JuliaParallel/Elemental.jl/pull/77 resulted for me in:
From worker 3: (Elemental.row(grid), Elemental.column(grid)) = (0, 0)
From worker 2: (Elemental.row(grid), Elemental.column(grid)) = (0, 0)
From worker 4: (Elemental.row(grid), Elemental.column(grid)) = (0, 0)
From worker 5: (Elemental.row(grid), Elemental.column(grid)) = (0, 0)
From worker 3: (Elemental.localHeight(A), Elemental.height(A)) = (50, 50)
From worker 2: (Elemental.localHeight(A), Elemental.height(A)) = (50, 50)
From worker 4: (Elemental.localHeight(A), Elemental.height(A)) = (50, 50)
From worker 3: (Elemental.localWidth(A), Elemental.width(A)) = (50, 50)
From worker 2: (Elemental.localWidth(A), Elemental.width(A)) = (50, 50)
From worker 5: (Elemental.localHeight(A), Elemental.height(A)) = (50, 50)
From worker 4: (Elemental.localWidth(A), Elemental.width(A)) = (50, 50)
From worker 5: (Elemental.localWidth(A), Elemental.width(A)) = (50, 50)
Meaning that the Grid configuration did not work. I double-checked my MPI.jl and saw that I was using OpenMPI_jll as my MPI provider (I would expect that using the system MPI yields the same effect). The default Elemental_jll we are shipping uses MPICH_jll. You can check yourself with:
julia> MPI.identify_implementation()
(MPI.MPICH, v"3.4.2")
After switching to MPICH_jll:
JULIA_MPI_BINARY="MPICH_jll" julia --project=. -e 'import Pkg; Pkg.build("MPI")'
I got:
vchuravy@odin ~/s/s/a/svd2> JULIA_PROJECT=(pwd) julia --project=. test.jl
From worker 3: (Elemental.row(grid), Elemental.column(grid)) = (1, 0)
From worker 2: (Elemental.row(grid), Elemental.column(grid)) = (0, 0)
From worker 4: (Elemental.row(grid), Elemental.column(grid)) = (0, 1)
From worker 5: (Elemental.row(grid), Elemental.column(grid)) = (1, 1)
From worker 3: (Elemental.localHeight(A), Elemental.height(A)) = (25, 50)
From worker 2: (Elemental.localHeight(A), Elemental.height(A)) = (25, 50)
From worker 4: (Elemental.localHeight(A), Elemental.height(A)) = (25, 50)
From worker 3: (Elemental.localWidth(A), Elemental.width(A)) = (25, 50)
From worker 2: (Elemental.localWidth(A), Elemental.width(A)) = (25, 50)
From worker 5: (Elemental.localHeight(A), Elemental.height(A)) = (25, 50)
From worker 4: (Elemental.localWidth(A), Elemental.width(A)) = (25, 50)
From worker 5: (Elemental.localWidth(A), Elemental.width(A)) = (25, 50)
When https://github.com/JuliaParallel/MPI.jl/pull/513 lands we will be able to fix issues like this.
Now:
using MPIClusterManagers, Distributed
man = MPIManager(np = 4)
addprocs(man) # backend communicates with MPI
using Elemental
using LinearAlgebra
using DistributedArrays
A = drandn(50,50)
Al = Matrix(A)
a = svdvals(Al)
b = Array(Elemental.svdvals(A))
println("sum of diffs= ",sum(a-b))
yields sum of diffs= 1.2073675392798577e-14
Thanks @vchuravy for looking into this also. Unfortunately it doesn't work on Cori. Is the MPIClusterManagers part necessary? (I find that it seems to hang/time out on Cori -- maybe we can look into why in a separate issue/as part of the Julia for HPC group.) Taking out the MPIClusterManagers part, I get:
blaschke@nid00805:~/tmp> JULIA_MPI_BINARY="MPICH_jll" srun -n 4 julia test_sdvals_4.jl
sum of diffs= -297.4855631120604
sum of diffs= -296.6150430552368
sum of diffs= -303.4465072750694
sum of diffs= -293.6032813641691
This is using the system MPICH (i.e. cray-mpich).
Yeah, the MPIClusterManager is necessary for the variant of the code that uses DistributedArrays; otherwise the ranks are not wired up correctly.
Oh Feck! So now I have to fix MPIClusterManager. @vchuravy, should I open an issue on https://github.com/JuliaParallel/MPIClusterManagers.jl?
> should I open an issue
Please do :)
Done!
@vchuravy Based on our discussion in JuliaParallel/MPIClusterManagers.jl#26, your solution was to run the addprocs version without srun. That won't work here though, because without srun, running:
using Elemental
throws
Wed Dec 8 12:00:48 2021: [unset]:_pmi_alps_init:alps_get_placement_info returned with error -1
Wed Dec 8 12:00:48 2021: [unset]:_pmi_init:_pmi_alps_init returned -1
[Wed Dec 8 12:00:59 2021] [c3-0c2s12n1] Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(647).......: PMI2 init failed: 1
when trying to initialize Elemental.jl ... we appear to have a catch-22 :(
And if I avoid the addprocs and run the following within srun:
using MPIClusterManagers, Distributed
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)
using Elemental
using LinearAlgebra
using DistributedArrays
A = drandn(50,50)
Al = Matrix(A)
a = svdvals(Al)
b = Array(Elemental.svdvals(A))
println("sum of diffs= ",sum(a-b))
MPIClusterManagers.stop_main_loop(manager)
I get a deadlock.
Without srun, but with salloc?
Elemental needs srun (without srun, Elemental won't be able to initialize Cray MPICH).
I am confused. MPIClusterManager should use srun/mpiexec to connect the "worker" processes, which allows them to use Elemental and to communicate among themselves using MPI. The front-end should not need to run under mpiexec.
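To make that concrete, the intended launch pattern would look roughly like this (a sketch only; the mpirun_cmd keyword is my assumption for pointing MPIManager at srun, so check it against the MPIClusterManagers documentation):

# the front-end itself is started as a plain `julia driver.jl`, not under srun/mpiexec
using MPIClusterManagers, Distributed

# the manager launches and wires up the worker processes itself;
# mpirun_cmd is assumed to be the knob for using srun as that launcher
man = MPIManager(np = 4, mpirun_cmd = `srun -n 4`)
addprocs(man)    # the workers now run under srun and can talk MPI among themselves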
I don't know all the details at this point, but if I run Elemental (remember, we don't use the one provided by BinaryBuilder -- that one doesn't use Cray MPICH, and therefore doesn't work on Cori -- but instead we build our own version using the Cray compiler wrappers) without srun, then I get:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(647).......: PMI2 init failed: 1
This is a common error when a program runs MPI_Init() without being launched via srun. I haven't had a chance to dig around the guts of MPIClusterManagers yet (I plan to). I suspect that MPIClusterManagers launches the workers using srun, thereby avoiding the srun-less MPI_Init.
Here is a theory that I will try next: would wrapping using Elemental in a @mpi_do fix this? (I.e. we want to avoid the situation where anything except the workers tries to initialize MPI.)
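Concretely, the idea would be something like this (an untested sketch, reusing only the calls that already appear earlier in this thread):

using MPIClusterManagers, Distributed
man = MPIManager(np = 4)
addprocs(man)    # backend communicates with MPI

# load Elemental -- and hence run MPI_Init -- only on the MPI-wired workers,
# never on the srun-less front-end process
@mpi_do man begin
    using Elemental
    using LinearAlgebra
end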
Oh yeah, that makes sense. That is rather unfriendly behavior on the Cray side... We still load the Elemental library on the front-end process, though.
Maybe we can use the PMI cluster manager I wrote: https://github.com/JuliaParallel/PMI.jl/blob/main/examples/distributed.jl
Alas @vchuravy Cray MPICH wants PMI2 :(
cori01:~> export LD_LIBRARY_PATH=/usr/lib64/slurmpmi:$LD_LIBRARY_PATH
cori01:~> srun -n 2 -C haswell julia pmi_simp.jl
srun: job 51995644 queued and waiting for resources
srun: job 51995644 has been allocated resources
/global/common/cori_cle7/software/julia/julia/1.6.0/bin/julia: symbol lookup error: /opt/cray/pe/mpt/7.7.14/gni/mpich-intel/16.0/lib/libmpich.so: undefined symbol: PMI2_Init
srun: error: nid00915: task 0: Exited with exit code 127
srun: launch/slurm: _step_signal: Terminating StepId=51995644.0
srun: error: nid00915: task 1: Exited with exit code 1
My goal is to add svd(A::DArray) to the Julia interface to Elemental. I have been looking at svdvals to understand how to do it, but I do not know how to use svdvals in parallel. My test program has a sum of diffs close to 0 when using a single process, but when I run it with julia -p 2 or mpiexecjl -n 2 the sum of diffs is large. How do I convert the program to run in parallel?