JuliaParallel / Elemental.jl

Julia interface to the Elemental linear algebra library.

How do I use svdvals in parallel? #75

Open dh-ilight opened 2 years ago

dh-ilight commented 2 years ago

My goal is to add svd(A::DArray) to the Julia interface to Elemental. I have been looking at svdvals to understand how to do it, but I do not know how to use svdvals in parallel. The following program gives a sum of diffs close to 0 when run with a single process, but when I run it with julia -p 2 or mpiexecjl -n 2, the sum of diffs is large. How do I convert the program to run in parallel?

using Elemental
using DistributedArrays
using LinearAlgebra

A  = drandn(50, 50)        # distributed random matrix
Al = Matrix(A)             # gather a local copy for the reference computation

a = svdvals(Al)            # LAPACK singular values of the local copy
b = Elemental.svdvals(A)   # Elemental singular values of the distributed matrix
println("sum of diffs= ", sum(a - b))
JBlaschke commented 2 years ago

Hi @dhiepler -- I also saw your trouble ticket at NERSC, let me respond here so that the community can weigh in.

Without having tried it myself (I am currently traveling and have bad internet -- and alas, my burner laptop doesn't have Julia), here are some ideas (a rough check sketch follows the list):

  1. Are a and b ordered in the same way (normally it's largest-to-smallest, but this should be checked)?
  2. Is Matrix the right object to use? Or do we want DistMatrix?
  3. Can you check if a and b are actually singular values (i.e. check if they are the absolute value of eigenvalues)?
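
For reference, checks 1 and 3 can be sketched locally with plain LinearAlgebra. This is a rough, untested sketch: it assumes the a, b, and Al from the program above, and that Array(b) converts Elemental's result to a plain array (as is done later in this thread).

using LinearAlgebra

bvec = vec(Array(b))              # flatten Elemental's n-by-1 result into a Vector

# Check 1: are both results in the usual descending order?
@show issorted(a, rev = true)
@show issorted(bvec, rev = true)

# Check 3: the singular values of Al are the square roots of the eigenvalues of
# Al'Al (comparing against abs.(eigvals(Al)) only works for normal matrices).
sv_ref = sqrt.(max.(sort(eigvals(Symmetric(Al' * Al)); rev = true), 0.0))
@show a ≈ sv_ref
@show bvec ≈ sv_ref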

I will check these myself soon, but maybe @andreasnoack has some ideas in the meantime.

Cheers, Johannes

JBlaschke commented 2 years ago

I have more information, thanks also to @dhiepler. Here is an expanded test program:

using Distributed
using Elemental
using DistributedArrays
using LinearAlgebra
@everywhere using Random
@everywhere Random.seed!(123)

A = drandn(10,10)
Al = Matrix(A)
println("sum of diffs= ",sum(A-Al))

println("Determinant of singular matrix is 0")
#println("det(A)= ", det(A)) # not implemented for DArray
println("det(Al)= ", det(Al))

a = svdvals(Al)
b = Elemental.svdvals(A)
println("a= ", a)
println("b= ", b)
println("a-b= ", a-b)
println("a./b= ", a./b)

#println("det(a) = ", det(a))
println("sum of ratios after svd = ",sum(a./b))
println("sum of diffs after svd = ",sum(a-b))
println("approx= ", a ≈ b) 

What I found was rather startling: it works fine on my laptop with more than one MPI rank. When I run it using MPI on Cori, I get a constant-factor difference in each element:

One rank:

blaschke@nid12849:~/tmp> srun -n 1 julia test_sdvals_2.jl 
sum of diffs= 0.0
Determinant of singular matrix is 0
det(Al)= -1191.023311448886
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
b= [6.351702336978771; 5.880409677742518; 4.799195878542354; 3.3478338822775187; 2.980624165171743; 2.0566695219525486; 1.4953776892773092; 1.1050612757140172; 0.6085328305411005; 0.3219564747545404]
a-b= [4.440892098500626e-15; -2.6645352591003757e-15; 1.7763568394002505e-15; 1.3322676295501878e-15; -1.3322676295501878e-15; 0.0; 2.220446049250313e-16; -8.881784197001252e-16; -3.3306690738754696e-16; 2.220446049250313e-16]
a./b= [1.0000000000000007; 0.9999999999999996; 1.0000000000000004; 1.0000000000000004; 0.9999999999999996; 1.0; 1.0000000000000002; 0.9999999999999992; 0.9999999999999994; 1.0000000000000007]
sum of ratios after svd = 9.999999999999998
sum of diffs after svd = 2.7755575615628914e-15
approx= true

Two ranks

blaschke@nid12849:~/tmp> srun -n 2 julia test_sdvals_2.jl 
sum of diffs= 0.0
Determinant of singular matrix is 0
sum of diffs= 0.0
Determinant of singular matrix is 0
det(Al)= -1191.023311448886
det(Al)= -1191.023311448886
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
b= [12.703404673957543; 11.760819355485031; 9.598391757084704; 6.695667764555037; 5.961248330343489; 4.113339043905101; 2.990755378554618; 2.210122551428033; 1.2170656610821995; 0.6439129495090811]
b= [12.703404673957543; 11.760819355485031; 9.598391757084704; 6.695667764555037; 5.961248330343489; 4.113339043905101; 2.990755378554618; 2.210122551428033; 1.2170656610821995; 0.6439129495090811]
a-b= [-6.351702336978768; -5.880409677742516; -4.799195878542348; -3.3478338822775173; -2.980624165171747; -2.056669521952552; -1.4953776892773085; -1.1050612757140168; -0.6085328305410993; -0.3219564747545404]
a-b= [-6.351702336978768; -5.880409677742516; -4.799195878542348; -3.3478338822775173; -2.980624165171747; -2.056669521952552; -1.4953776892773085; -1.1050612757140168; -0.6085328305410993; -0.3219564747545404]
a./b= [0.5000000000000003; 0.5; 0.5000000000000003; 0.5000000000000002; 0.49999999999999956; 0.49999999999999956; 0.5000000000000001; 0.4999999999999999; 0.5000000000000003; 0.5000000000000002]
sum of ratios after svd = 5.0
sum of diffs after svd = -28.947363732952414
a./b= [0.5000000000000003; 0.5; 0.5000000000000003; 0.5000000000000002; 0.49999999999999956; 0.49999999999999956; 0.5000000000000001; 0.4999999999999999; 0.5000000000000003; 0.5000000000000002]
sum of ratios after svd = 5.0
sum of diffs after svd = -28.947363732952414
approx= false
approx= false

Four ranks

blaschke@nid12849:~/tmp> srun -n 4 julia test_sdvals_2.jl 
sum of diffs= 0.0
Determinant of singular matrix is 0
sum of diffs= 0.0
sum of diffs= 0.0
Determinant of singular matrix is 0
Determinant of singular matrix is 0
sum of diffs= 0.0
Determinant of singular matrix is 0
det(Al)= -1191.023311448886
det(Al)= -1191.023311448886
det(Al)= -1191.023311448886
det(Al)= -1191.023311448886
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
b= [25.406809347915086; 23.521638710970063; 19.196783514169415; 13.391335529110073; 11.92249666068697; 8.2266780878102; 5.981510757109234; 4.420245102856065; 2.4341313221643994; 1.2878258990181617]
b= [25.406809347915086; 23.521638710970063; 19.196783514169415; 13.391335529110073; 11.92249666068697; 8.2266780878102; 5.981510757109234; 4.420245102856065; 2.4341313221643994; 1.2878258990181617]
b= [25.406809347915086; 23.521638710970063; 19.196783514169415; 13.391335529110073; 11.92249666068697; 8.2266780878102; 5.981510757109234; 4.420245102856065; 2.4341313221643994; 1.2878258990181617]
b= [25.406809347915086; 23.521638710970063; 19.196783514169415; 13.391335529110073; 11.92249666068697; 8.2266780878102; 5.981510757109234; 4.420245102856065; 2.4341313221643994; 1.2878258990181617]
a-b= [-19.05510701093631; -17.641229033227546; -14.397587635627058; -10.043501646832553; -8.941872495515229; -6.170008565857652; -4.486133067831925; -3.3151838271420493; -1.8255984916232992; -0.965869424263621]
a-b= [-19.05510701093631; -17.641229033227546; -14.397587635627058; -10.043501646832553; -8.941872495515229; -6.170008565857652; -4.486133067831925; -3.3151838271420493; -1.8255984916232992; -0.965869424263621]
a-b= [-19.05510701093631; -17.641229033227546; -14.397587635627058; -10.043501646832553; -8.941872495515229; -6.170008565857652; -4.486133067831925; -3.3151838271420493; -1.8255984916232992; -0.965869424263621]
a-b= [-19.05510701093631; -17.641229033227546; -14.397587635627058; -10.043501646832553; -8.941872495515229; -6.170008565857652; -4.486133067831925; -3.3151838271420493; -1.8255984916232992; -0.965869424263621]
a./b= [0.25000000000000017; 0.25; 0.2500000000000001; 0.2500000000000001; 0.24999999999999992; 0.24999999999999983; 0.25000000000000017; 0.25; 0.2500000000000001; 0.25000000000000017]
sum of ratios after svd = 2.5000000000000004
sum of diffs after svd = -86.84209119885725
a./b= [0.25000000000000017; 0.25; 0.2500000000000001; 0.2500000000000001; 0.24999999999999992; 0.24999999999999983; 0.25000000000000017; 0.25; 0.2500000000000001; 0.25000000000000017]
sum of ratios after svd = 2.5000000000000004
sum of diffs after svd = -86.84209119885725
a./b= [0.25000000000000017; 0.25; 0.2500000000000001; 0.2500000000000001; 0.24999999999999992; 0.24999999999999983; 0.25000000000000017; 0.25; 0.2500000000000001; 0.25000000000000017]
sum of ratios after svd = 2.5000000000000004
sum of diffs after svd = -86.84209119885725
a./b= [0.25000000000000017; 0.25; 0.2500000000000001; 0.2500000000000001; 0.24999999999999992; 0.24999999999999983; 0.25000000000000017; 0.25; 0.2500000000000001; 0.25000000000000017]
sum of ratios after svd = 2.5000000000000004
sum of diffs after svd = -86.84209119885725
approx= false
approx= false
approx= false
approx= false

My crude guess would be that there is a factor difference of 1/n (n = number of ranks). I built the version on Cori against Cray MPICH. Any ideas, @andreasnoack?
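
One way to get exactly that pattern would be if the assembled distributed matrix ends up as n*A (for example, every rank contributing a full copy of A instead of just its local slice), since singular values scale linearly with the matrix. A minimal local illustration of that scaling law (plain LinearAlgebra with a stand-in n; this is not an Elemental reproduction of the bug):

using LinearAlgebra

M = randn(10, 10)
n = 2                                   # stand-in for the number of ranks
# Singular values scale linearly, so an accidentally n-times-too-large matrix
# gives singular values that are n times too large -- matching a ./ b ≈ 1/n above.
@show svdvals(n .* M) ≈ n .* svdvals(M)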

andreasnoack commented 2 years ago

I can reproduce the bug on my laptop. My guess is that this is an issue with the conversion between Elemental arrays and DistributedArrays.DArray. If I just use Elemental's DistMatrix, I get correct results, i.e.:

using Elemental
using LinearAlgebra

A = Elemental.DistMatrix(Float64)
Elemental.gaussian!(A, 10, 10)
Al = Array(A)
println("sum of diffs= ", sum(A - Al))

a = svdvals(Al)
b = svdvals(A)
println("a = ", a)
println("b = ", b)
println("a - b= ", a - b)
println("a ./ b= ", a ./ b)

@info "DONE!"
JBlaschke commented 2 years ago

Thanks @andreasnoack -- I just verified that this also works on Cori. This leaves an open question, given what @dhiepler is trying to do with the original code: how do we create an Elemental array from an existing array (if not from a DArray)?

vchuravy commented 2 years ago
using MPIClusterManagers, Distributed

man = MPIManager(np = 4)
addprocs(man) # backend communicates with MPI

@everywhere begin
    using Elemental
    using LinearAlgebra
end

using DistributedArrays
A = drandn(50,50)
Al = Matrix(A)

a = svdvals(Al)
b = Array(Elemental.svdvals(A))

println("sum of diffs= ",sum(a-b))

@mpi_do man begin
    grid = Elemental.DefaultGrid()
    @show (Elemental.row(grid), Elemental.column(grid))
    A = Elemental.DistMatrix(Float64)
    Elemental.zeros!(A, 50,50)
    @show Elemental.localHeight(A), Elemental.height(A)
    @show Elemental.localWidth(A), Elemental.width(A)
end

On https://github.com/JuliaParallel/Elemental.jl/pull/77 this resulted for me in:

      From worker 3:    (Elemental.row(grid), Elemental.column(grid)) = (0, 0)
      From worker 2:    (Elemental.row(grid), Elemental.column(grid)) = (0, 0)
      From worker 4:    (Elemental.row(grid), Elemental.column(grid)) = (0, 0)
      From worker 5:    (Elemental.row(grid), Elemental.column(grid)) = (0, 0)
      From worker 3:    (Elemental.localHeight(A), Elemental.height(A)) = (50, 50)
      From worker 2:    (Elemental.localHeight(A), Elemental.height(A)) = (50, 50)
      From worker 4:    (Elemental.localHeight(A), Elemental.height(A)) = (50, 50)
      From worker 3:    (Elemental.localWidth(A), Elemental.width(A)) = (50, 50)
      From worker 2:    (Elemental.localWidth(A), Elemental.width(A)) = (50, 50)
      From worker 5:    (Elemental.localHeight(A), Elemental.height(A)) = (50, 50)
      From worker 4:    (Elemental.localWidth(A), Elemental.width(A)) = (50, 50)
      From worker 5:    (Elemental.localWidth(A), Elemental.width(A)) = (50, 50)

Meaning that the Grid configuration did not work. I double-checked my MPI.jl: I was using OpenMPI_jll as my MPI provider (I would expect that using the system MPI yields the same effect). The default Elemental_jll we are shipping uses MPICH_jll.

You can check yourself with:

julia> MPI.identify_implementation()
(MPI.MPICH, v"3.4.2")

After switching to MPICH_jll:

JULIA_MPI_BINARY="MPICH_jll" julia --project=. -e 'import Pkg; Pkg.build("MPI")'

I got:

vchuravy@odin ~/s/s/a/svd2> JULIA_PROJECT=(pwd) julia --project=. test.jl
      From worker 3:    (Elemental.row(grid), Elemental.column(grid)) = (1, 0)
      From worker 2:    (Elemental.row(grid), Elemental.column(grid)) = (0, 0)
      From worker 4:    (Elemental.row(grid), Elemental.column(grid)) = (0, 1)
      From worker 5:    (Elemental.row(grid), Elemental.column(grid)) = (1, 1)
      From worker 3:    (Elemental.localHeight(A), Elemental.height(A)) = (25, 50)
      From worker 2:    (Elemental.localHeight(A), Elemental.height(A)) = (25, 50)
      From worker 4:    (Elemental.localHeight(A), Elemental.height(A)) = (25, 50)
      From worker 3:    (Elemental.localWidth(A), Elemental.width(A)) = (25, 50)
      From worker 2:    (Elemental.localWidth(A), Elemental.width(A)) = (25, 50)
      From worker 5:    (Elemental.localHeight(A), Elemental.height(A)) = (25, 50)
      From worker 4:    (Elemental.localWidth(A), Elemental.width(A)) = (25, 50)
      From worker 5:    (Elemental.localWidth(A), Elemental.width(A)) = (25, 50)

When https://github.com/JuliaParallel/MPI.jl/pull/513 lands we will be able to fix issues like this.

Now:

using MPIClusterManagers, Distributed

man = MPIManager(np = 4)
addprocs(man) # backend communicates with MPI

using Elemental
using LinearAlgebra
using DistributedArrays

A = drandn(50,50)
Al = Matrix(A)

a = svdvals(Al)
b = Array(Elemental.svdvals(A))

println("sum of diffs= ",sum(a-b))

yields sum of diffs= 1.2073675392798577e-14

JBlaschke commented 2 years ago

Thanks @vchuravy for looking into this as well. Unfortunately, it doesn't work on Cori. Is the MPIClusterManagers part necessary? (I find that it seems to hang/time out on Cori -- maybe we can look into why in a separate issue / as part of the Julia for HPC group.)

Taking out the MPIClusterManagers part, I get:

blaschke@nid00805:~/tmp> JULIA_MPI_BINARY="MPICH_jll" srun -n 4 julia test_sdvals_4.jl 
sum of diffs= -297.4855631120604
sum of diffs= -296.6150430552368
sum of diffs= -303.4465072750694
sum of diffs= -293.6032813641691

using the system MPICH (i.e. cray-mpich).

vchuravy commented 2 years ago

Yeah, the MPIClusterManager is necessary for the variant of the code that uses DistributedArrays; otherwise the ranks are not wired up correctly.
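
A minimal sketch for checking that the ranks really are wired up (assuming MPI.jl is available on the workers and the MPIManager setup from the earlier snippet):

using MPIClusterManagers, Distributed

man = MPIManager(np = 4)
addprocs(man)

@mpi_do man begin
    using MPI
    # Every worker should report a distinct rank and a communicator size equal
    # to np; seeing size == 1 everywhere means the ranks are not wired up.
    @show MPI.Comm_rank(MPI.COMM_WORLD), MPI.Comm_size(MPI.COMM_WORLD)
end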

JBlaschke commented 2 years ago

Oh feck! So now I have to fix MPIClusterManager. @vchuravy, should I open an issue on https://github.com/JuliaParallel/MPIClusterManagers.jl?

vchuravy commented 2 years ago

should I open an issue

Please do :)

JBlaschke commented 2 years ago

Done!

JBlaschke commented 2 years ago

@vchuravy Based on our discussion in JuliaParallel/MPIClusterManagers.jl#26, your solution was to run the addprocs version without srun. That won't work here, though, because without srun, running:

using Elemental

throws

Wed Dec  8 12:00:48 2021: [unset]:_pmi_alps_init:alps_get_placement_info returned with error -1
Wed Dec  8 12:00:48 2021: [unset]:_pmi_init:_pmi_alps_init returned -1
[Wed Dec  8 12:00:59 2021] [c3-0c2s12n1] Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(647).......:  PMI2 init failed: 1

when trying to initialize Elemental.jl ... we appear to have a catch-22 :(

JBlaschke commented 2 years ago

And if I avoid the addprocs and run the following within srun:

using MPIClusterManagers, Distributed
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)

using Elemental
using LinearAlgebra
using DistributedArrays

A = drandn(50,50)
Al = Matrix(A)

a = svdvals(Al)
b = Array(Elemental.svdvals(A))

println("sum of diffs= ",sum(a-b))
MPIClusterManagers.stop_main_loop(manager)

I get a deadlock.

vchuravy commented 2 years ago

Without srun, but with salloc?

JBlaschke commented 2 years ago

Elemental needs srun (without srun, Elemental won't be able to initialize Cray MPICH).

vchuravy commented 2 years ago

I am confused. MPIClusterManager should use srun/mpiexec to connect the "worker" processes, which allows them to use Elemental and to communicate among themselves using MPI. The front-end should not need to run under mpiexec.

JBlaschke commented 2 years ago

I don't know all the details at this point, but if I run Elemental (remember, we don't use the one provided by BinaryBuilder -- that one doesn't use Cray MPICH, and therefore doesn't work on Cori -- but instead we build our own version using the Cray compiler wrappers) without srun, then I get:

MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(647).......:  PMI2 init failed: 1

This is a common error when a program calls MPI_Init() without being launched under srun. I haven't had a chance to dig around the guts of MPIClusterManagers (I plan to). I suspect that MPIClusterManagers launches the workers using srun, thereby avoiding the srun-less MPI_Init.

Here is a theory that I will try next: would wrapping using Elemental in a @mpi_do fix this? (I.e., we want to avoid the situation where anything except the workers tries to initialize MPI.)
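
For concreteness, a rough sketch of what that theory could look like (untested on Cori; it combines the MPIManager/@mpi_do pattern from earlier with andreasnoack's DistMatrix example, and np = 4 and the 50x50 size are arbitrary):

using MPIClusterManagers, Distributed

man = MPIManager(np = 4)
addprocs(man)   # workers are launched by the manager (under srun/mpiexec)

# Only the workers load Elemental and therefore call MPI_Init; the front-end
# process never touches MPI, so it should not have to run under srun itself.
@mpi_do man begin
    using Elemental
    using LinearAlgebra
    A = Elemental.DistMatrix(Float64)
    Elemental.gaussian!(A, 50, 50)
    s = svdvals(A)                       # collective call across the ranks
    println("largest singular value: ", Array(s)[1])
end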

vchuravy commented 2 years ago

Oh yeah that makes sense. That is rather unfriendly behavior on the Cray side... We still load the Elemental library on the front-end process.

vchuravy commented 2 years ago

Maybe we can use the PMI cluster manager I wrote: https://github.com/JuliaParallel/PMI.jl/blob/main/examples/distributed.jl

JBlaschke commented 2 years ago

Alas @vchuravy Cray MPICH wants PMI2 :(

cori01:~> export LD_LIBRARY_PATH=/usr/lib64/slurmpmi:$LD_LIBRARY_PATH
cori01:~> srun -n 2 -C haswell julia pmi_simp.jl
srun: job 51995644 queued and waiting for resources
srun: job 51995644 has been allocated resources
/global/common/cori_cle7/software/julia/julia/1.6.0/bin/julia: symbol lookup error: /opt/cray/pe/mpt/7.7.14/gni/mpich-intel/16.0/lib/libmpich.so: undefined symbol: PMI2_Init
srun: error: nid00915: task 0: Exited with exit code 127
srun: launch/slurm: _step_signal: Terminating StepId=51995644.0
srun: error: nid00915: task 1: Exited with exit code 1