JuliaParallel / Elemental.jl

Julia interface to the Elemental linear algebra library.

How to go from files to a distributed matrix #74

Open dh-ilight opened 2 years ago

dh-ilight commented 2 years ago

I have files that each hold one column of an array, and I would like to create an `Elemental.DistMatrix` from those files, loading it in parallel. An earlier question was answered by pointing to `Elemental/test/lav.jl`, so I made the following program by extracting from `lav.jl`. It works when run on a single process, but hangs with 2 processes under `mpiexecjl`. I am using Julia 1.5 on a 4-core machine running CentOS 7.5. Please let me know what is wrong with the program and how to load my column files in parallel (roughly what I have in mind is sketched after the program below). I intend to eventually run a program using `DistMatrix` on a machine with hundreds of cores.

```julia
# to import MPIManager
using MPIClusterManagers, Distributed

# Manage MPIManager manually -- all MPI ranks do the same work
# Start MPIManager
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)

# Init an Elemental.DistMatrix
@everywhere function spread(n0, n1)
    println("start spread")
    height = n0*n1
    width = n0*n1
    h = El.Dist(n0)
    w = El.Dist(n1)
    A = El.DistMatrix(Float64)
    El.gaussian!(A, n0, n1) # how to init size ?
    localHeight = El.localHeight(A)
    println("localHeight ", localHeight)
    El.reserve(A, 6*localHeight) # number of queue entries
    println("after reserve")
    for sLoc in 1:localHeight
        s = El.globalRow(A, sLoc)
        x0 = ((s-1) % n0) + 1
        x1 = div((s-1), n0) + 1
        El.queueUpdate(A, s, s, 11.0)
        println("sLoc $sLoc, x0 $x0")
        if x0 > 1
            El.queueUpdate(A, s, s - 1, -10.0)
            println("after q")
        end
        if x0 < n0
            El.queueUpdate(A, s, s + 1, 20.0)
        end
        if x1 > 1
            El.queueUpdate(A, s, s - n0, -30.0)
        end
        if x1 < n1
            El.queueUpdate(A, s, s + n0, 40.0)
        end
        # The dense last column
        # El.queueUpdate(A, s, width, floor(-10/height))
    end # for
    println("before processQueues")
    El.processQueues(A)
    println("after processQueues") # with 2 nodes never gets here
    return A
end

@mpi_do manager begin
    using MPI, LinearAlgebra, Elemental
    const El = Elemental
    res = spread(4, 4)
    println("res=", res)

    # Manage MPIManager manually:
    # Elemental needs to be finalized before shutting down MPIManager
    # println("[rank $(MPI.Comm_rank(comm))]: Finalizing Elemental")
    Elemental.Finalize()
    # println("[rank $(MPI.Comm_rank(comm))]: Done finalizing Elemental")
end # mpi_do

# Shut down MPIManager
MPIClusterManagers.stop_main_loop(manager)
```
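
To make the goal concrete, here is roughly what I have in mind for the file-loading version, replacing the test values above with entries read from my per-column files. This is only a sketch: the file names, the use of `DelimitedFiles.readdlm`, and the `gaussian!` call (reused from above just to give the matrix its global size) are placeholders, and sizing the matrix properly is exactly the part I am unsure about.

```julia
# Hypothetical sketch only: assumes one plain-text file per column ("col_1.txt", ...),
# with one value per line. It follows the same pattern as spread() above: each rank
# queues updates only for the rows it owns, then everyone calls processQueues.
using DelimitedFiles

function load_columns(height, ncols)
    A = El.DistMatrix(Float64)
    El.gaussian!(A, height, ncols)        # placeholder to set the global size;
                                          # a zero-initialized matrix is what I want
    localHeight = El.localHeight(A)
    El.reserve(A, localHeight*ncols)      # upper bound on entries queued by this rank
    for j in 1:ncols
        col = vec(readdlm("col_$j.txt"))  # every rank reads the whole column file
        for sLoc in 1:localHeight
            s = El.globalRow(A, sLoc)
            El.queueUpdate(A, s, j, col[s])
        end
    end
    El.processQueues(A)                   # collective; every rank must reach this
    return A
end
```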

Thank you

JBlaschke commented 2 years ago

Based on a NERSC user ticket, which inspired #73. cc @andreasnoack

~~@dhiepler can you put the code snippet in a code block (put ```julia at the beginning and ``` at the end)~~

andreasnoack commented 2 years ago

The program looks right to me. To debug this, I'd try to remove the `MPIClusterManagers`/`Distributed` parts and then run the script with `mpiexec`, like we do in https://github.com/JuliaParallel/Elemental.jl/blob/83089155659739fea1aae476c6fd492b1ee20850/test/runtests.jl#L19
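
For example, a minimal stand-alone script along these lines (a sketch only; the file name and the 4×4 size are arbitrary, and the updates mirror the `spread` function from the issue) could be launched directly with `mpiexec -n 2 julia debug_spread.jl`:

```julia
# debug_spread.jl -- stripped-down sketch: no MPIClusterManagers/Distributed,
# every MPI rank simply runs the same code.
using MPI, LinearAlgebra, Elemental
const El = Elemental

A = El.DistMatrix(Float64)
El.gaussian!(A, 4, 4)
localHeight = El.localHeight(A)
El.reserve(A, localHeight)              # one queued entry per local row
for sLoc in 1:localHeight
    s = El.globalRow(A, sLoc)
    El.queueUpdate(A, s, s, 11.0)       # same kind of update as in spread()
end
El.processQueues(A)                     # the call reported to hang with 2 ranks
println("after processQueues")

Elemental.Finalize()
```

If this minimal script also hangs at `processQueues`, that would point away from the `MPIClusterManagers` setup and toward the Elemental side.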

JBlaschke commented 2 years ago

FTR @dhiepler, on Cori that would be:

```sh
srun -n $NUM_RANKS julia path/to/test.jl
```