JuliaLang / Distributed.jl

Create and control multiple Julia processes remotely for distributed computing. Ships as a Julia stdlib.
https://docs.julialang.org/en/v1/stdlib/Distributed/
MIT License

SharedArray not working on remote machines #32

Open pearcemc opened 8 years ago

pearcemc commented 8 years ago

I am trying to set up SharedArrays on remote machines, i.e. arrays shared among processes on the same remote machine. Unfortunately this doesn't seem to work. I am using Julia version 0.5.0-dev+749.

julia> addprocs(1) #add one process on the master node
1-element Array{Int64,1}:
 2

julia> println(procs())
[1,2]

julia> S = SharedArray(Int, (3,4), init = S -> S[localindexes(S)] = myid(), pids=Int[1,2])
3x4 SharedArray{Int64,2}:
 1  1  2  2
 1  1  2  2
 1  1  2  2

julia> using ClusterManagers

julia> remotes = addprocs(SlurmManager(2), nodes=1)
srun: job 1828134 queued and waiting for resources
srun: job 1828134 has been allocated resources
2-element Array{Int64,1}:
 3
 4

julia> for w in remotes #both remote processes are on the same machine
           println(remotecall_fetch(readall, w, `hostname`)) 
       end
mrc-bsu-tesla1

mrc-bsu-tesla1

julia> r = @spawnat remotes[1] S = SharedArray(Int, (3,4), init = S -> S[localindexes(S)] = myid(), pids=remotes)
RemoteRef{Channel{Any}}(3,1,14)

julia> fetch(r)
3x4 SharedArray{Int64,2}:
 #undef  #undef  #undef  #undef
 #undef  #undef  #undef  #undef
 #undef  #undef  #undef  #undef

julia> r = @spawnat remotes[1] S*eye(4) #convert to regular array
RemoteRef{Channel{Any}}(3,1,16)

julia> fetch(r)
ERROR: On worker 3:
UndefRefError: access to undefined reference
 [inlined code] from sharedarray.jl:294
 in copy_transpose! at abstractarray.jl:513
 in copy_transpose! at linalg/matmul.jl:349
 in generic_matmatmul! at linalg/matmul.jl:459
 in * at linalg/matmul.jl:144
 in anonymous at multi.jl:1330
 in anonymous at multi.jl:889
 in run_work_thunk at multi.jl:645
 in run_work_thunk at multi.jl:654
 in anonymous at task.jl:54
 in remotecall_fetch at multi.jl:731
 [inlined code] from multi.jl:368
 in call_on_owner at multi.jl:776
 in fetch at multi.jl:784
amitmurthy commented 8 years ago

Do you see the same error with

S = SharedArray(Int, (3,4), init = S -> S[localindexes(S)] = myid(), pids=remotes)
r = @spawnat remotes[1] S*eye(4)
fetch(r)
pearcemc commented 8 years ago

Amit, thanks for the suggestion: there is progress.

julia> S = SharedArray(Int, (3,4), init = S -> S[localindexes(S)] = myid(), pids=remotes)
3x4 SharedArray{Int64,2}:
 #undef  #undef  #undef  #undef
 #undef  #undef  #undef  #undef
 #undef  #undef  #undef  #undef

julia> r = @spawnat remotes[1] S*eye(4)
RemoteRef{Channel{Any}}(5,1,115)

julia> fetch(r)
3x4 Array{Float64,2}:
 5.0  5.0  6.0  6.0
 5.0  5.0  6.0  6.0
 5.0  5.0  6.0  6.0

It's still a bit disturbing to see undefs. I guess this is because the master process doesn't have access to that memory and there's no code pulling it from the remote. I'll try to get over it as it looks like some work can be done now!
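For what it's worth, here is a sketch of pulling the contents back explicitly (same session assumed; sdata returns the plain Array on a process where the segment is mapped):

# Fetch a regular Array copy of the shared data from a worker that has it mapped.
A = @fetchfrom remotes[1] sdata(S)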

I take it your suggestion means that the SharedArray code is designed to be invoked from the master process (can't believe I didn't try this permutation).

Much appreciated.

amitmurthy commented 8 years ago

The previous invocation, while a bit inefficient, should have worked too. And we could do a better job of "show" on unmapped workers.

amitmurthy commented 8 years ago

The undefined ref error in your case occurred because the S in the invocation of S*eye(4) on worker 3 was actually the one constructed on pids 1 and 2.
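As a quick sanity check (just a sketch, assuming the session above), procs(S) reports the pids a SharedArray is mapped on, so you can confirm which array a worker will see before spawning work on it:

# procs(S) lists the workers on which the shared segment is mapped.
println(procs(S))                        # [1,2] for the first array in this thread

# Only send computation to one of those pids.
r = @spawnat first(procs(S)) S * eye(4)
fetch(r)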

That leaves only the issue of a better show.

pearcemc commented 8 years ago

Further issues include map(f, S::SharedArray) apparently not working for SharedArrays hosted entirely on remote machines.

The setup:

julia> @everywhere using ClusterManagers

julia> @everywhere blas_set_num_threads(12)

julia> @everywhere topo = describepids(remote=2)   #from my ClusterManagers/utils branch

julia> function read2remotes(fpath::AbstractString, dims, elty::DataType, topo)
           # One file-backed SharedArray per host, mapped only on the workers
           # that live on that host.
           remote_shared_arrays = Dict([])
           @sync begin
               @async for pid in keys(topo)
                   remotes_on_same_host = topo[pid]
                   remote_shared_arrays[pid] = SharedArray(fpath, elty, dims, pids=remotes_on_same_host)
               end
           end
           remote_shared_arrays
       end
read2remotes (generic function with 1 method)

julia> dims = (2000,36)
(2000,36)

julia> rsay = read2remotes(fps[1], dims, Float32, topo)
Dict{Any,Any} with 3 entries:
  36 => 2000x36 SharedArray{Float32,2}:…
  12 => 2000x36 SharedArray{Float32,2}:…
  24 => 2000x36 SharedArray{Float32,2}:…

The problem:

julia> map(abs, rsay[12]) #works fine on sharedarray in local memory
2000x36 Array{Float32,2}:
 3.14159    1.04518    1.63132    1.94399     1.3988     1.08292    1.06984     0.535975   1.30225    …  0.035658   0.172445   4.89128    3.39776    4.79956    0.562555    0.207192   1.8894   
 0.336003   0.540138   1.51911    0.325294    1.2718     2.46567    0.956635    1.69363    1.17613       0.369635   0.0238586  0.827886   1.26854    0.79305    0.137688    1.25333    1.08266  

julia> map(abs, rsay[36]) #fails on sharedarray on remote machine
ERROR: UndefRefError: access to undefined reference
 in similar at sharedarray.jl:351
 in map at sharedarray.jl:353

julia> fetch(@spawnat 36 map(abs, rsay[36])) #works remotely executed
2000x36 Array{Float32,2}:
 3.14159    1.04518    1.63132    1.94399     1.3988     1.08292    1.06984     0.535975   1.30225    …  0.035658   0.172445   4.89128    3.39776    4.79956    0.562555    0.207192   1.8894   
 0.336003   0.540138   1.51911    0.325294    1.2718     2.46567    0.956635    1.69363    1.17613       0.369635   0.0238586  0.827886   1.26854    0.79305    0.137688    1.25333    1.08266 

This shows an inconsistency between SharedArray creation, which we found had to be executed on the local machine, and SharedArray computation, which apparently has to be executed on the remote host.

amitmurthy commented 8 years ago

This is by design. You need to execute the computation on the host where the shmem is mapped; otherwise we would just be pulling the entire array over the network.

SharedArray creation can happen from any host, as long as all the pids specified are on the same machine.
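For example, a minimal sketch of that pattern, reusing the remotes from the session above:

# Create from the master: all pids in `remotes` live on the same remote host.
S = SharedArray(Int, (3,4), init = S -> S[localindexes(S)] = myid(), pids=remotes)

# Run the computation on a pid where the shmem is mapped and fetch only the result.
total = @fetchfrom first(procs(S)) sum(S)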

pearcemc commented 8 years ago

Hi Amit, does that last comment mean the SharedArray creation bug shown above is now gone on master?

amitmurthy commented 8 years ago

It didn't exist. See my comment above: https://github.com/JuliaLang/Distributed.jl/issues/32

jakebolewski commented 8 years ago

I think you are misusing the parallel features here. You have to send the computation to the data and not the other way around.

pearcemc commented 8 years ago

@jakebolewski thanks for the tip. As you can see from the examples, I have now tried something that works. It would help if there were further documentation, though: being able to create remotely hosted SharedArrays from the local machine runs somewhat counter to that model (at least enough to confuse newbies like me).

@amitmurthy, thanks I get your comment now.

Also, I think I have something that helps with the #undefs when printing remote SharedArrays:

# Render S to a string on whichever process this runs on.
@everywhere function getrepresentation(S)
    buf = IOBuffer()
    td = TextDisplay(buf)
    Base.Multimedia.display(td, S)
    str = takebuf_string(buf)
    return str
end

# Display a remote-hosted SharedArray by rendering it on a pid where it is mapped.
function Base.display(S::SharedArray{Float32, 2})
    validpid = minimum(S.pids)
    repr = @fetchfrom validpid getrepresentation(S)
    print_with_color(:bold, repr)
end
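With that override in place, displaying a remote-hosted SharedArray pulls the rendered text from a mapped pid, so (assuming the rsay dict from my earlier session) something like this prints real values instead of #undef:

display(rsay[36])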

Clearly, if this made it into sharedarray.jl, the @everywhere would be redundant. The eltype restriction in the signature could probably go too, but I haven't tested that.

There is slightly different behaviour depending on whether the host is local or remote. I guess this has to do with how Julia is initialised on the remotes and with not printing out so many rows/columns of a matrix; I can't find the setting, however.

Would it be worth submitting this, and if so, which git branch etc. should it go through?

kshyatt commented 8 years ago

Hi @pearcemc, are/were you considering submitting a documentation PR or a change to the way SharedArrays work? Either is fine - you can look at CONTRIBUTING.md in the top level Julia directory for tips (I personally find it easier to read on GitHub). If that's not sufficient, I'd be happy to help you on IRC on #julia at freenode, over Gitter, or over email.