EnzymeAD / Enzyme.jl

Julia bindings for the Enzyme automatic differentiator
https://enzyme.mit.edu
MIT License

MPI ANY_TAG #512

Open michel2323 opened 1 year ago

michel2323 commented 1 year ago

I implemented a simple halo exchange:

using MPI
using Enzyme

struct Heat
    Tnext::Vector{Float64}
end

function halo(heat)
    next = heat.Tnext
    np = MPI.Comm_size(MPI.COMM_WORLD)
    rank = MPI.Comm_rank(MPI.COMM_WORLD)
    requests = Vector{MPI.Request}()
    if rank != 0
        push!(requests, MPI.Isend(next[2:2], MPI.COMM_WORLD; dest=rank-1))
        push!(requests, MPI.Irecv!(next[1:1], MPI.COMM_WORLD; source=rank-1))
    end
    if rank != np-1
        push!(requests, MPI.Isend(next[end-1:end-1], MPI.COMM_WORLD; dest=rank+1))
        push!(requests, MPI.Irecv!(next[end:end], MPI.COMM_WORLD; source=rank+1))
    end
    for request in requests
        MPI.Wait!(request)
    end
    return nothing
end

MPI.Init()
heat = Heat(ones(10))
halo(heat)

dheat = Heat(zeros(10))
autodiff(halo, Duplicated(heat, dheat))
MPI.Barrier(MPI.COMM_WORLD)
MPI.Finalize()

I'm using Julia 1.8.2 and the MPI artifact, with the following packages:

  [d360d2e6] ChainRulesCore v1.15.6
  [864edb3b] DataStructures v0.18.13
  [7da242da] Enzyme v0.10.11
  [f67ccb44] HDF5 v0.16.11
  [da04e1cc] MPI v0.20.2
  [e88e6eb3] Zygote v0.6.49
  [37e2e46d] LinearAlgebra
  [9e88b42a] Serialization

Logs for 1 rank and 2 ranks are attached: out_2.log, out_1.log. I don't think they differ much, though. The barrier seems to be needed, as I otherwise get MPI calls after MPI.Finalize has been called.

wsmoses commented 1 year ago

@michel2323 try https://github.com/EnzymeAD/Enzyme.jl/pull/513

michel2323 commented 1 year ago

> @michel2323 try #513

@wsmoses That gives the same log as far as I can tell. I've attached the new logs here to be sure: out_22.log, out_11.log

michel2323 commented 1 year ago

Do I need to build Enzyme proper?

wsmoses commented 1 year ago

These are now indeed different errors.

wsmoses commented 1 year ago

@michel2323 try building Enzyme proper with https://github.com/EnzymeAD/Enzyme/pull/897

michel2323 commented 1 year ago

Differentiation works now! However, the differentiated code deadlocks in an MPI_Wait. I will try to find out why.

michel2323 commented 1 year ago

I went back to MPI@0.19 with the following signatures for the sends/receives:

 push!(requests, MPI.Isend(next[2:2], rank-1, 0, MPI.COMM_WORLD))
 push!(requests, MPI.Irecv!(next[1:1], rank-1, 0, MPI.COMM_WORLD))

This worked. Adding an explicit tag value with MPI@0.20 also gets rid of the deadlock:

push!(requests, MPI.Isend(next[2:2], MPI.COMM_WORLD; dest=rank-1, tag=0))
push!(requests, MPI.Irecv!(next[1:1], MPI.COMM_WORLD; source=rank-1, tag=0))

However, according to the MPI.jl docs, that is also the default value for the tag (https://juliaparallel.org/MPI.jl/stable/reference/pointtopoint/#Initiation), so I don't really understand.

michel2323 commented 1 year ago

Never mind. Enzyme does not seem to handle MPI.ANY_TAG, which is the default tag for MPI.Irecv!.
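
For reference, a minimal sketch of the workaround found in this thread: every send and receive passes the same explicit tag, so the receive never falls back to MPI.ANY_TAG. It assumes the MPI.jl 0.20 keyword API. The name halo! and its tag keyword are hypothetical and not part of the original reproducer. The sketch also uses views rather than slices, because a slice such as next[1:1] is a copy in Julia and a receive into it would not update the original array.

```julia
using MPI

# Exchange one-cell halos with the left and right neighbours.
# `halo!` and its `tag` keyword are illustrative names, not an MPI.jl or Enzyme API.
function halo!(T::Vector{Float64}, comm::MPI.Comm; tag::Integer = 0)
    np = MPI.Comm_size(comm)
    rank = MPI.Comm_rank(comm)
    requests = MPI.Request[]
    if rank != 0
        # Views alias T, so the received value lands in T[1].
        push!(requests, MPI.Isend(@view(T[2:2]), comm; dest = rank - 1, tag = tag))
        push!(requests, MPI.Irecv!(@view(T[1:1]), comm; source = rank - 1, tag = tag))
    end
    if rank != np - 1
        push!(requests, MPI.Isend(@view(T[end-1:end-1]), comm; dest = rank + 1, tag = tag))
        push!(requests, MPI.Irecv!(@view(T[end:end]), comm; source = rank + 1, tag = tag))
    end
    # Block until every outstanding send and receive has completed.
    for request in requests
        MPI.Wait(request)
    end
    return nothing
end

MPI.Init()
T = ones(10)
halo!(T, MPI.COMM_WORLD)
```

On a single rank both branches are skipped and T is left untouched. With several ranks, each interior boundary cell is overwritten by the neighbour's adjacent value.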

vchuravy commented 1 year ago

@michel2323 can I ask you to add an MPI test to Enzyme.jl?

wsmoses commented 1 year ago

Add an Enzyme.API.printall!(true) to see what is happening?
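
A minimal sketch of how that flag might be used, assuming a recent Enzyme.jl release in which autodiff takes the mode as its first argument (the reproducer above uses the older call form without a mode). The function sumsq is a hypothetical stand-in for the code under investigation. With the flag enabled, Enzyme prints the LLVM IR it differentiates and the derivative code it generates.

```julia
using Enzyme

# Ask Enzyme to print the functions it differentiates and the generated derivatives.
Enzyme.API.printall!(true)

# Hypothetical stand-in for the function being debugged.
function sumsq(x)
    s = 0.0
    for v in x
        s += v * v
    end
    return s
end

x = [1.0, 2.0, 3.0]
dx = zero(x)   # shadow array that accumulates the gradient
autodiff(Reverse, sumsq, Active, Duplicated(x, dx))

# Switch the verbose output off again once the logs have been captured.
Enzyme.API.printall!(false)
```

The flag has to be set before the first autodiff call on the function of interest, since the output is produced while Enzyme compiles the derivative.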