YWang-east opened this issue 2 years ago
Btw, the code is working properly on a single CPU/GPU.
That is a nasty issue. It looks like MPI is trying to send more than it should: 216000/8 == 30^3 Float64 numbers instead of 8192/8 == 32^2 Float64 numbers (which is what the receive buffer correctly provides space for). At this point, I would guess it is an issue in MPI.jl, which may not correctly detect the size of the abstract send buffer passed as argument to MPI.Isend. I will try to verify this hypothesis and come up with a solution.
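For reference, and assuming Float64 (8 bytes per number), the byte counts in the error message translate to array sizes as follows; the 1024 numbers the receive buffer expects would correspond, for example, to a single 32×32 boundary plane of the local 3D array (this interpretation is an assumption, not stated in the error message):

```julia
# Byte counts from the MPI error message, converted to numbers of Float64 values
216000 ÷ sizeof(Float64)  # 27000 == 30^3  -> a whole 30×30×30 block was sent
8192   ÷ sizeof(Float64)  # 1024  == 32^2  -> room for one 32×32 plane in the receive buffer
```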
It turns out that there is an issue in your code that causes this behaviour; there is no issue in ImplicitGlobalGrid or its dependencies concerning the reported error.
When you get a low-level MPI error message while using ImplicitGlobalGrid, instead of a simple and clear error message from ImplicitGlobalGrid itself, it normally means that the logic of your distributed parallelization is wrong, i.e. your program does not make sense when executed in parallel. In your case, not all processes make the same calls to ImplicitGlobalGrid functions, because the program control flow differs between process 0 and the rest: processes with non-zero id exit the while loop earlier than process 0, because the while-loop condition is evaluated differently on process 0 (as it depends on gather!).

Pay particular attention to gather!: this function gathers the global array only on the root process (by default process 0); on the other processes nothing is gathered! Also, for performance and scaling reasons, you should never call gather! every iteration in the innermost loop. Let me know if something is not clear...
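To make this concrete, here is a minimal sketch of a loop structure that avoids the problem (this is not the code from this issue; the names diffusion_3D, T, T_inn, T_g, tol and maxit, the local error measure, and the sizing of T_g are all assumptions). A local error measure is combined across processes with MPI.Allreduce, so the while condition evaluates to the identical value on every rank, and gather! is called a single time after the loop rather than every iteration:

```julia
using ImplicitGlobalGrid, MPI

function diffusion_3D(; nx=32, ny=32, nz=32, tol=1e-8, maxit=10_000)
    me, dims = init_global_grid(nx, ny, nz)            # also initializes MPI by default
    T     = rand(nx, ny, nz)                           # local field (placeholder initialization)
    T_inn = zeros(nx-2, ny-2, nz-2)                    # inner points, used only for gathering
    T_g   = zeros((nx-2)*dims[1], (ny-2)*dims[2], (nz-2)*dims[3])  # one local tile per process; only filled on root

    err = 2tol; it = 0
    while err > tol && it < maxit
        # ... diffusion update of T goes here ...
        update_halo!(T)                                # every process must reach this call
        err_loc = maximum(abs.(T))                     # purely local error measure (placeholder)
        err     = MPI.Allreduce(err_loc, MPI.MAX, MPI.COMM_WORLD)  # identical value on all ranks
        it     += 1
    end

    T_inn .= T[2:end-1, 2:end-1, 2:end-1]
    gather!(T_inn, T_g)                                # gather once, after the loop; T_g is filled on rank 0 only
    (me == 0) && println("done after $it iterations, err = $err")
    finalize_global_grid()
    return
end

diffusion_3D()
```

The key point is that the stopping condition never depends on data that exists only on the root; if you prefer to evaluate the condition on the root only, you would instead have to broadcast the resulting decision to all ranks before the next iteration.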
When I tried to run my part 1 code (Diffusion_3D) with multiple XPU processes, I kept receiving this error message from update_halo!:

```
ERROR: LoadError: MPIError(739389454): Message truncated, error stack:
PMPI_Wait(204): MPI_Wait(request=0x14cb310bfab0, status=0x7fff89d02960) failed
MPIR_Wait(104):
do_cts(515)...: Message truncated; 216000 bytes received but buffer size is 8192
Stacktrace:
 [1] Wait!(req::MPI.Request)
   @ MPI ~/.julia/packages/MPI/90ZrE/src/pointtopoint.jl:405
 [2] _update_halo!(fields::Array{Float64, 3})
   @ ImplicitGlobalGrid ~/.julia/packages/ImplicitGlobalGrid/b8fz5/src/update_halo.jl:64
 [3] update_halo!
   @ ~/.julia/packages/ImplicitGlobalGrid/b8fz5/src/update_halo.jl:27 [inlined]
```