JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/
Other
1.21k stars 219 forks source link

Error returned from CUDA function in CUDA-aware MPI multi-GPU test #2522

Open luciano-drozda opened 6 days ago

luciano-drozda commented 6 days ago

Describe the bug

CUDA-aware MPI multi-GPU test (available here) fails by returning the following error message:

<node>.cluster.3981Error returned from CUDA function.

<node>.cluster.3981julia: CUDA failure: cuCtxGetDevice() (at /home/scm/gitrepo/ifs-all/components/psm/temp.build-cuda/BUILD/libpsm2-11.2.201/psm_user.h:449)returned 201
<node>.cluster.3980julia: CUDA failure: cuCtxGetDevice() (at /home/scm/gitrepo/ifs-all/components/psm/temp.build-cuda/BUILD/libpsm2-11.2.201/psm_user.h:449)returned 201
<node>.cluster.3980Error returned from CUDA function.

[3981] signal (6.-6): Abandon
in expression starting at /scratch/drozda/alltoall_test_cuda_multigpu.jl:7
gsignal at /lib64/libc.so.6 (unknown line)
abort at /lib64/libc.so.6 (unknown line)

[3980] signal (6.-6): Abandon
in expression starting at /scratch/drozda/alltoall_test_cuda_multigpu.jl:7
gsignal at /lib64/libc.so.6 (unknown line)
abort at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x2b34f82054df)
unknown function (ip: 0x2b34f81f549a)
psm2_mq_send2 at /lib64/libpsm2.so.2 (unknown line)
ompi_mtl_psm2_send at /softs/local/openmpi/lib/openmpi/mca_mtl_psm2.so (unknown line)
mca_pml_cm_send at /softs/local/openmpi/lib/openmpi/mca_pml_cm.so (unknown line)
ompi_coll_base_sendrecv_actual at /softs/local/openmpi/lib/libmpi.so (unknown line)
ompi_coll_base_allreduce_intra_recursivedoubling at /softs/local/openmpi/lib/libmpi.so (unknown line)
ompi_coll_tuned_allreduce_intra_dec_fixed at /softs/local/openmpi/lib/openmpi/mca_coll_tuned.so (unknown line)
ompi_comm_split_type at /softs/local/openmpi/lib/libmpi.so (unknown line)
MPI_Comm_split_type at /softs/local/openmpi/lib/libmpi.so (unknown line)
MPI_Comm_split_type at /scratch/drozda/.julia/packages/MPI/TKXAj/src/api/generated_api.jl:1017 [inlined]
#Comm_split_type#2 at /scratch/drozda/.julia/packages/MPI/TKXAj/src/comm.jl:213 [inlined]
Comm_split_type at /scratch/drozda/.julia/packages/MPI/TKXAj/src/comm.jl:206
unknown function (ip: 0x2b33d9e6cfb3)
unknown function (ip: 0x2b725d0274df)
unknown function (ip: 0x2b725d01749a)
psm2_mq_send2 at /lib64/libpsm2.so.2 (unknown line)
ompi_mtl_psm2_send at /softs/local/openmpi/lib/openmpi/mca_mtl_psm2.so (unknown line)
mca_pml_cm_send at /softs/local/openmpi/lib/openmpi/mca_pml_cm.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
ompi_coll_base_sendrecv_actual at /softs/local/openmpi/lib/libmpi.so (unknown line)
ompi_coll_base_allreduce_intra_recursivedoubling at /softs/local/openmpi/lib/libmpi.so (unknown line)
ompi_coll_tuned_allreduce_intra_dec_fixed at /softs/local/openmpi/lib/openmpi/mca_coll_tuned.so (unknown line)
ompi_comm_split_type at /softs/local/openmpi/lib/libmpi.so (unknown line)
MPI_Comm_split_type at /softs/local/openmpi/lib/libmpi.so (unknown line)
MPI_Comm_split_type at /scratch/drozda/.julia/packages/MPI/TKXAj/src/api/generated_api.jl:1017 [inlined]
#Comm_split_type#2 at /scratch/drozda/.julia/packages/MPI/TKXAj/src/comm.jl:213 [inlined]
Comm_split_type at /scratch/drozda/.julia/packages/MPI/TKXAj/src/comm.jl:206
unknown function (ip: 0x2b713ec78fb3)
jl_apply at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_call at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:126
eval_value at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:223
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
eval_stmt_value at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:617
jl_interpret_toplevel_thunk at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:775
jl_apply at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_call at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:126
eval_value at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:223
jl_toplevel_eval_flex at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/toplevel.c:934
eval_stmt_value at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:617
jl_toplevel_eval_flex at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/toplevel.c:877
ijl_toplevel_eval_in at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/toplevel.c:985
jl_interpret_toplevel_thunk at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:775
jl_toplevel_eval_flex at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/toplevel.c:934
jl_toplevel_eval_flex at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/toplevel.c:877
ijl_toplevel_eval_in at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/toplevel.c:985
eval at ./boot.jl:385 [inlined]
include_string at ./loading.jl:2076
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
eval at ./boot.jl:385 [inlined]
include_string at ./loading.jl:2076
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
_include at ./loading.jl:2136
_include at ./loading.jl:2136
include at ./Base.jl:495
include at ./Base.jl:495
jfptr_include_46447.1 at /home/drozda/.julia/juliaup/julia-1.10.5+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
jfptr_include_46447.1 at /home/drozda/.julia/juliaup/julia-1.10.5+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
exec_options at ./client.jl:318
exec_options at ./client.jl:318
_start at ./client.jl:552
_start at ./client.jl:552
jfptr__start_82798.1 at /home/drozda/.julia/juliaup/julia-1.10.5+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
jfptr__start_82798.1 at /home/drozda/.julia/juliaup/julia-1.10.5+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
jl_apply at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
true_main at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/jlapi.c:582
jl_repl_entrypoint at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/jlapi.c:731
main at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/cli/loader_exe.c:58
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 3381622 (Pool: 3380170; Big: 1452); GC: 5
jl_apply at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
true_main at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/jlapi.c:582
jl_repl_entrypoint at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/jlapi.c:731
main at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/cli/loader_exe.c:58
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 3381584 (Pool: 3380131; Big: 1453); GC: 5
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node <node> exited on signal 6 (Aborted).
--------------------------------------------------------------------------

To reproduce

The Minimal Working Example (MWE) for this bug:

mpirun -np 2 julia alltoall_test_cuda_multigpu.jl
Manifest.toml

``` # This file is machine-generated - editing it directly is not advised julia_version = "1.10.5" manifest_format = "2.0" project_hash = "31dfb9584b92c0a34a515eea136f306518f1752e" [[deps.AbstractFFTs]] deps = ["LinearAlgebra"] git-tree-sha1 = "d92ad398961a3ed262d8bf04a1a2b8340f915fef" uuid = "621f4979-c628-5d54-868e-fcf4e3e8185c" version = "1.5.0" [deps.AbstractFFTs.extensions] AbstractFFTsChainRulesCoreExt = "ChainRulesCore" AbstractFFTsTestExt = "Test" [deps.AbstractFFTs.weakdeps] ChainRulesCore = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4" Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40" [[deps.Adapt]] deps = ["LinearAlgebra", "Requires"] git-tree-sha1 = "6a55b747d1812e699320963ffde36f1ebdda4099" uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e" version = "4.0.4" weakdeps = ["StaticArrays"] [deps.Adapt.extensions] AdaptStaticArraysExt = "StaticArrays" [[deps.ArgTools]] uuid = "0dad84c5-d112-42e6-8d28-ef12dabb789f" version = "1.1.1" [[deps.Artifacts]] uuid = "56f22d72-fd6d-98f1-02f0-08ddc0907c33" [[deps.Atomix]] deps = ["UnsafeAtomics"] git-tree-sha1 = "c06a868224ecba914baa6942988e2f2aade419be" uuid = "a9b6321e-bd34-4604-b9c9-b65b8de01458" version = "0.1.0" [[deps.BFloat16s]] deps = ["LinearAlgebra", "Printf", "Random", "Test"] git-tree-sha1 = "2c7cc21e8678eff479978a0a2ef5ce2f51b63dff" uuid = "ab4f0b2a-ad5b-11e8-123f-65d77653426b" version = "0.5.0" [[deps.Base64]] uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f" [[deps.CEnum]] git-tree-sha1 = "389ad5c84de1ae7cf0e28e381131c98ea87d54fc" uuid = "fa961155-64e5-5f13-b03f-caf6b980ea82" version = "0.5.0" [[deps.CUDA]] deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CUDA_Driver_jll", "CUDA_Runtime_Discovery", "CUDA_Runtime_jll", "Crayons", "DataFrames", "ExprTools", "GPUArrays", "GPUCompiler", "KernelAbstractions", "LLVM", "LLVMLoopInfo", "LazyArtifacts", "Libdl", "LinearAlgebra", "Logging", "NVTX", "Preferences", "PrettyTables", "Printf", "Random", "Random123", "RandomNumbers", "Reexport", "Requires", "SparseArrays", "StaticArrays", "Statistics", "demumble_jll"] git-tree-sha1 = "e0725a467822697171af4dae15cec10b4fc19053" uuid = "052768ef-5323-5732-b1bb-66c8b64840ba" version = "5.5.2" [deps.CUDA.extensions] ChainRulesCoreExt = "ChainRulesCore" EnzymeCoreExt = "EnzymeCore" SpecialFunctionsExt = "SpecialFunctions" [deps.CUDA.weakdeps] ChainRulesCore = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4" EnzymeCore = "f151be2c-9106-41f4-ab19-57ee4f262869" SpecialFunctions = "276daf66-3868-5448-9aa4-cd146d93841b" [[deps.CUDA_Driver_jll]] deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"] git-tree-sha1 = "ccd1e54610c222fadfd4737dac66bff786f63656" uuid = "4ee394cb-3365-5eb0-8335-949819d2adfc" version = "0.10.3+0" [[deps.CUDA_Runtime_Discovery]] deps = ["Libdl"] git-tree-sha1 = "33576c7c1b2500f8e7e6baa082e04563203b3a45" uuid = "1af6417a-86b4-443c-805f-a4643ffb695f" version = "0.3.5" [[deps.CUDA_Runtime_jll]] deps = ["Artifacts", "CUDA_Driver_jll", "JLLWrappers", "LazyArtifacts", "Libdl", "TOML"] git-tree-sha1 = "e43727b237b2879a34391eeb81887699a26f8f2f" uuid = "76a88914-d11a-5bdc-97e0-2f5a05c973a2" version = "0.15.3+0" [[deps.CodeTracking]] deps = ["InteractiveUtils", "UUIDs"] git-tree-sha1 = "7eee164f122511d3e4e1ebadb7956939ea7e1c77" uuid = "da1fd8a2-8d9e-5ec2-8556-3022fb5608a2" version = "1.3.6" [[deps.ColorTypes]] deps = ["FixedPointNumbers", "Random"] git-tree-sha1 = "b10d0b65641d57b8b4d5e234446582de5047050d" uuid = "3da002f7-5984-5a60-b8a6-cbb66c0b333f" version = "0.11.5" [[deps.Colors]] deps = ["ColorTypes", "FixedPointNumbers", "Reexport"] git-tree-sha1 = "362a287c3aa50601b0bc359053d5c2468f0e7ce0" uuid = "5ae59095-9a9b-59fe-a467-6f913c188581" version = "0.12.11" [[deps.Compat]] deps = ["TOML", "UUIDs"] git-tree-sha1 = "8ae8d32e09f0dcf42a36b90d4e17f5dd2e4c4215" uuid = "34da2185-b29b-5c13-b0c7-acf172513d20" version = "4.16.0" weakdeps = ["Dates", "LinearAlgebra"] [deps.Compat.extensions] CompatLinearAlgebraExt = "LinearAlgebra" [[deps.CompilerSupportLibraries_jll]] deps = ["Artifacts", "Libdl"] uuid = "e66e0078-7015-5450-92f7-15fbd957f2ae" version = "1.1.1+0" [[deps.Crayons]] git-tree-sha1 = "249fe38abf76d48563e2f4556bebd215aa317e15" uuid = "a8cc5b0e-0ffa-5ad4-8c14-923d3ee1735f" version = "4.1.1" [[deps.DataAPI]] git-tree-sha1 = "abe83f3a2f1b857aac70ef8b269080af17764bbe" uuid = "9a962f9c-6df0-11e9-0e5d-c546b8b5ee8a" version = "1.16.0" [[deps.DataFrames]] deps = ["Compat", "DataAPI", "DataStructures", "Future", "InlineStrings", "InvertedIndices", "IteratorInterfaceExtensions", "LinearAlgebra", "Markdown", "Missings", "PooledArrays", "PrecompileTools", "PrettyTables", "Printf", "Random", "Reexport", "SentinelArrays", "SortingAlgorithms", "Statistics", "TableTraits", "Tables", "Unicode"] git-tree-sha1 = "fb61b4812c49343d7ef0b533ba982c46021938a6" uuid = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0" version = "1.7.0" [[deps.DataStructures]] deps = ["Compat", "InteractiveUtils", "OrderedCollections"] git-tree-sha1 = "1d0a14036acb104d9e89698bd408f63ab58cdc82" uuid = "864edb3b-99cc-5e75-8d2d-829cb0a9cfe8" version = "0.18.20" [[deps.DataValueInterfaces]] git-tree-sha1 = "bfc1187b79289637fa0ef6d4436ebdfe6905cbd6" uuid = "e2d170a0-9d28-54be-80f0-106bbe20a464" version = "1.0.0" [[deps.Dates]] deps = ["Printf"] uuid = "ade2ca70-3891-5945-98fb-dc099432e06a" [[deps.Distributed]] deps = ["Random", "Serialization", "Sockets"] uuid = "8ba89e20-285c-5b6f-9357-94700520ee1b" [[deps.DocStringExtensions]] deps = ["LibGit2"] git-tree-sha1 = "2fb1e02f2b635d0845df5d7c167fec4dd739b00d" uuid = "ffbed154-4ef7-542d-bbb7-c09d3a79fcae" version = "0.9.3" [[deps.Downloads]] deps = ["ArgTools", "FileWatching", "LibCURL", "NetworkOptions"] uuid = "f43a241f-c20a-4ad4-852c-f6b1247861c6" version = "1.6.0" [[deps.ExprTools]] git-tree-sha1 = "27415f162e6028e81c72b82ef756bf321213b6ec" uuid = "e2ba6199-217a-4e67-a87a-7c52f15ade04" version = "0.1.10" [[deps.FileWatching]] uuid = "7b1f6079-737a-58dc-b8bc-7a2ca5c1b5ee" [[deps.FixedPointNumbers]] deps = ["Statistics"] git-tree-sha1 = "05882d6995ae5c12bb5f36dd2ed3f61c98cbb172" uuid = "53c48c17-4a7d-5ca2-90c5-79b7896eea93" version = "0.8.5" [[deps.Future]] deps = ["Random"] uuid = "9fa8497b-333b-5362-9e8d-4d0656e87820" [[deps.GPUArrays]] deps = ["Adapt", "GPUArraysCore", "LLVM", "LinearAlgebra", "Printf", "Random", "Reexport", "Serialization", "Statistics"] git-tree-sha1 = "62ee71528cca49be797076a76bdc654a170a523e" uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7" version = "10.3.1" [[deps.GPUArraysCore]] deps = ["Adapt"] git-tree-sha1 = "ec632f177c0d990e64d955ccc1b8c04c485a0950" uuid = "46192b85-c4d5-4398-a991-12ede77f4527" version = "0.1.6" [[deps.GPUCompiler]] deps = ["ExprTools", "InteractiveUtils", "LLVM", "Libdl", "Logging", "PrecompileTools", "Preferences", "Scratch", "Serialization", "TOML", "TimerOutputs", "UUIDs"] git-tree-sha1 = "1d6f290a5eb1201cd63574fbc4440c788d5cb38f" uuid = "61eb1bfa-7361-4325-ad38-22787b887f55" version = "0.27.8" [[deps.Hwloc_jll]] deps = ["Artifacts", "JLLWrappers", "Libdl"] git-tree-sha1 = "dd3b49277ec2bb2c6b94eb1604d4d0616016f7a6" uuid = "e33a78d0-f292-5ffc-b300-72abe9b543c8" version = "2.11.2+0" [[deps.InlineStrings]] git-tree-sha1 = "45521d31238e87ee9f9732561bfee12d4eebd52d" uuid = "842dd82b-1e85-43dc-bf29-5d0ee9dffc48" version = "1.4.2" [deps.InlineStrings.extensions] ArrowTypesExt = "ArrowTypes" ParsersExt = "Parsers" [deps.InlineStrings.weakdeps] ArrowTypes = "31f734f8-188a-4ce0-8406-c8a06bd891cd" Parsers = "69de0a69-1ddd-5017-9359-2bf0b02dc9f0" [[deps.InteractiveUtils]] deps = ["Markdown"] uuid = "b77e0a4c-d291-57a0-90e8-8db25a27a240" [[deps.InvertedIndices]] git-tree-sha1 = "0dc7b50b8d436461be01300fd8cd45aa0274b038" uuid = "41ab1584-1d38-5bbf-9106-f11c6c58b48f" version = "1.3.0" [[deps.IteratorInterfaceExtensions]] git-tree-sha1 = "a3f24677c21f5bbe9d2a714f95dcd58337fb2856" uuid = "82899510-4779-5014-852e-03e436cf321d" version = "1.0.0" [[deps.JLLWrappers]] deps = ["Artifacts", "Preferences"] git-tree-sha1 = "be3dc50a92e5a386872a493a10050136d4703f9b" uuid = "692b3bcd-3c85-4b1f-b108-f13ce0eb3210" version = "1.6.1" [[deps.JuliaInterpreter]] deps = ["CodeTracking", "InteractiveUtils", "Random", "UUIDs"] git-tree-sha1 = "2984284a8abcfcc4784d95a9e2ea4e352dd8ede7" uuid = "aa1ae85d-cabe-5617-a682-6adf51b2e16a" version = "0.9.36" [[deps.JuliaNVTXCallbacks_jll]] deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"] git-tree-sha1 = "af433a10f3942e882d3c671aacb203e006a5808f" uuid = "9c1d0b0a-7046-5b2e-a33f-ea22f176ac7e" version = "0.2.1+0" [[deps.KernelAbstractions]] deps = ["Adapt", "Atomix", "InteractiveUtils", "MacroTools", "PrecompileTools", "Requires", "StaticArrays", "UUIDs", "UnsafeAtomics", "UnsafeAtomicsLLVM"] git-tree-sha1 = "04e52f596d0871fa3890170fa79cb15e481e4cd8" uuid = "63c18a36-062a-441e-b654-da1e3ab1ce7c" version = "0.9.28" [deps.KernelAbstractions.extensions] EnzymeExt = "EnzymeCore" LinearAlgebraExt = "LinearAlgebra" SparseArraysExt = "SparseArrays" [deps.KernelAbstractions.weakdeps] EnzymeCore = "f151be2c-9106-41f4-ab19-57ee4f262869" LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e" SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf" [[deps.LLVM]] deps = ["CEnum", "LLVMExtra_jll", "Libdl", "Preferences", "Printf", "Requires", "Unicode"] git-tree-sha1 = "4ad43cb0a4bb5e5b1506e1d1f48646d7e0c80363" uuid = "929cbde3-209d-540e-8aea-75f648917ca0" version = "9.1.2" weakdeps = ["BFloat16s"] [deps.LLVM.extensions] BFloat16sExt = "BFloat16s" [[deps.LLVMExtra_jll]] deps = ["Artifacts", "JLLWrappers", "LazyArtifacts", "Libdl", "TOML"] git-tree-sha1 = "05a8bd5a42309a9ec82f700876903abce1017dd3" uuid = "dad2f222-ce93-54a1-a47d-0025e8a3acab" version = "0.0.34+0" [[deps.LLVMLoopInfo]] git-tree-sha1 = "2e5c102cfc41f48ae4740c7eca7743cc7e7b75ea" uuid = "8b046642-f1f6-4319-8d3c-209ddc03c586" version = "1.0.0" [[deps.LaTeXStrings]] git-tree-sha1 = "50901ebc375ed41dbf8058da26f9de442febbbec" uuid = "b964fa9f-0449-5b57-a5c2-d3ea65f4040f" version = "1.3.1" [[deps.LazyArtifacts]] deps = ["Artifacts", "Pkg"] uuid = "4af54fe1-eca0-43a8-85a7-787d91b784e3" [[deps.LibCURL]] deps = ["LibCURL_jll", "MozillaCACerts_jll"] uuid = "b27032c2-a3e7-50c8-80cd-2d36dbcbfd21" version = "0.6.4" [[deps.LibCURL_jll]] deps = ["Artifacts", "LibSSH2_jll", "Libdl", "MbedTLS_jll", "Zlib_jll", "nghttp2_jll"] uuid = "deac9b47-8bc7-5906-a0fe-35ac56dc84c0" version = "8.4.0+0" [[deps.LibGit2]] deps = ["Base64", "LibGit2_jll", "NetworkOptions", "Printf", "SHA"] uuid = "76f85450-5226-5b5a-8eaa-529ad045b433" [[deps.LibGit2_jll]] deps = ["Artifacts", "LibSSH2_jll", "Libdl", "MbedTLS_jll"] uuid = "e37daf67-58a4-590a-8e99-b0245dd2ffc5" version = "1.6.4+0" [[deps.LibSSH2_jll]] deps = ["Artifacts", "Libdl", "MbedTLS_jll"] uuid = "29816b5a-b9ab-546f-933c-edad1886dfa8" version = "1.11.0+1" [[deps.Libdl]] uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb" [[deps.LinearAlgebra]] deps = ["Libdl", "OpenBLAS_jll", "libblastrampoline_jll"] uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e" [[deps.Logging]] uuid = "56ddb016-857b-54e1-b83d-db4d58db5568" [[deps.LoweredCodeUtils]] deps = ["JuliaInterpreter"] git-tree-sha1 = "96d2a4a668f5c098fb8a26ce7da53cde3e462a80" uuid = "6f1432cf-f94c-5a45-995e-cdbf5db27b0b" version = "3.0.3" [[deps.MPI]] deps = ["Distributed", "DocStringExtensions", "Libdl", "MPICH_jll", "MPIPreferences", "MPItrampoline_jll", "MicrosoftMPI_jll", "OpenMPI_jll", "PkgVersion", "PrecompileTools", "Requires", "Serialization", "Sockets"] git-tree-sha1 = "892676019c58f34e38743bc989b0eca5bce5edc5" uuid = "da04e1cc-30fd-572f-bb4f-1f8673147195" version = "0.20.22" [deps.MPI.extensions] AMDGPUExt = "AMDGPU" CUDAExt = "CUDA" [deps.MPI.weakdeps] AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e" CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba" [[deps.MPICH_jll]] deps = ["Artifacts", "CompilerSupportLibraries_jll", "Hwloc_jll", "JLLWrappers", "LazyArtifacts", "Libdl", "MPIPreferences", "TOML"] git-tree-sha1 = "7715e65c47ba3941c502bffb7f266a41a7f54423" uuid = "7cb0a576-ebde-5e09-9194-50597f1243b4" version = "4.2.3+0" [[deps.MPIPreferences]] deps = ["Libdl", "Preferences"] git-tree-sha1 = "c105fe467859e7f6e9a852cb15cb4301126fac07" uuid = "3da0fdf6-3ccc-4f1b-acd9-58baa6c99267" version = "0.1.11" [[deps.MPItrampoline_jll]] deps = ["Artifacts", "CompilerSupportLibraries_jll", "JLLWrappers", "LazyArtifacts", "Libdl", "MPIPreferences", "TOML"] git-tree-sha1 = "70e830dab5d0775183c99fc75e4c24c614ed7142" uuid = "f1f71cc9-e9ae-5b93-9b94-4fe0e1ad3748" version = "5.5.1+0" [[deps.MacroTools]] deps = ["Markdown", "Random"] git-tree-sha1 = "2fa9ee3e63fd3a4f7a9a4f4744a52f4856de82df" uuid = "1914dd2f-81c6-5fcd-8719-6d5c9610ff09" version = "0.5.13" [[deps.Markdown]] deps = ["Base64"] uuid = "d6f4376e-aef5-505a-96c1-9c027394607a" [[deps.MbedTLS_jll]] deps = ["Artifacts", "Libdl"] uuid = "c8ffd9c3-330d-5841-b78e-0817d7145fa1" version = "2.28.2+1" [[deps.MicrosoftMPI_jll]] deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"] git-tree-sha1 = "f12a29c4400ba812841c6ace3f4efbb6dbb3ba01" uuid = "9237b28f-5490-5468-be7b-bb81f5f5e6cf" version = "10.1.4+2" [[deps.Missings]] deps = ["DataAPI"] git-tree-sha1 = "ec4f7fbeab05d7747bdf98eb74d130a2a2ed298d" uuid = "e1d29d7a-bbdc-5cf2-9ac0-f12de2c33e28" version = "1.2.0" [[deps.MozillaCACerts_jll]] uuid = "14a3606d-f60d-562e-9121-12d972cd8159" version = "2023.1.10" [[deps.NVTX]] deps = ["Colors", "JuliaNVTXCallbacks_jll", "Libdl", "NVTX_jll"] git-tree-sha1 = "53046f0483375e3ed78e49190f1154fa0a4083a1" uuid = "5da4648a-3479-48b8-97b9-01cb529c0a1f" version = "0.3.4" [[deps.NVTX_jll]] deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"] git-tree-sha1 = "ce3269ed42816bf18d500c9f63418d4b0d9f5a3b" uuid = "e98f9f5b-d649-5603-91fd-7774390e6439" version = "3.1.0+2" [[deps.NetworkOptions]] uuid = "ca575930-c2e3-43a9-ace4-1e988b2c1908" version = "1.2.0" [[deps.OpenBLAS_jll]] deps = ["Artifacts", "CompilerSupportLibraries_jll", "Libdl"] uuid = "4536629a-c528-5b80-bd46-f80d51c5b363" version = "0.3.23+4" [[deps.OpenMPI_jll]] deps = ["Artifacts", "CompilerSupportLibraries_jll", "Hwloc_jll", "JLLWrappers", "LazyArtifacts", "Libdl", "MPIPreferences", "TOML", "Zlib_jll"] git-tree-sha1 = "bfce6d523861a6c562721b262c0d1aaeead2647f" uuid = "fe0851c0-eecd-5654-98d4-656369965a5c" version = "5.0.5+0" [[deps.OrderedCollections]] git-tree-sha1 = "dfdf5519f235516220579f949664f1bf44e741c5" uuid = "bac558e1-5e72-5ebc-8fee-abe8a469f55d" version = "1.6.3" [[deps.Pkg]] deps = ["Artifacts", "Dates", "Downloads", "FileWatching", "LibGit2", "Libdl", "Logging", "Markdown", "Printf", "REPL", "Random", "SHA", "Serialization", "TOML", "Tar", "UUIDs", "p7zip_jll"] uuid = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f" version = "1.10.0" [[deps.PkgVersion]] deps = ["Pkg"] git-tree-sha1 = "f9501cc0430a26bc3d156ae1b5b0c1b47af4d6da" uuid = "eebad327-c553-4316-9ea0-9fa01ccd7688" version = "0.3.3" [[deps.PooledArrays]] deps = ["DataAPI", "Future"] git-tree-sha1 = "36d8b4b899628fb92c2749eb488d884a926614d3" uuid = "2dfb63ee-cc39-5dd5-95bd-886bf059d720" version = "1.4.3" [[deps.PrecompileTools]] deps = ["Preferences"] git-tree-sha1 = "5aa36f7049a63a1528fe8f7c3f2113413ffd4e1f" uuid = "aea7be01-6a6a-4083-8856-8a6e6704d82a" version = "1.2.1" [[deps.Preferences]] deps = ["TOML"] git-tree-sha1 = "9306f6085165d270f7e3db02af26a400d580f5c6" uuid = "21216c6a-2e73-6563-6e65-726566657250" version = "1.4.3" [[deps.PrettyTables]] deps = ["Crayons", "LaTeXStrings", "Markdown", "PrecompileTools", "Printf", "Reexport", "StringManipulation", "Tables"] git-tree-sha1 = "1101cd475833706e4d0e7b122218257178f48f34" uuid = "08abe8d2-0d0c-5749-adfa-8a2ac140af0d" version = "2.4.0" [[deps.Printf]] deps = ["Unicode"] uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7" [[deps.REPL]] deps = ["InteractiveUtils", "Markdown", "Sockets", "Unicode"] uuid = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb" [[deps.Random]] deps = ["SHA"] uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c" [[deps.Random123]] deps = ["Random", "RandomNumbers"] git-tree-sha1 = "4743b43e5a9c4a2ede372de7061eed81795b12e7" uuid = "74087812-796a-5b5d-8853-05524746bad3" version = "1.7.0" [[deps.RandomNumbers]] deps = ["Random"] git-tree-sha1 = "c6ec94d2aaba1ab2ff983052cf6a606ca5985902" uuid = "e6cf234a-135c-5ec9-84dd-332b85af5143" version = "1.6.0" [[deps.Reexport]] git-tree-sha1 = "45e428421666073eab6f2da5c9d310d99bb12f9b" uuid = "189a3867-3050-52da-a836-e630ba90ab69" version = "1.2.2" [[deps.Requires]] deps = ["UUIDs"] git-tree-sha1 = "838a3a4188e2ded87a4f9f184b4b0d78a1e91cb7" uuid = "ae029012-a4dd-5104-9daa-d747884805df" version = "1.3.0" [[deps.Revise]] deps = ["CodeTracking", "Distributed", "FileWatching", "JuliaInterpreter", "LibGit2", "LoweredCodeUtils", "OrderedCollections", "REPL", "Requires", "UUIDs", "Unicode"] git-tree-sha1 = "2d4e5de3ac1c348fd39ddf8adbef82aa56b65576" uuid = "295af30f-e4ad-537b-8983-00126c2a3abe" version = "3.6.1" [[deps.SHA]] uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce" version = "0.7.0" [[deps.Scratch]] deps = ["Dates"] git-tree-sha1 = "3bac05bc7e74a75fd9cba4295cde4045d9fe2386" uuid = "6c6a2e73-6563-6170-7368-637461726353" version = "1.2.1" [[deps.SentinelArrays]] deps = ["Dates", "Random"] git-tree-sha1 = "ff11acffdb082493657550959d4feb4b6149e73a" uuid = "91c51154-3ec4-41a3-a24f-3f23e20d615c" version = "1.4.5" [[deps.Serialization]] uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b" [[deps.Sockets]] uuid = "6462fe0b-24de-5631-8697-dd941f90decc" [[deps.SortingAlgorithms]] deps = ["DataStructures"] git-tree-sha1 = "66e0a8e672a0bdfca2c3f5937efb8538b9ddc085" uuid = "a2af1166-a08f-5f64-846c-94a0d3cef48c" version = "1.2.1" [[deps.SparseArrays]] deps = ["Libdl", "LinearAlgebra", "Random", "Serialization", "SuiteSparse_jll"] uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf" version = "1.10.0" [[deps.StaticArrays]] deps = ["LinearAlgebra", "PrecompileTools", "Random", "StaticArraysCore"] git-tree-sha1 = "eeafab08ae20c62c44c8399ccb9354a04b80db50" uuid = "90137ffa-7385-5640-81b9-e52037218182" version = "1.9.7" [deps.StaticArrays.extensions] StaticArraysChainRulesCoreExt = "ChainRulesCore" StaticArraysStatisticsExt = "Statistics" [deps.StaticArrays.weakdeps] ChainRulesCore = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4" Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2" [[deps.StaticArraysCore]] git-tree-sha1 = "192954ef1208c7019899fbf8049e717f92959682" uuid = "1e83bf80-4336-4d27-bf5d-d5a4f845583c" version = "1.4.3" [[deps.Statistics]] deps = ["LinearAlgebra", "SparseArrays"] uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2" version = "1.10.0" [[deps.StringManipulation]] deps = ["PrecompileTools"] git-tree-sha1 = "a6b1675a536c5ad1a60e5a5153e1fee12eb146e3" uuid = "892a3eda-7b42-436c-8928-eab12a02cf0e" version = "0.4.0" [[deps.SuiteSparse_jll]] deps = ["Artifacts", "Libdl", "libblastrampoline_jll"] uuid = "bea87d4a-7f5b-5778-9afe-8cc45184846c" version = "7.2.1+1" [[deps.TOML]] deps = ["Dates"] uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76" version = "1.0.3" [[deps.TableTraits]] deps = ["IteratorInterfaceExtensions"] git-tree-sha1 = "c06b2f539df1c6efa794486abfb6ed2022561a39" uuid = "3783bdb8-4a98-5b6b-af9a-565f29a5fe9c" version = "1.0.1" [[deps.Tables]] deps = ["DataAPI", "DataValueInterfaces", "IteratorInterfaceExtensions", "OrderedCollections", "TableTraits"] git-tree-sha1 = "598cd7c1f68d1e205689b1c2fe65a9f85846f297" uuid = "bd369af6-aec1-5ad0-b16a-f7cc5008161c" version = "1.12.0" [[deps.Tar]] deps = ["ArgTools", "SHA"] uuid = "a4e569a6-e804-4fa4-b0f3-eef7a1d5b13e" version = "1.10.0" [[deps.Test]] deps = ["InteractiveUtils", "Logging", "Random", "Serialization"] uuid = "8dfed614-e22c-5e08-85e1-65c5234f0b40" [[deps.TimerOutputs]] deps = ["ExprTools", "Printf"] git-tree-sha1 = "3a6f063d690135f5c1ba351412c82bae4d1402bf" uuid = "a759f4b9-e2f1-59dc-863e-4aeb61b1ea8f" version = "0.5.25" [[deps.UUIDs]] deps = ["Random", "SHA"] uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4" [[deps.Unicode]] uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5" [[deps.UnsafeAtomics]] git-tree-sha1 = "6331ac3440856ea1988316b46045303bef658278" uuid = "013be700-e6cd-48c3-b4a1-df204f14c38f" version = "0.2.1" [[deps.UnsafeAtomicsLLVM]] deps = ["LLVM", "UnsafeAtomics"] git-tree-sha1 = "2d17fabcd17e67d7625ce9c531fb9f40b7c42ce4" uuid = "d80eeb9a-aca5-4d75-85e5-170c8b632249" version = "0.2.1" [[deps.Zlib_jll]] deps = ["Libdl"] uuid = "83775a58-1f1d-513f-b197-d71354ab007a" version = "1.2.13+1" [[deps.demumble_jll]] deps = ["Artifacts", "JLLWrappers", "Libdl"] git-tree-sha1 = "6498e3581023f8e530f34760d18f75a69e3a4ea8" uuid = "1e29f10c-031c-5a83-9565-69cddfc27673" version = "1.3.0+0" [[deps.libblastrampoline_jll]] deps = ["Artifacts", "Libdl"] uuid = "8e850b90-86db-534c-a0d3-1478176c7d93" version = "5.11.0+0" [[deps.nghttp2_jll]] deps = ["Artifacts", "Libdl"] uuid = "8e850ede-7688-5339-a07c-302acd2aaf8d" version = "1.52.0+1" [[deps.p7zip_jll]] deps = ["Artifacts", "Libdl"] uuid = "3f19e933-33d8-53b3-aaab-bd5110c3b7a0" version = "17.4.0+2" ```

Expected behavior

Test on alltoall_test_cuda_multigpu.jl should pass

Version info

Details on Julia:

Julia Version 1.10.5
Commit 6f3fdf7b362 (2024-08-27 14:19 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, icelake-server)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)
Environment:
  JULIA_CUDA_USE_BINARYBUILDER = false
  JULIA_CUDA_MEMORY_POOL = none
  JULIA_REVISE_POLL = 1
  JULIA_DEPOT_PATH = /scratch/drozda/.julia
  LD_LIBRARY_PATH = /softs/local/openmpi/lib:/softs/nvidia/cuda-11.2/lib64:/softs/nvidia/cuda-11.2/extras/CUPTI/lib64:/softs/local/gcc/9.4.0/lib64:/softs/local_intel/ptscotch/6.0.5a/lib:/softs/local_intel/parmetis/403_64/lib:/softs/local_intel/phdf5/1.8.20/lib:/softs/local_intel/lib/:/softs/hip/lib:/softs/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64:/softs/intel/compilers_and_libraries_2018.1.163/linux/compiler/lib/intel64
  JULIA_LOAD_PATH = :/scratch/drozda/.julia/config/Project.toml

Details on CUDA:

CUDA runtime 11.2, local installation
CUDA driver 12.0
NVIDIA driver 525.60.13

CUDA libraries:
- CUBLAS: 11.4.1
- CURAND: 10.2.3
- CUFFT: 10.4.1
- CUSOLVER: 11.1.0
- CUSPARSE: 11.4.1
- CUPTI: 2020.3.1 (API 14.0.0)
- NVML: 12.0.0+525.60.13

Julia packages:
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0
- CUDA_Runtime_Discovery: 0.3.5

Toolchain:
- Julia: 1.10.5
- LLVM: 15.0.7

Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false
- JULIA_CUDA_MEMORY_POOL: none

Preferences:
- CUDA_Runtime_jll.version: 11.2
- CUDA_Runtime_jll.local: true

4 devices:
  0: NVIDIA A30 (sm_80, 23.525 GiB / 24.000 GiB available)
  1: NVIDIA A30 (sm_80, 23.525 GiB / 24.000 GiB available)
  2: NVIDIA A30 (sm_80, 23.525 GiB / 24.000 GiB available)
  3: NVIDIA A30 (sm_80, 23.525 GiB / 24.000 GiB available)

Additional context

This is the Project.toml file I append to the JULIA_LOAD_PATH environment variable:

[extras]
MPIPreferences   = "3da0fdf6-3ccc-4f1b-acd9-58baa6c99267"
CUDA_Runtime_jll = "76a88914-d11a-5bdc-97e0-2f5a05c973a2"

[preferences.MPIPreferences]
_format = "1.0"
abi     = "OpenMPI"
binary  = "system"
libmpi  = "libmpi"
mpiexec = "mpiexec"

[preferences.CUDA_Runtime_jll]
local   = "true"
version = "11.2"
maleadt commented 5 days ago

Can you explain why you think this is a CUDA.jl issue? For one, you didn't share alltoall_test_cuda_multigpu.jl, but even then the fact that you're running under mpirun and that the call to cuCtxGetDevice comes from libpsm2 seems to indicate that this hasn't anything to do with CUDA.jl. We only very rarely call cuCtxGetDevice.

luciano-drozda commented 4 days ago

@maleadt, I shared the file alltoall_test_cuda_multigpu.jl via gist link in the beginning of the message, but here it goes:

using MPI
using CUDA
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
# select device
comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
rank_l = MPI.Comm_rank(comm_l)
gpu_id = CUDA.device!(rank_l)
# select device
size = MPI.Comm_size(comm)
dst  = mod(rank+1, size)
src  = mod(rank-1, size)
println("rank=$rank rank_loc=$rank_l (gpu_id=$gpu_id), size=$size, dst=$dst, src=$src")
N = 4
send_mesg = CuArray{Float64}(undef, N)
recv_mesg = CuArray{Float64}(undef, N)
fill!(send_mesg, Float64(rank))
CUDA.synchronize()
rank==0 && println("start sending...")
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank_l: $recv_mesg")
rank==0 && println("done.")

I thought it was a CUDA.jl issue as the error message stated "Error returned from CUDA function".

Do you think it comes from my openmpi, cuda drivers installations ? Otherwise, could it come from MPI.jl ?

maleadt commented 4 days ago

Sorry, I glossed over the link.

I thought it was a CUDA.jl issue as the error message stated "Error returned from CUDA function".

CUDA.jl is a Julia interface to the CUDA library, which throws an error here.

I'm not familiar with MPI, MPI.jl or Open MPI. Maybe try opening a post on Discourse where other (i.e. HPC) users can chime in, or ask on the relevant channels on Slack.