Open tkf opened 3 years ago
As a short-term solution, maybe documenting that
mapreduce
assumes commutative operators would reduce some confusions?
I'm not sure people actually read docstrings for specific mapreduce
methods.
Maybe we should add an algorithm
or commutative
keyword argument though, where we can switch the grid-stride loop to one compatible with non-commutative operators. That may also allow us to partially revert https://github.com/JuliaGPU/CUDA.jl/pull/500, because IIRC that also resulted in worse coalescing (as visualized in https://sodocumentation.net/cuda/topic/6566/parallel-reduction--e-g--how-to-sum-an-array-).
Yeah, the opt-in commutativity assumption sounds like a good idea to me.
(+ Thanks for sharing an interesting article!)
Describe the bug
mapreduce(f, op, ::CuArray)
assumes commutativeop
.I should've paid more attention in https://github.com/JuliaGPU/CUDA.jl/pull/500 but it didn't solve https://github.com/JuliaGPU/CUDA.jl/issues/484.
To reproduce
MWE:
Manifest.toml
``` # This file is machine-generated - editing it directly is not advised [[AbstractFFTs]] deps = ["LinearAlgebra"] git-tree-sha1 = "485ee0867925449198280d4af84bdb46a2a404d0" uuid = "621f4979-c628-5d54-868e-fcf4e3e8185c" version = "1.0.1" [[Adapt]] deps = ["LinearAlgebra"] git-tree-sha1 = "84918055d15b3114ede17ac6a7182f68870c16f7" uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e" version = "3.3.1" [[ArgTools]] uuid = "0dad84c5-d112-42e6-8d28-ef12dabb789f" [[Artifacts]] uuid = "56f22d72-fd6d-98f1-02f0-08ddc0907c33" [[BFloat16s]] deps = ["LinearAlgebra", "Test"] git-tree-sha1 = "4af69e205efc343068dc8722b8dfec1ade89254a" uuid = "ab4f0b2a-ad5b-11e8-123f-65d77653426b" version = "0.1.0" [[Base64]] uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f" [[CEnum]] git-tree-sha1 = "215a9aa4a1f23fbd05b92769fdd62559488d70e9" uuid = "fa961155-64e5-5f13-b03f-caf6b980ea82" version = "0.4.1" [[CUDA]] deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CompilerSupportLibraries_jll", "DataStructures", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "LazyArtifacts", "Libdl", "LinearAlgebra", "Logging", "Printf", "Random", "Random123", "RandomNumbers", "Reexport", "Requires", "SparseArrays", "SpecialFunctions", "TimerOutputs"] git-tree-sha1 = "5e696e37e51b01ae07bd9f700afe6cbd55250bce" uuid = "052768ef-5323-5732-b1bb-66c8b64840ba" version = "3.3.4" [[ChainRulesCore]] deps = ["Compat", "LinearAlgebra", "SparseArrays"] git-tree-sha1 = "f53ca8d41e4753c41cdafa6ec5f7ce914b34be54" uuid = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4" version = "0.10.13" [[Compat]] deps = ["Base64", "Dates", "DelimitedFiles", "Distributed", "InteractiveUtils", "LibGit2", "Libdl", "LinearAlgebra", "Markdown", "Mmap", "Pkg", "Printf", "REPL", "Random", "SHA", "Serialization", "SharedArrays", "Sockets", "SparseArrays", "Statistics", "Test", "UUIDs", "Unicode"] git-tree-sha1 = "dc7dedc2c2aa9faf59a55c622760a25cbefbe941" uuid = "34da2185-b29b-5c13-b0c7-acf172513d20" version = "3.31.0" [[CompilerSupportLibraries_jll]] deps = ["Artifacts", "Libdl"] uuid = "e66e0078-7015-5450-92f7-15fbd957f2ae" [[DataStructures]] deps = ["Compat", "InteractiveUtils", "OrderedCollections"] git-tree-sha1 = "4437b64df1e0adccc3e5d1adbc3ac741095e4677" uuid = "864edb3b-99cc-5e75-8d2d-829cb0a9cfe8" version = "0.18.9" [[Dates]] deps = ["Printf"] uuid = "ade2ca70-3891-5945-98fb-dc099432e06a" [[DelimitedFiles]] deps = ["Mmap"] uuid = "8bb1440f-4735-579b-a4ab-409b98df4dab" [[Distributed]] deps = ["Random", "Serialization", "Sockets"] uuid = "8ba89e20-285c-5b6f-9357-94700520ee1b" [[DocStringExtensions]] deps = ["LibGit2"] git-tree-sha1 = "a32185f5428d3986f47c2ab78b1f216d5e6cc96f" uuid = "ffbed154-4ef7-542d-bbb7-c09d3a79fcae" version = "0.8.5" [[Downloads]] deps = ["ArgTools", "LibCURL", "NetworkOptions"] uuid = "f43a241f-c20a-4ad4-852c-f6b1247861c6" [[ExprTools]] git-tree-sha1 = "b7e3d17636b348f005f11040025ae8c6f645fe92" uuid = "e2ba6199-217a-4e67-a87a-7c52f15ade04" version = "0.1.6" [[GPUArrays]] deps = ["AbstractFFTs", "Adapt", "LinearAlgebra", "Printf", "Random", "Serialization", "Statistics"] git-tree-sha1 = "ececbf05f8904c92814bdbd0aafd5540b0bf2e9a" uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7" version = "7.0.1" [[GPUCompiler]] deps = ["DataStructures", "ExprTools", "InteractiveUtils", "LLVM", "Libdl", "Logging", "TimerOutputs", "UUIDs"] git-tree-sha1 = "e8a09182a4440489e2e3dedff5ad3f6bbe555396" uuid = "61eb1bfa-7361-4325-ad38-22787b887f55" version = "0.12.5" [[InteractiveUtils]] deps = ["Markdown"] uuid = "b77e0a4c-d291-57a0-90e8-8db25a27a240" [[JLLWrappers]] deps = ["Preferences"] git-tree-sha1 = "642a199af8b68253517b80bd3bfd17eb4e84df6e" uuid = "692b3bcd-3c85-4b1f-b108-f13ce0eb3210" version = "1.3.0" [[LLVM]] deps = ["CEnum", "LLVMExtra_jll", "Libdl", "Printf", "Unicode"] git-tree-sha1 = "1b7ba36ea7aa6fa2278118951bad114fbb8359f2" uuid = "929cbde3-209d-540e-8aea-75f648917ca0" version = "4.1.0" [[LLVMExtra_jll]] deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"] git-tree-sha1 = "b36c0677a0549c7d1dc8719899a4133abbfacf7d" uuid = "dad2f222-ce93-54a1-a47d-0025e8a3acab" version = "0.0.6+0" [[LazyArtifacts]] deps = ["Artifacts", "Pkg"] uuid = "4af54fe1-eca0-43a8-85a7-787d91b784e3" [[LibCURL]] deps = ["LibCURL_jll", "MozillaCACerts_jll"] uuid = "b27032c2-a3e7-50c8-80cd-2d36dbcbfd21" [[LibCURL_jll]] deps = ["Artifacts", "LibSSH2_jll", "Libdl", "MbedTLS_jll", "Zlib_jll", "nghttp2_jll"] uuid = "deac9b47-8bc7-5906-a0fe-35ac56dc84c0" [[LibGit2]] deps = ["Base64", "NetworkOptions", "Printf", "SHA"] uuid = "76f85450-5226-5b5a-8eaa-529ad045b433" [[LibSSH2_jll]] deps = ["Artifacts", "Libdl", "MbedTLS_jll"] uuid = "29816b5a-b9ab-546f-933c-edad1886dfa8" [[Libdl]] uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb" [[LinearAlgebra]] deps = ["Libdl"] uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e" [[LogExpFunctions]] deps = ["DocStringExtensions", "LinearAlgebra"] git-tree-sha1 = "7bd5f6565d80b6bf753738d2bc40a5dfea072070" uuid = "2ab3a3ac-af41-5b50-aa03-7779005ae688" version = "0.2.5" [[Logging]] uuid = "56ddb016-857b-54e1-b83d-db4d58db5568" [[Markdown]] deps = ["Base64"] uuid = "d6f4376e-aef5-505a-96c1-9c027394607a" [[MbedTLS_jll]] deps = ["Artifacts", "Libdl"] uuid = "c8ffd9c3-330d-5841-b78e-0817d7145fa1" [[Mmap]] uuid = "a63ad114-7e13-5084-954f-fe012c677804" [[MozillaCACerts_jll]] uuid = "14a3606d-f60d-562e-9121-12d972cd8159" [[NetworkOptions]] uuid = "ca575930-c2e3-43a9-ace4-1e988b2c1908" [[OpenSpecFun_jll]] deps = ["Artifacts", "CompilerSupportLibraries_jll", "JLLWrappers", "Libdl", "Pkg"] git-tree-sha1 = "13652491f6856acfd2db29360e1bbcd4565d04f1" uuid = "efe28fd5-8261-553b-a9e1-b2916fc3738e" version = "0.5.5+0" [[OrderedCollections]] git-tree-sha1 = "85f8e6578bf1f9ee0d11e7bb1b1456435479d47c" uuid = "bac558e1-5e72-5ebc-8fee-abe8a469f55d" version = "1.4.1" [[Pkg]] deps = ["Artifacts", "Dates", "Downloads", "LibGit2", "Libdl", "Logging", "Markdown", "Printf", "REPL", "Random", "SHA", "Serialization", "TOML", "Tar", "UUIDs", "p7zip_jll"] uuid = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f" [[Preferences]] deps = ["TOML"] git-tree-sha1 = "00cfd92944ca9c760982747e9a1d0d5d86ab1e5a" uuid = "21216c6a-2e73-6563-6e65-726566657250" version = "1.2.2" [[Printf]] deps = ["Unicode"] uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7" [[REPL]] deps = ["InteractiveUtils", "Markdown", "Sockets", "Unicode"] uuid = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb" [[Random]] deps = ["Serialization"] uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c" [[Random123]] deps = ["Libdl", "Random", "RandomNumbers"] git-tree-sha1 = "0e8b146557ad1c6deb1367655e052276690e71a3" uuid = "74087812-796a-5b5d-8853-05524746bad3" version = "1.4.2" [[RandomNumbers]] deps = ["Random", "Requires"] git-tree-sha1 = "441e6fc35597524ada7f85e13df1f4e10137d16f" uuid = "e6cf234a-135c-5ec9-84dd-332b85af5143" version = "1.4.0" [[Reexport]] git-tree-sha1 = "5f6c21241f0f655da3952fd60aa18477cf96c220" uuid = "189a3867-3050-52da-a836-e630ba90ab69" version = "1.1.0" [[Requires]] deps = ["UUIDs"] git-tree-sha1 = "4036a3bd08ac7e968e27c203d45f5fff15020621" uuid = "ae029012-a4dd-5104-9daa-d747884805df" version = "1.1.3" [[SHA]] uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce" [[Serialization]] uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b" [[SharedArrays]] deps = ["Distributed", "Mmap", "Random", "Serialization"] uuid = "1a1011a3-84de-559e-8e89-a11a2f7dc383" [[Sockets]] uuid = "6462fe0b-24de-5631-8697-dd941f90decc" [[SparseArrays]] deps = ["LinearAlgebra", "Random"] uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf" [[SpecialFunctions]] deps = ["ChainRulesCore", "LogExpFunctions", "OpenSpecFun_jll"] git-tree-sha1 = "a50550fa3164a8c46747e62063b4d774ac1bcf49" uuid = "276daf66-3868-5448-9aa4-cd146d93841b" version = "1.5.1" [[Statistics]] deps = ["LinearAlgebra", "SparseArrays"] uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2" [[TOML]] deps = ["Dates"] uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76" [[Tar]] deps = ["ArgTools", "SHA"] uuid = "a4e569a6-e804-4fa4-b0f3-eef7a1d5b13e" [[Test]] deps = ["InteractiveUtils", "Logging", "Random", "Serialization"] uuid = "8dfed614-e22c-5e08-85e1-65c5234f0b40" [[TimerOutputs]] deps = ["ExprTools", "Printf"] git-tree-sha1 = "209a8326c4f955e2442c07b56029e88bb48299c7" uuid = "a759f4b9-e2f1-59dc-863e-4aeb61b1ea8f" version = "0.5.12" [[UUIDs]] deps = ["Random", "SHA"] uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4" [[Unicode]] uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5" [[Zlib_jll]] deps = ["Libdl"] uuid = "83775a58-1f1d-513f-b197-d71354ab007a" [[nghttp2_jll]] deps = ["Artifacts", "Libdl"] uuid = "8e850ede-7688-5339-a07c-302acd2aaf8d" [[p7zip_jll]] deps = ["Artifacts", "Libdl"] uuid = "3f19e933-33d8-53b3-aaab-bd5110c3b7a0" ```
Expected behavior
This returns
(1.0f0, 2)
ifmapreduce
does not assume commutativity. In CUDA 3.3.4 it returns(1.0f0, 163841)
.Version info
Details on Julia:
``` Julia Version 1.6.2 Commit 1b93d53fc4 (2021-07-14 15:36 UTC) Platform Info: OS: Linux (x86_64-pc-linux-gnu) CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz WORD_SIZE: 64 LIBM: libopenlibm LLVM: libLLVM-11.0.1 (ORCJIT, broadwell) ```Details on CUDA:
``` CUDA toolkit 11.3.1, artifact installation CUDA driver 11.3.0 NVIDIA driver 465.27.0 Libraries: - CUBLAS: 11.5.1 - CURAND: 10.2.4 - CUFFT: 10.4.2 - CUSOLVER: 11.1.2 - CUSPARSE: 11.6.0 - CUPTI: 14.0.0 - NVML: 11.0.0+465.27 - CUDNN: 8.20.0 (for CUDA 11.3.0) - CUTENSOR: 1.3.0 (for CUDA 11.2.0) Toolchain: - Julia: 1.6.2 - LLVM: 11.0.1 - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0 - Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80 9 devices: 0: NVIDIA Tesla V100-PCIE-32GB (sm_70, 29.176 GiB / 31.749 GiB available) 1: NVIDIA Tesla V100-PCIE-32GB (sm_70, 31.631 GiB / 31.749 GiB available) 2: NVIDIA Tesla V100-PCIE-32GB (sm_70, 31.631 GiB / 31.749 GiB available) 3: NVIDIA Tesla V100-PCIE-32GB (sm_70, 31.631 GiB / 31.749 GiB available) 4: NVIDIA Tesla V100-PCIE-16GB (sm_70, 11.554 GiB / 15.782 GiB available) 5: NVIDIA Tesla P100-PCIE-16GB (sm_60, 14.579 GiB / 15.899 GiB available) 6: NVIDIA Tesla P100-PCIE-16GB (sm_60, 15.786 GiB / 15.899 GiB available) 7: NVIDIA GeForce GTX 1080 Ti (sm_61, 10.736 GiB / 10.917 GiB available) 8: NVIDIA GeForce GTX 1080 Ti (sm_61, 10.897 GiB / 10.917 GiB available) ```Additional context
mapreduce(f, op, ::CuArray)
does not work with non-commutativeop
because the access pattern is not contiguous for each thread (blockDim_reduce * gridDim_reduce > 1
):https://github.com/JuliaGPU/CUDA.jl/blob/e6434c163a902158611ab65be725a67d80f7bf10/src/mapreduce.jl#L119-L124
Of course, this kind of "vectorized" access pattern is ideal when the operator is commutative, to have coalesced memory access. Making the access pattern contiguous for a thread means that it is a gather pattern at the warp level.
I experimented with reduction for non-commutative operators with coalesced access using shfl. Interestingly, the overhead is "only" about 10% for
sum
: https://github.com/JuliaFolds/FoldsCUDA.jl/pull/78#issuecomment-882870206. But this overhead is still large enough if you only need to dosum
.I'm not sure what's the best strategy for CUDA.jl. The fact that people didn't notice this probably means that there's not much needs at the moment. But now that
Union
-typedCuArray
is possible, it becomes attractive to write many interesting monoids (which usually are non-commutative). Testing such monoids on CPU and trying it on CUDA would yield confusing results at the moment.As a short-term solution, maybe documenting that
mapreduce
assumes commutative operators would reduce some confusions?