JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/
Other
1.2k stars 218 forks source link

mapreduce still assumes commutative op #1058

Open tkf opened 3 years ago

tkf commented 3 years ago

Describe the bug

mapreduce(f, op, ::CuArray) assumes commutative op.

I should've paid more attention in https://github.com/JuliaGPU/CUDA.jl/pull/500 but it didn't solve https://github.com/JuliaGPU/CUDA.jl/issues/484.

To reproduce

MWE:

using CUDA

function f((xmax, imax), (x, i))
    if xmax < x
        xmax = x
        imax = i
    end
    (xmax, imax)
end

xs = CUDA.zeros(1000000);

fill!(xs, 0);
xs[2] = 1;
xs[163841] = 1;

mapreduce(tuple, f, xs, eachindex(xs); init = (typemin(xs[1]), -1))
Manifest.toml

``` # This file is machine-generated - editing it directly is not advised [[AbstractFFTs]] deps = ["LinearAlgebra"] git-tree-sha1 = "485ee0867925449198280d4af84bdb46a2a404d0" uuid = "621f4979-c628-5d54-868e-fcf4e3e8185c" version = "1.0.1" [[Adapt]] deps = ["LinearAlgebra"] git-tree-sha1 = "84918055d15b3114ede17ac6a7182f68870c16f7" uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e" version = "3.3.1" [[ArgTools]] uuid = "0dad84c5-d112-42e6-8d28-ef12dabb789f" [[Artifacts]] uuid = "56f22d72-fd6d-98f1-02f0-08ddc0907c33" [[BFloat16s]] deps = ["LinearAlgebra", "Test"] git-tree-sha1 = "4af69e205efc343068dc8722b8dfec1ade89254a" uuid = "ab4f0b2a-ad5b-11e8-123f-65d77653426b" version = "0.1.0" [[Base64]] uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f" [[CEnum]] git-tree-sha1 = "215a9aa4a1f23fbd05b92769fdd62559488d70e9" uuid = "fa961155-64e5-5f13-b03f-caf6b980ea82" version = "0.4.1" [[CUDA]] deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CompilerSupportLibraries_jll", "DataStructures", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "LazyArtifacts", "Libdl", "LinearAlgebra", "Logging", "Printf", "Random", "Random123", "RandomNumbers", "Reexport", "Requires", "SparseArrays", "SpecialFunctions", "TimerOutputs"] git-tree-sha1 = "5e696e37e51b01ae07bd9f700afe6cbd55250bce" uuid = "052768ef-5323-5732-b1bb-66c8b64840ba" version = "3.3.4" [[ChainRulesCore]] deps = ["Compat", "LinearAlgebra", "SparseArrays"] git-tree-sha1 = "f53ca8d41e4753c41cdafa6ec5f7ce914b34be54" uuid = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4" version = "0.10.13" [[Compat]] deps = ["Base64", "Dates", "DelimitedFiles", "Distributed", "InteractiveUtils", "LibGit2", "Libdl", "LinearAlgebra", "Markdown", "Mmap", "Pkg", "Printf", "REPL", "Random", "SHA", "Serialization", "SharedArrays", "Sockets", "SparseArrays", "Statistics", "Test", "UUIDs", "Unicode"] git-tree-sha1 = "dc7dedc2c2aa9faf59a55c622760a25cbefbe941" uuid = "34da2185-b29b-5c13-b0c7-acf172513d20" version = "3.31.0" [[CompilerSupportLibraries_jll]] deps = ["Artifacts", "Libdl"] uuid = "e66e0078-7015-5450-92f7-15fbd957f2ae" [[DataStructures]] deps = ["Compat", "InteractiveUtils", "OrderedCollections"] git-tree-sha1 = "4437b64df1e0adccc3e5d1adbc3ac741095e4677" uuid = "864edb3b-99cc-5e75-8d2d-829cb0a9cfe8" version = "0.18.9" [[Dates]] deps = ["Printf"] uuid = "ade2ca70-3891-5945-98fb-dc099432e06a" [[DelimitedFiles]] deps = ["Mmap"] uuid = "8bb1440f-4735-579b-a4ab-409b98df4dab" [[Distributed]] deps = ["Random", "Serialization", "Sockets"] uuid = "8ba89e20-285c-5b6f-9357-94700520ee1b" [[DocStringExtensions]] deps = ["LibGit2"] git-tree-sha1 = "a32185f5428d3986f47c2ab78b1f216d5e6cc96f" uuid = "ffbed154-4ef7-542d-bbb7-c09d3a79fcae" version = "0.8.5" [[Downloads]] deps = ["ArgTools", "LibCURL", "NetworkOptions"] uuid = "f43a241f-c20a-4ad4-852c-f6b1247861c6" [[ExprTools]] git-tree-sha1 = "b7e3d17636b348f005f11040025ae8c6f645fe92" uuid = "e2ba6199-217a-4e67-a87a-7c52f15ade04" version = "0.1.6" [[GPUArrays]] deps = ["AbstractFFTs", "Adapt", "LinearAlgebra", "Printf", "Random", "Serialization", "Statistics"] git-tree-sha1 = "ececbf05f8904c92814bdbd0aafd5540b0bf2e9a" uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7" version = "7.0.1" [[GPUCompiler]] deps = ["DataStructures", "ExprTools", "InteractiveUtils", "LLVM", "Libdl", "Logging", "TimerOutputs", "UUIDs"] git-tree-sha1 = "e8a09182a4440489e2e3dedff5ad3f6bbe555396" uuid = "61eb1bfa-7361-4325-ad38-22787b887f55" version = "0.12.5" [[InteractiveUtils]] deps = ["Markdown"] uuid = "b77e0a4c-d291-57a0-90e8-8db25a27a240" [[JLLWrappers]] deps = ["Preferences"] git-tree-sha1 = "642a199af8b68253517b80bd3bfd17eb4e84df6e" uuid = "692b3bcd-3c85-4b1f-b108-f13ce0eb3210" version = "1.3.0" [[LLVM]] deps = ["CEnum", "LLVMExtra_jll", "Libdl", "Printf", "Unicode"] git-tree-sha1 = "1b7ba36ea7aa6fa2278118951bad114fbb8359f2" uuid = "929cbde3-209d-540e-8aea-75f648917ca0" version = "4.1.0" [[LLVMExtra_jll]] deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"] git-tree-sha1 = "b36c0677a0549c7d1dc8719899a4133abbfacf7d" uuid = "dad2f222-ce93-54a1-a47d-0025e8a3acab" version = "0.0.6+0" [[LazyArtifacts]] deps = ["Artifacts", "Pkg"] uuid = "4af54fe1-eca0-43a8-85a7-787d91b784e3" [[LibCURL]] deps = ["LibCURL_jll", "MozillaCACerts_jll"] uuid = "b27032c2-a3e7-50c8-80cd-2d36dbcbfd21" [[LibCURL_jll]] deps = ["Artifacts", "LibSSH2_jll", "Libdl", "MbedTLS_jll", "Zlib_jll", "nghttp2_jll"] uuid = "deac9b47-8bc7-5906-a0fe-35ac56dc84c0" [[LibGit2]] deps = ["Base64", "NetworkOptions", "Printf", "SHA"] uuid = "76f85450-5226-5b5a-8eaa-529ad045b433" [[LibSSH2_jll]] deps = ["Artifacts", "Libdl", "MbedTLS_jll"] uuid = "29816b5a-b9ab-546f-933c-edad1886dfa8" [[Libdl]] uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb" [[LinearAlgebra]] deps = ["Libdl"] uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e" [[LogExpFunctions]] deps = ["DocStringExtensions", "LinearAlgebra"] git-tree-sha1 = "7bd5f6565d80b6bf753738d2bc40a5dfea072070" uuid = "2ab3a3ac-af41-5b50-aa03-7779005ae688" version = "0.2.5" [[Logging]] uuid = "56ddb016-857b-54e1-b83d-db4d58db5568" [[Markdown]] deps = ["Base64"] uuid = "d6f4376e-aef5-505a-96c1-9c027394607a" [[MbedTLS_jll]] deps = ["Artifacts", "Libdl"] uuid = "c8ffd9c3-330d-5841-b78e-0817d7145fa1" [[Mmap]] uuid = "a63ad114-7e13-5084-954f-fe012c677804" [[MozillaCACerts_jll]] uuid = "14a3606d-f60d-562e-9121-12d972cd8159" [[NetworkOptions]] uuid = "ca575930-c2e3-43a9-ace4-1e988b2c1908" [[OpenSpecFun_jll]] deps = ["Artifacts", "CompilerSupportLibraries_jll", "JLLWrappers", "Libdl", "Pkg"] git-tree-sha1 = "13652491f6856acfd2db29360e1bbcd4565d04f1" uuid = "efe28fd5-8261-553b-a9e1-b2916fc3738e" version = "0.5.5+0" [[OrderedCollections]] git-tree-sha1 = "85f8e6578bf1f9ee0d11e7bb1b1456435479d47c" uuid = "bac558e1-5e72-5ebc-8fee-abe8a469f55d" version = "1.4.1" [[Pkg]] deps = ["Artifacts", "Dates", "Downloads", "LibGit2", "Libdl", "Logging", "Markdown", "Printf", "REPL", "Random", "SHA", "Serialization", "TOML", "Tar", "UUIDs", "p7zip_jll"] uuid = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f" [[Preferences]] deps = ["TOML"] git-tree-sha1 = "00cfd92944ca9c760982747e9a1d0d5d86ab1e5a" uuid = "21216c6a-2e73-6563-6e65-726566657250" version = "1.2.2" [[Printf]] deps = ["Unicode"] uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7" [[REPL]] deps = ["InteractiveUtils", "Markdown", "Sockets", "Unicode"] uuid = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb" [[Random]] deps = ["Serialization"] uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c" [[Random123]] deps = ["Libdl", "Random", "RandomNumbers"] git-tree-sha1 = "0e8b146557ad1c6deb1367655e052276690e71a3" uuid = "74087812-796a-5b5d-8853-05524746bad3" version = "1.4.2" [[RandomNumbers]] deps = ["Random", "Requires"] git-tree-sha1 = "441e6fc35597524ada7f85e13df1f4e10137d16f" uuid = "e6cf234a-135c-5ec9-84dd-332b85af5143" version = "1.4.0" [[Reexport]] git-tree-sha1 = "5f6c21241f0f655da3952fd60aa18477cf96c220" uuid = "189a3867-3050-52da-a836-e630ba90ab69" version = "1.1.0" [[Requires]] deps = ["UUIDs"] git-tree-sha1 = "4036a3bd08ac7e968e27c203d45f5fff15020621" uuid = "ae029012-a4dd-5104-9daa-d747884805df" version = "1.1.3" [[SHA]] uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce" [[Serialization]] uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b" [[SharedArrays]] deps = ["Distributed", "Mmap", "Random", "Serialization"] uuid = "1a1011a3-84de-559e-8e89-a11a2f7dc383" [[Sockets]] uuid = "6462fe0b-24de-5631-8697-dd941f90decc" [[SparseArrays]] deps = ["LinearAlgebra", "Random"] uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf" [[SpecialFunctions]] deps = ["ChainRulesCore", "LogExpFunctions", "OpenSpecFun_jll"] git-tree-sha1 = "a50550fa3164a8c46747e62063b4d774ac1bcf49" uuid = "276daf66-3868-5448-9aa4-cd146d93841b" version = "1.5.1" [[Statistics]] deps = ["LinearAlgebra", "SparseArrays"] uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2" [[TOML]] deps = ["Dates"] uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76" [[Tar]] deps = ["ArgTools", "SHA"] uuid = "a4e569a6-e804-4fa4-b0f3-eef7a1d5b13e" [[Test]] deps = ["InteractiveUtils", "Logging", "Random", "Serialization"] uuid = "8dfed614-e22c-5e08-85e1-65c5234f0b40" [[TimerOutputs]] deps = ["ExprTools", "Printf"] git-tree-sha1 = "209a8326c4f955e2442c07b56029e88bb48299c7" uuid = "a759f4b9-e2f1-59dc-863e-4aeb61b1ea8f" version = "0.5.12" [[UUIDs]] deps = ["Random", "SHA"] uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4" [[Unicode]] uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5" [[Zlib_jll]] deps = ["Libdl"] uuid = "83775a58-1f1d-513f-b197-d71354ab007a" [[nghttp2_jll]] deps = ["Artifacts", "Libdl"] uuid = "8e850ede-7688-5339-a07c-302acd2aaf8d" [[p7zip_jll]] deps = ["Artifacts", "Libdl"] uuid = "3f19e933-33d8-53b3-aaab-bd5110c3b7a0" ```

Expected behavior

This returns (1.0f0, 2) if mapreduce does not assume commutativity. In CUDA 3.3.4 it returns (1.0f0, 163841).

Version info

Details on Julia: ``` Julia Version 1.6.2 Commit 1b93d53fc4 (2021-07-14 15:36 UTC) Platform Info: OS: Linux (x86_64-pc-linux-gnu) CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz WORD_SIZE: 64 LIBM: libopenlibm LLVM: libLLVM-11.0.1 (ORCJIT, broadwell) ```
Details on CUDA: ``` CUDA toolkit 11.3.1, artifact installation CUDA driver 11.3.0 NVIDIA driver 465.27.0 Libraries: - CUBLAS: 11.5.1 - CURAND: 10.2.4 - CUFFT: 10.4.2 - CUSOLVER: 11.1.2 - CUSPARSE: 11.6.0 - CUPTI: 14.0.0 - NVML: 11.0.0+465.27 - CUDNN: 8.20.0 (for CUDA 11.3.0) - CUTENSOR: 1.3.0 (for CUDA 11.2.0) Toolchain: - Julia: 1.6.2 - LLVM: 11.0.1 - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0 - Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80 9 devices: 0: NVIDIA Tesla V100-PCIE-32GB (sm_70, 29.176 GiB / 31.749 GiB available) 1: NVIDIA Tesla V100-PCIE-32GB (sm_70, 31.631 GiB / 31.749 GiB available) 2: NVIDIA Tesla V100-PCIE-32GB (sm_70, 31.631 GiB / 31.749 GiB available) 3: NVIDIA Tesla V100-PCIE-32GB (sm_70, 31.631 GiB / 31.749 GiB available) 4: NVIDIA Tesla V100-PCIE-16GB (sm_70, 11.554 GiB / 15.782 GiB available) 5: NVIDIA Tesla P100-PCIE-16GB (sm_60, 14.579 GiB / 15.899 GiB available) 6: NVIDIA Tesla P100-PCIE-16GB (sm_60, 15.786 GiB / 15.899 GiB available) 7: NVIDIA GeForce GTX 1080 Ti (sm_61, 10.736 GiB / 10.917 GiB available) 8: NVIDIA GeForce GTX 1080 Ti (sm_61, 10.897 GiB / 10.917 GiB available) ```

Additional context

mapreduce(f, op, ::CuArray) does not work with non-commutative op because the access pattern is not contiguous for each thread (blockDim_reduce * gridDim_reduce > 1):

https://github.com/JuliaGPU/CUDA.jl/blob/e6434c163a902158611ab65be725a67d80f7bf10/src/mapreduce.jl#L119-L124

Of course, this kind of "vectorized" access pattern is ideal when the operator is commutative, to have coalesced memory access. Making the access pattern contiguous for a thread means that it is a gather pattern at the warp level.

I experimented with reduction for non-commutative operators with coalesced access using shfl. Interestingly, the overhead is "only" about 10% for sum: https://github.com/JuliaFolds/FoldsCUDA.jl/pull/78#issuecomment-882870206. But this overhead is still large enough if you only need to do sum.

I'm not sure what's the best strategy for CUDA.jl. The fact that people didn't notice this probably means that there's not much needs at the moment. But now that Union-typed CuArray is possible, it becomes attractive to write many interesting monoids (which usually are non-commutative). Testing such monoids on CPU and trying it on CUDA would yield confusing results at the moment.

As a short-term solution, maybe documenting that mapreduce assumes commutative operators would reduce some confusions?

maleadt commented 3 years ago

As a short-term solution, maybe documenting that mapreduce assumes commutative operators would reduce some confusions?

I'm not sure people actually read docstrings for specific mapreduce methods.

Maybe we should add an algorithm or commutative keyword argument though, where we can switch the grid-stride loop to one compatible with non-commutative operators. That may also allow us to partially revert https://github.com/JuliaGPU/CUDA.jl/pull/500, because IIRC that also resulted in worse coalescing (as visualized in https://sodocumentation.net/cuda/topic/6566/parallel-reduction--e-g--how-to-sum-an-array-).

tkf commented 3 years ago

Yeah, the opt-in commutativity assumption sounds like a good idea to me.

(+ Thanks for sharing an interesting article!)