TensorBFS / CuTropicalGEMM.jl

The fastest Tropical number matrix multiplication on GPU
MIT License
9 stars 0 forks source link

Example: generic tensor network #22

Open GiggleLiu opened 11 months ago

GiggleLiu commented 11 months ago

To run the example

cd examples/combinatorial-optimization
julia --project -e "using Pkg; Pkg.instantiate()"
julia --project main.jl

Performance

Device: NVIDIA RTX A4500, CUDA 12.0

Before CuTropicalGEMM

julia> @time Array(solve(tn, SizeMax(); usecuda=true, T=Float32));
 10.772101 seconds (1.79 M allocations: 83.646 MiB, 0.95% gc time)

@time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
 42.961929 seconds (11.23 M allocations: 425.264 MiB, 0.29% gc time, 0.06% compilation time)

After CuTropicalGEMM

julia> @time Array(solve(tn, SizeMax(); usecuda=true, T=Float32));
  1.109122 seconds (1.29 M allocations: 61.970 MiB)

julia> @time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
 20.308956 seconds (9.96 M allocations: 366.654 MiB, 0.48% gc time, 0.01% compilation time)

Profile

julia> CUDA.@profile Array(solve(tn, SizeMax(); usecuda=true, T=Float32))
Profiler ran for 1.7 s, capturing 157501 events.

Host-side activity: calling CUDA APIs took 732.81 ms (43.10% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│   32.07% │ 545.26 ms │  8289 │  65.78 µs │    3.1 µs │ 507.52 ms │ cudaLaunchKernel        │
│   22.21% │  377.7 ms │  4694 │  80.47 µs │   2.86 µs │   60.8 ms │ cuStreamSynchronize     │
│    4.17% │  70.98 ms │ 14646 │   4.85 µs │   2.15 µs │ 416.76 µs │ cuMemAllocFromPoolAsync │
│    2.81% │  47.79 ms │  4692 │  10.19 µs │   3.34 µs │  11.46 ms │ cuMemcpyHtoDAsync       │
│    2.26% │  38.41 ms │  6967 │   5.51 µs │   3.81 µs │  45.06 µs │ cuLaunchKernel          │
│    0.01% │ 211.95 µs │     1 │ 211.95 µs │ 211.95 µs │ 211.95 µs │ cuMemcpyDtoHAsync       │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 712.51 ms (41.90% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                                                                                                                                                 ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   38.98% │ 662.74 ms │  8282 │  80.02 µs │   3.58 µs │  534.3 ms │ _Z16FLOAT_maxplus_NNILi64ELi32ELi64ELi4ELi4EEvPfS0_S0_ffiiiii                                                                                        ⋯
│    0.74% │   12.6 ms │    11 │   1.15 ms │   1.43 µs │  12.45 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_S4_ES3_IS ⋯
│    0.73% │  12.47 ms │    33 │ 377.79 µs │   1.19 µs │  12.35 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_ES3_IS4_S ⋯
│    0.48% │   8.24 ms │     4 │   2.06 ms │  10.49 µs │   4.16 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_S4_S4_S4_ ⋯
│    0.26% │   4.47 ms │  3329 │   1.34 µs │ 953.67 ns │   4.05 µs │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_ES3_IS4_S4_E                ⋯
│    0.20% │    3.4 ms │  4692 │ 725.62 ns │ 238.42 ns │   2.38 µs │ [copy pageable to device memory]                                                                                                             
...

julia> CUDA.@profile solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32)
Profiler ran for 28.53 s, capturing 1077654 events.

Host-side activity: calling CUDA APIs took 9.62 s (33.72% of the trace)
┌──────────┬───────────┬───────┬───────────┬──────────┬───────────┬─────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │ Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼──────────┼───────────┼─────────────────────────┤
│   59.20% │   16.89 s │ 70375 │ 240.04 µs │  2.15 µs │    2.99 s │ cuStreamSynchronize     │
│    3.16% │ 900.96 ms │ 63114 │  14.28 µs │  3.81 µs │   7.76 ms │ cuLaunchKernel          │
│    2.87% │  818.6 ms │ 25043 │  32.69 µs │  3.34 µs │  548.1 ms │ cudaLaunchKernel        │
│    2.70% │ 770.51 ms │ 23459 │  32.84 µs │ 14.07 µs │ 534.53 µs │ cuMemcpyDtoHAsync       │
│    1.74% │ 495.97 ms │ 53489 │   9.27 µs │  2.15 µs │   1.02 ms │ cuMemAllocFromPoolAsync │
│    1.06% │ 302.92 ms │ 23457 │  12.91 µs │  3.58 µs │   3.81 ms │ cuMemcpyHtoDAsync       │
└──────────┴───────────┴───────┴───────────┴──────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 7.8 s (27.34% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                                                                                                                                                 ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   22.70% │    6.48 s │ 24932 │ 259.74 µs │    3.1 µs │ 582.84 ms │ _Z16FLOAT_maxplus_NNILi64ELi32ELi64ELi4ELi4EEvPfS0_S0_ffiiiii                                                                                        ⋯
│    1.21% │  345.8 ms │     1 │  345.8 ms │  345.8 ms │  345.8 ms │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi28ELi1EE11BroadcastedI12CuArrayStyleILi28EE5TupleI5OneToI5Int64ES6_IS7_ES6 ⋯
│    1.01% │ 286.84 ms │     2 │ 143.42 ms │ 140.71 ms │ 146.13 ms │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi27ELi1EE11BroadcastedI12CuArrayStyleILi27EE5TupleI5OneToI5Int64ES6_IS7_ES6 ⋯
│    0.40% │ 114.32 ms │  9382 │  12.19 µs │   2.38 µs │  31.29 ms │ _Z22partial_mapreduce_grid8identity9reductionI6islessE5TupleI4Bool5Int64E16CartesianIndicesILi1ES2_I5OneToIS4_EEES5_ILi1ES2_IS6_IS4_EEE3ValILinfalse ⋯
│    0.35% │ 100.62 ms │ 23459 │   4.29 µs │ 476.84 ns │ 161.65 µs │ [copy device to pageable memory]                                                                                                                     ⋯
│    0.26% │  73.67 ms │     1 │  73.67 ms │  73.67 ms │  73.67 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_S4_S4_S4_ ⋯
│    0.23% │  65.95 ms │     1 │  65.95 ms │  65.95 ms │  65.95 ms │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi26ELi1EE11BroadcastedI12CuArrayStyleILi26EE5TupleI5OneToI5Int64ES6_IS7_ES6 ⋯
│    0.21% │  60.48 ms │     2 │  30.24 ms │  11.44 µs │  60.47 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_S4_S4_S4_ ⋯
│    0.14% │  39.56 ms │  9382 │   4.22 µs │   1.19 µs │   8.41 ms │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI4BoolLi1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5Int64EE8isapproxS4_I8ExtrudedIS0 ⋯
│    0.14% │  38.68 ms │    39 │  991.7 µs │   1.43 µs │  19.18 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_S4_ES3_IS ⋯
│    0.09% │  24.75 ms │     1 │  24.75 ms │  24.75 ms │  24.75 ms │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi25ELi1EE11BroadcastedI12CuArrayStyleILi25EE5TupleI5OneToI5Int64ES6_IS7_ES6 ⋯
│    0.07% │  20.93 ms │    96 │ 218.02 µs │   1.19 µs │  13.96 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_ES3_IS4_S ⋯
│    0.07% │  20.69 ms │    10 │   2.07 ms │  10.49 µs │   6.73 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_S4_S4_S4_ ⋯
...
codecov[bot] commented 11 months ago

Welcome to Codecov :tada:

Once merged to your default branch, Codecov will compare your coverage reports and display the results in this comment.

Thanks for integrating Codecov - We've got you covered :open_umbrella:

GiggleLiu commented 11 months ago

The same benchmark on A800

Before CuTropicalGEMM

julia> @time Array(solve(tn, SizeMax(); usecuda=true, T=Float32));
  4.007866 seconds (1.71 M allocations: 81.452 MiB)

julia> @time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
 59.282821 seconds (11.29 M allocations: 427.296 MiB, 0.56% gc time, 0.17% compilation time)

After CuTropicalGEMM

julia> @time Array(solve(tn, SizeMax(); usecuda=true, T=Float32));
  1.210406 seconds (1.29 M allocations: 61.970 MiB)

julia> @time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
 40.313153 seconds (10.04 M allocations: 369.083 MiB, 0.59% gc time, 0.00% compilation time)

It is slower than A4500...

Profile

julia> CUDA.@profile Array(solve(tn, SizeMax(); usecuda=true, T=Float32))
Profiler ran for 1.24 s, capturing 157509 events.

Host-side activity: calling CUDA APIs took 264.52 ms (21.37% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│   13.68% │ 169.35 ms │  8289 │  20.43 µs │   2.38 µs │ 142.61 ms │ cudaLaunchKernel        │
│    3.53% │  43.64 ms │ 14646 │   2.98 µs │   1.43 µs │   5.76 ms │ cuMemAllocFromPoolAsync │
│    2.02% │  25.05 ms │  6967 │    3.6 µs │   2.62 µs │  65.09 µs │ cuLaunchKernel          │
│    1.37% │  16.97 ms │  4694 │   3.62 µs │   1.43 µs │   6.89 ms │ cuStreamSynchronize     │
│    0.96% │  11.89 ms │  4692 │   2.53 µs │   1.91 µs │  20.03 µs │ cuMemcpyHtoDAsync       │
│    0.00% │   26.7 µs │     1 │   26.7 µs │   26.7 µs │   26.7 µs │ cuMemcpyDtoHAsync       │
│    0.00% │   1.19 µs │     6 │ 198.68 ns │    0.0 ns │ 476.84 ns │ cuDeviceGetCount        │
│    0.00% │ 238.42 ns │     1 │ 238.42 ns │ 238.42 ns │ 238.42 ns │ cuDriverGetVersion      │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 298.18 ms (24.09% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────────────────────────────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                                          ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────────────────────────────────────
│   20.84% │ 257.85 ms │  8282 │  31.13 µs │   4.77 µs │ 167.64 ms │ _Z16FLOAT_maxplus_NNILi64ELi32ELi64ELi4ELi4EE ⋯
│    0.66% │   8.19 ms │  3329 │   2.46 µs │   1.67 µs │   3.34 µs │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.49% │   6.07 ms │  4692 │   1.29 µs │ 953.67 ns │   2.15 µs │ [copy pageable to device memory]              ⋯
│    0.42% │   5.25 ms │    11 │ 476.97 µs │   2.86 µs │   5.11 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.42% │   5.24 ms │  1575 │   3.33 µs │   2.38 µs │   4.05 µs │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.42% │   5.16 ms │    33 │ 156.45 µs │   2.15 µs │   5.02 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.26% │   3.25 ms │     4 │ 812.23 µs │   8.82 µs │    1.7 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.19% │    2.4 ms │  1001 │   2.39 µs │   1.67 µs │   4.05 µs │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
...

julia> CUDA.@profile solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
Profiler ran for 91.44 s, capturing 1077252 events.

Host-side activity: calling CUDA APIs took 3.42 s (3.74% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│    3.91% │    3.58 s │ 70375 │  50.82 µs │   1.43 µs │    1.72 s │ cuStreamSynchronize     │
│    0.41% │ 372.21 ms │ 23459 │  15.87 µs │  12.64 µs │  45.06 µs │ cuMemcpyDtoHAsync       │
│    0.32% │  291.3 ms │ 63080 │   4.62 µs │   2.62 µs │   7.47 ms │ cuLaunchKernel          │
│    0.29% │ 261.51 ms │ 25043 │  10.44 µs │   2.38 µs │ 159.61 ms │ cudaLaunchKernel        │
│    0.21% │ 189.58 ms │ 53455 │   3.55 µs │   1.43 µs │   10.1 ms │ cuMemAllocFromPoolAsync │
│    0.09% │  81.43 ms │ 23457 │   3.47 µs │   2.15 µs │  40.05 µs │ cuMemcpyHtoDAsync       │
│    0.00% │   1.19 µs │     4 │ 298.02 ns │ 238.42 ns │ 476.84 ns │ cuDeviceGetCount        │
│    0.00% │ 476.84 ns │     2 │ 238.42 ns │ 238.42 ns │ 238.42 ns │ cuDeviceGet             │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 4.22 s (4.62% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────────────────────────────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                                          ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────────────────────────────────────
│    3.95% │    3.62 s │ 24932 │ 145.03 µs │   4.77 µs │  186.6 ms │ _Z16FLOAT_maxplus_NNILi64ELi32ELi64ELi4ELi4EE ⋯
│    0.11% │ 100.07 ms │     1 │ 100.07 ms │ 100.07 ms │ 100.07 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.08% │  74.37 ms │  9382 │   7.93 µs │   3.34 µs │  11.95 ms │ _Z22partial_mapreduce_grid8identity9reduction ⋯
│    0.08% │  73.09 ms │     2 │  36.55 ms │  36.53 ms │  36.57 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.06% │  54.96 ms │ 23459 │   2.34 µs │   1.43 µs │   3.58 µs │ [copy device to pageable memory]              ⋯
│    0.04% │  32.26 ms │  9382 │   3.44 µs │   1.91 µs │   1.88 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.03% │  31.29 ms │ 23457 │   1.33 µs │ 953.67 ns │   4.53 µs │ [copy pageable to device memory]              ⋯
│    0.03% │  24.49 ms │ 10001 │   2.45 µs │   1.67 µs │   3.34 µs │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.03% │  23.08 ms │  9382 │   2.46 µs │   1.43 µs │ 973.46 µs │ _Z2_615CuKernelContext13CuDeviceArrayI8Tropic ⋯
│    0.02% │  22.43 ms │     1 │  22.43 ms │  22.43 ms │  22.43 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.02% │  22.14 ms │  9382 │   2.36 µs │   1.43 µs │ 397.92 µs │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.02% │  19.92 ms │     2 │   9.96 ms │  11.44 µs │  19.91 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.02% │  16.53 ms │     1 │  16.53 ms │  16.53 ms │  16.53 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.02% │  14.25 ms │    39 │ 365.39 µs │   2.15 µs │   7.24 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.01% │  10.38 ms │  3152 │   3.29 µs │   2.15 µs │   4.05 µs │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
...

From the profile we can see the slow down is maybe related to the stream synchronization.

GiggleLiu commented 11 months ago

In some tasks, the CuTropicalGEMM speed up the code as expected. However, in another set of tasks, the stream synchronization takes too muhc time.

Maybe it is related to stream synchronizing mechanism and is related to #19 .

ArrogantGao commented 11 months ago

That is strange, all stream synchronization should be done in C code but not in Julia. I will have a check of that and try to fix #19.

GiggleLiu commented 11 months ago

Profile the version without CuTropicalGEMM

I profiled the version without CuTropicalGEMM. It also has a lot of time spent on neither host nor device. So wield.

julia> CUDA.@profile Array(solve(tn, SizeMax(); usecuda=true, T=Float32))
Profiler ran for 4.15 s, capturing 198954 events.

Host-side activity: calling CUDA APIs took 3.1 s (74.69% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name      ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────
│   72.43% │     3.0 s │ 15256 │ 196.97 µs │   2.38 µs │    2.72 s │ cuLaunchK ⋯
│    1.34% │  55.78 ms │ 14646 │   3.81 µs │   2.86 µs │  31.71 µs │ cuMemAllo ⋯
│    0.28% │  11.71 ms │  4692 │    2.5 µs │   1.91 µs │  16.45 µs │ cuMemcpyH ⋯
│    0.24% │  10.09 ms │  4694 │   2.15 µs │   1.43 µs │   4.53 µs │ cuStreamS ⋯
│    0.00% │  29.56 µs │     1 │  29.56 µs │  29.56 µs │  29.56 µs │ cuMemcpyD ⋯
│    0.00% │ 715.26 ns │     6 │ 119.21 ns │    0.0 ns │ 476.84 ns │ cuDeviceG ⋯
│    0.00% │ 238.42 ns │     1 │ 238.42 ns │ 238.42 ns │ 238.42 ns │ cuDriverG ⋯
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴────────────
                                                                1 column omitted

Device-side activity: GPU was busy for 3.07 s (74.12% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬──────────┬─────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │ Max time │ Name       ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼──────────┼─────────────
│   73.24% │    3.04 s │  8289 │ 366.53 µs │   1.43 µs │   2.74 s │ matmatmul_ ⋯
│    0.16% │   6.77 ms │  3329 │   2.03 µs │   1.19 µs │  2.86 µs │ _Z18permut ⋯
│    0.15% │   6.05 ms │  4692 │   1.29 µs │ 953.67 ns │  2.15 µs │ [copy page ⋯
│    0.12% │   5.11 ms │    33 │ 154.94 µs │   1.67 µs │  4.97 ms │ _Z18permut ⋯
│    0.12% │   5.07 ms │    11 │ 460.88 µs │   2.86 µs │  4.94 ms │ _Z18permut ⋯
│    0.10% │   4.23 ms │  1575 │   2.68 µs │   1.91 µs │  3.81 µs │ _Z16broadc ⋯
│    0.08% │   3.25 ms │     4 │ 811.46 µs │   8.58 µs │   1.7 ms │ _Z18permut ⋯
...

julia> CUDA.@profile solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float64)
Profiler ran for 122.63 s, capturing 1202467 events.

Host-side activity: calling CUDA APIs took 16.83 s (13.73% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│   13.83% │   16.96 s │ 70375 │ 241.05 µs │ 476.84 ns │    5.15 s │ cuStreamSynchronize     │
│    4.82% │    5.91 s │ 88123 │  67.06 µs │   2.15 µs │    5.09 s │ cuLaunchKernel          │
│    0.41% │ 502.22 ms │ 53455 │    9.4 µs │   2.86 µs │ 106.68 ms │ cuMemAllocFromPoolAsync │
│    0.31% │ 375.17 ms │ 23459 │  15.99 µs │   12.4 µs │  68.43 µs │ cuMemcpyDtoHAsync       │
│    0.07% │  81.05 ms │ 23457 │   3.46 µs │   1.91 µs │  40.77 µs │ cuMemcpyHtoDAsync       │
│    0.00% │   1.67 µs │     4 │ 417.23 ns │ 238.42 ns │ 715.26 ns │ cuDeviceGetCount        │
│    0.00% │ 238.42 ns │     2 │ 119.21 ns │    0.0 ns │ 238.42 ns │ cuDeviceGet             │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 23.07 s (18.81% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────────────────────────────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                                          ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────────────────────────────────────
│   18.35% │    22.5 s │ 25043 │ 898.56 µs │   1.43 µs │     5.1 s │ matmatmul_                                    ⋯
│    0.07% │  83.97 ms │     1 │  83.97 ms │  83.97 ms │  83.97 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.06% │  74.99 ms │     2 │  37.49 ms │  37.49 ms │   37.5 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.06% │  68.46 ms │  9382 │    7.3 µs │   2.86 µs │   10.2 ms │ _Z22partial_mapreduce_grid8identity9reduction ⋯
│    0.04% │  54.84 ms │ 23459 │   2.34 µs │   1.67 µs │   5.01 µs │ [copy device to pageable memory]              ⋯
│    0.03% │  31.13 ms │ 23457 │   1.33 µs │ 715.26 ns │   2.62 µs │ [copy pageable to device memory]              ⋯
│    0.02% │  30.49 ms │  9382 │   3.25 µs │   1.43 µs │   2.95 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.02% │  21.08 ms │     1 │  21.08 ms │  21.08 ms │  21.08 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.02% │  19.82 ms │ 10001 │   1.98 µs │   1.19 µs │    3.1 µs │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.02% │  19.18 ms │  9382 │   2.04 µs │   1.19 µs │   1.11 ms │ _Z2_615CuKernelContext13CuDeviceArrayI8Tropic ⋯
│    0.01% │  17.54 ms │  9382 │   1.87 µs │   1.19 µs │ 322.58 µs │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.01% │  17.24 ms │     1 │  17.24 ms │  17.24 ms │  17.24 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.01% │  16.85 ms │     2 │   8.42 ms │  13.35 µs │  16.83 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.01% │   16.2 ms │    39 │  415.4 µs │   2.62 µs │   8.12 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
...
GiggleLiu commented 11 months ago

A related issue: https://github.com/under-Peter/OMEinsum.jl/issues/133#issuecomment-1012254046

ArrogantGao commented 9 months ago

The performance on A800 seems to be better after I fixed the stream problem, two times faster that the previous benchmark on A4500.

julia> @time Array(solve(tn, SizeMax(); usecuda=true, T=Float32));
  0.558056 seconds (636.56 k allocations: 32.896 MiB)

julia> @time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
 11.641459 seconds (9.31 M allocations: 351.891 MiB, 2.37% gc time, 0.01% compilation time)

It is shown by the profiler that the time neither on CPU nor on GPU disappear.

julia> CUDA.@profile Array(solve(tn, SizeMax(); usecuda=true, T=Float32))
Profiler ran for 661.94 ms, capturing 202287 events.

Host-side activity: calling CUDA APIs took 390.1 ms (58.93% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                        │ Name                    │
├──────────┼────────────┼───────┼──────────────────────────────────────────┼─────────────────────────┤
│   23.95% │  158.53 ms │  8289 │  19.13 µs ± 1353.11 (   3.1 ‥ 123196.84) │ cudaLaunchKernel        │
│   21.22% │  140.46 ms │ 14646 │   9.59 µs ± 329.82 (  1.19 ‥ 27344.47)   │ cuMemAllocFromPoolAsync │
│    6.59% │   43.62 ms │  6967 │   6.26 µs ± 123.84 (  3.58 ‥ 10338.78)   │ cuLaunchKernel          │
│    2.18% │    14.4 ms │  4692 │   3.07 µs ± 1.29   (  2.38 ‥ 88.69)      │ cuMemcpyHtoDAsync       │
│    1.13% │    7.45 ms │  4694 │   1.59 µs ± 59.94  (  0.48 ‥ 4107.48)    │ cuStreamSynchronize     │
│    0.01% │   35.29 µs │     1 │                                          │ cuMemcpyDtoHAsync       │
│    0.00% │  238.42 ns │     6 │  39.74 ns ± 97.33  (   0.0 ‥ 238.42)     │ cuDeviceGetCount        │
│    0.00% │  238.42 ns │     1 │                                          │ cuDriverGetVersion      │
└──────────┴────────────┴───────┴──────────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 281.15 ms (42.47% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                        │ Name                                                                                                                                 ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   37.33% │  247.08 ms │  8282 │  29.83 µs ± 1869.26 (  4.29 ‥ 169630.77) │ _Z16FLOAT_maxplus_NNILi64ELi32ELi64ELi4ELi4EEvPfS0_S0_ffiiiii                                                                        ⋯
│    0.93% │    6.13 ms │    33 │ 185.66 µs ± 1045.81 (  1.67 ‥ 6011.25)   │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│    0.89% │    5.87 ms │  4692 │   1.25 µs ± 0.17   (  0.95 ‥ 3.1)        │ [copy pageable to device memory]                                                                                                     ⋯
│    0.86% │     5.7 ms │  3329 │   1.71 µs ± 0.23   (  1.19 ‥ 2.62)       │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_ES3_IS4_S4_ ⋯
│    0.67% │    4.46 ms │    11 │  405.9 µs ± 1305.73 (  2.38 ‥ 4342.56)   │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│    0.55% │    3.65 ms │  1575 │   2.32 µs ± 0.21   (  1.67 ‥ 3.1)        │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5In ⋯
│    0.43% │    2.85 ms │     4 │ 713.65 µs ± 618.86 ( 10.25 ‥ 1378.3)     │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│    0.25% │    1.67 ms │  1001 │   1.67 µs ± 0.21   (  1.19 ‥ 3.1)        │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_ES3_IS4_ ⋯
│    0.23% │    1.54 ms │   122 │   12.6 µs ± 107.18 (  1.43 ‥ 1185.18)    │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│    0.11% │  758.65 µs │   435 │   1.74 µs ± 0.45   (  1.43 ‥ 6.44)       │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_ES3_I ⋯
│    0.06% │  429.39 µs │   230 │   1.87 µs ± 0.63   (  1.43 ‥ 6.91)       │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_ES ⋯
│    0.05% │  312.33 µs │    75 │   4.16 µs ± 5.81   (  1.43 ‥ 27.89)      │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│    0.03% │  212.43 µs │    89 │   2.39 µs ± 0.4    (  1.91 ‥ 5.01)       │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5In ⋯
│    0.02% │  153.54 µs │    11 │  13.96 µs ± 21.05  (  2.38 ‥ 70.57)      │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│    0.01% │   82.25 µs │     7 │  11.75 µs ± 12.38  (  2.86 ‥ 36.24)      │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│    0.01% │   79.87 µs │     4 │  19.97 µs ± 16.62  (  4.77 ‥ 35.29)      │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│    0.01% │   78.68 µs │    39 │   2.02 µs ± 0.24   (  1.43 ‥ 2.38)       │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5In ⋯
│    0.01% │   35.76 µs │     1 │                                          │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│    0.00% │   21.93 µs │     4 │   5.48 µs ± 0.28   (  5.25 ‥ 5.72)       │ _Z16FLOAT_maxplus_TTILi64ELi32ELi64ELi4ELi4EEvPfS0_S0_ffiiiii                                                                        ⋯
│    0.00% │   16.69 µs │     3 │   5.56 µs ± 0.28   (  5.25 ‥ 5.72)       │ _Z16FLOAT_maxplus_NTILi64ELi32ELi64ELi4ELi4EEvPfS0_S0_ffiiiii                                                                        ⋯
│    0.00% │    4.53 µs │     1 │                                          │ _Z22partial_mapreduce_grid8identity1_8TropicalI7Float32E16CartesianIndicesILi2E5TupleI5OneToI5Int64ES5_IS6_EEES3_ILi2ES4_IS5_IS6_ES5 ⋯
│    0.00% │    4.05 µs │     1 │                                          │ [copy device to pageable memory]                                                                                                     ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

But I failed to profile the second one, it seems that when I try

CUDA.@profile solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);

it used up all the 80G global memory.

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    187920      C   ....10.0-rc2+0.x64.linux.gnu/bin/julia    81084MiB |
+---------------------------------------------------------------------------------------+