Open ArrogantGao opened 1 year ago
Could the issue be related to the CuStream? CUDA.synchronize() takes a CuStream as its input argument, so you may need to pass the CUDA.jl CuStream to the C code.
In your current code, the stream id seems to be the default value 0.
julia> @time (mul!(b, a, a); CUDA.synchronize(CUDA.context()))
0.093036 seconds (2 allocations: 48 bytes)
This one works. Using the CuContext as the input synchronizes all streams in the context, which is a heavy API call.
It would be great if the stream id could be an input.
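To make the difference concrete, here is a minimal sketch of stream-level versus context-level synchronization in CUDA.jl. It assumes a working GPU and uses a plain broadcast in place of the tropical mul! call. The commented-out ccall at the end is purely hypothetical: `libexample` and `example_gemm` are placeholder names, and it assumes the C side accepts a cudaStream_t argument.

```julia
using CUDA

a = CUDA.rand(Float32, 1024, 1024)
b = similar(a)

# Stream used by the current Julia task. In recent CUDA.jl versions this is
# a task-local stream, so it may not be the legacy default stream 0.
s = CUDA.stream()

# Launch some work on the task-local stream, then wait for that stream only.
b .= a .+ 1f0
CUDA.synchronize(s)

# Heavier alternative: wait for every stream in the current context.
CUDA.synchronize(CUDA.context())

# Hypothetical sketch of forwarding the stream to C code (placeholder names,
# assumes the C function takes a cudaStream_t as its last argument):
# ccall((:example_gemm, libexample), Cvoid,
#       (CuPtr{Float32}, CuPtr{Float32}, CUDA.CUstream), b, a, s)
```

If the C kernel is launched on stream 0 while Julia synchronizes only its task-local stream, the synchronize call may return before the kernel finishes, which would be consistent with the timings discussed here.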
The current benchmark in the README does not look good. When we read a benchmark, we always look at the minimum time, because it reflects the true performance.
julia> using CuTropicalGEMM
julia> @benchmark CUDA.@sync $a * $a
BenchmarkTools.Trial: 93 samples with 4 evaluations.
Range (min … max): 6.653 μs … 158.961 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.535 ms ┊ GC (median): 0.00%
Time (mean ± σ): 13.499 ms ± 15.867 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█
▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
6.65 μs Histogram: frequency by time 13.5 ms <
Memory estimate: 256 bytes, allocs estimate: 7.
BenchmarkTools is not working correctly:
Compared with the results obtained directly from the C-CUDA tests, the result of @benchmark is correct.