When benchmarking an operation that returns a CuArray, the REPL becomes unresponsive for several seconds after the benchmark output has been printed, apparently just to print the resulting array.
This is what I see in the example below:
Computing and printing the product of two arrays is fast, as expected.
Now when I try to benchmark the same operation:
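The original session is not shown here, so the following is only a minimal sketch of the kind of session described, not the exact commands that were run. It assumes the arrays were created with `CUDA.rand` and that the benchmark used `@btime` from BenchmarkTools, which prints a timing line and then returns the result for the REPL to display. The names `a`, `b` and `c` are placeholders.

```julia
using CUDA, BenchmarkTools

# Assumed setup: two 1000×1000 Float64 matrices on the GPU.
a = CUDA.rand(Float64, 1000, 1000)
b = CUDA.rand(Float64, 1000, 1000)

# Plain product: at the REPL, the result is computed and displayed quickly.
c = a * b

# Benchmarked product: @btime prints the timing line, then returns the
# resulting CuArray, which the REPL displays as its own step.
# The delay described in this report happens during that display step.
@btime $a * $b
```

If the hang is tied to displaying the returned array, ending the benchmark line with `;` suppresses that display in the REPL, which may help narrow down where the time is spent.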
Everything is normal up to the point where the header 1000×1000 CuArray{Float64, 2, CUDA.DeviceMemory}: is printed.
It takes 2 minutes to print the resulting array, after 2.1.
The problem also occurs on Julia 1.11 (no startup file, only CUDA and BenchmarkTools installed). It is not always reproducible, but for me it occurs more often than not.