AlexLewandowski closed this issue 5 months ago.
Thanks for the report. I can't reproduce this locally, or at least not to the extent you're seeing (only a 280->310us regression). That makes it much harder to pinpoint what exactly has slowed down. Since you see a much more pronounced slowdown, can you isolate this problem to either the CUDA.jl operation that has regressed, or the commit that did so?
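One way to narrow a regression like this down is to time individual GPU operations and look at per-kernel timings. The snippet below is a minimal sketch of that approach; the specific operations benchmarked here (broadcast, reduction) are just illustrative guesses, not the ops known to have regressed:

```julia
using CUDA, BenchmarkTools

x = CUDA.rand(Float32, 1024, 1024)

# Time candidate operations individually; CUDA.@sync ensures the GPU
# work is finished before the timing stops.
@btime CUDA.@sync $x .+ 1f0;
@btime CUDA.@sync sum($x; dims=1);

# CUDA.jl's integrated profiler shows per-kernel timings, which helps
# spot which kernel got slower between versions.
CUDA.@profile sum(x .+ 1f0)
```

Running the same measurements under both CUDA.jl versions (in separate environments) should point at the operation responsible.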
I took the time to bisect this because it's causing my model training to completely stall. The performance regression seems to be #2290, but it also seems like #2327 (merged but not released) fixes it.
I have the same issue after the upgrade. Please let me know if you need any other information; I have attached a Pluto file:
https://gist.github.com/pawbz/36a915406266df540187049c1e0720b4
@AlexLewandowski @pawbz Can you try the CUDA.jl master branch?
I have tried; no change, unfortunately. Thanks for the quick reply.
Hey @pawbz, looking at your screenshot, I suspect your CUDA version did not update. Can you show the output of Pkg.status() in your notebook? Also make sure you restart the Pluto instance so that the correct version of CUDA actually gets loaded. You might also want to do this in a temporary environment by adding Pkg.activate(temp=true) right after you import Pkg, to avoid cluttering up your default environment.
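Putting the steps above together, a minimal notebook preamble for testing the master branch in a throwaway environment might look like this (the `rev="master"` keyword is one way to track a branch with Pkg):

```julia
import Pkg
Pkg.activate(temp=true)              # throwaway environment, cleaned up automatically
Pkg.add(name="CUDA", rev="master")   # track the CUDA.jl master branch
Pkg.status()                         # confirm which CUDA version is actually in use
```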
I just compared the original benchmark between v5.2.0 and current master:
@btime get_grads($m, $xs);
# v5.2.0: 230.077 μs (585 allocations: 26.28 KiB)
# master: 254.714 μs (889 allocations: 33.66 KiB)
The bulk of the regression is now gone. There remains a ~10% slowdown, consistent with @maleadt's result, along with increased allocations. Is this an expected impact of v5.3.0, or is it worth keeping the issue open?
Using Pkg.activate(temp=true), here is an updated screenshot after restarting Pluto each time. So basically, we see around 530 µs for both master and v5.2.0, and 1.2 ms for v5.3.0. Thanks for the input earlier.
Thanks for confirming. So this was fixed by https://github.com/JuliaGPU/CUDA.jl/pull/2327.
> There remains a ~10% slowdown, consistent with @maleadt's result, along with increased allocations. Is this an expected impact of v5.3.0, or is it worth keeping the issue open?
Unexpected, but probably not worth keeping the issue open over. If you can isolate this to the operation that has regressed, please open a new issue.
Describe the bug
Performance degradation on CUDA#v5.3.0 when taking gradients using Flux/Zygote.
To reproduce
The Minimal Working Example (MWE) for this bug:
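The MWE code itself is not shown above; a hypothetical reproduction consistent with the benchmark call `get_grads(m, xs)` might look like the sketch below. The model architecture and array sizes here are assumptions for illustration, not the original code:

```julia
using CUDA, Flux, BenchmarkTools

# Hypothetical model and input; the original MWE's shapes are unknown.
m  = Chain(Dense(64 => 64, relu), Dense(64 => 1)) |> gpu
xs = CUDA.rand(Float32, 64, 128)

# Gradient of a scalar loss with respect to the model parameters.
get_grads(m, xs) = Flux.gradient(m -> sum(m(xs)), m)

@btime get_grads($m, $xs);
```

Comparing this benchmark under Manifests pinned to v5.2.0 and v5.3.0 reproduces the timings discussed above.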
Manifest file for CUDAv5.3.0: https://gist.github.com/AlexLewandowski/e1b62445fb814d2adf1a7b87ff7d6a3b
Manifest file for CUDAv5.2.0: https://gist.github.com/AlexLewandowski/91fe5e60893039c1c45e2a317d1d7714
Expected behavior
Performance to be unaffected by CUDA.jl version upgrade.
Version info
Details on Julia:
Details on CUDA#v5.3.0:
Details on CUDA#v5.2.0:
Additional context
I upgraded to v5.3.0 because I needed to take a gradient of a sorted CuArray with dims as a keyword. Not sure if it's the version upgrade itself or some combination of bad drivers, but I thought it was worth raising as an issue.
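For context, the feature that motivated the upgrade (differentiating through a sort with a dims keyword) can be exercised with a sketch like the following; whether this exact call is supported on a given CUDA.jl version may vary:

```julia
using CUDA, Zygote

x = CUDA.rand(Float32, 4, 4)

# Gradient through a column-wise sort; sorting is a permutation, so the
# gradient of sum(sort(x; dims=1)) should be an array of ones.
g, = Zygote.gradient(x -> sum(sort(x; dims=1)), x)
```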