Closed: vchuravy closed this 1 year ago
@tgymnich this now shows the failure that you suspected is due to a synchronization error. cc: @maleadt
@ChrisRackauckas from my side the big question is: can we get rid of the hard dependency on CUDA? I am also not sure what needs to happen for the other tests. Presumably we would need lufact! on the other platforms?
I think it's fine to do a breaking v1.0 to make CUDA.jl no longer a hard dep
@utkarsh530 what's required to finish merging the metal stuff?
I assume from the release notes this may need KA v0.9?
Personally I would wait until KA 0.9 and the backend packages are all ready. The change here should be easy and will clean up a couple of things.
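For context, a hedged sketch of the main user-facing KA 0.9 change as I understand it (a self-contained toy kernel, not DiffEqGPU's actual code): kernels are instantiated from a backend object instead of a device, and synchronization happens per backend rather than per event.

```julia
using KernelAbstractions

@kernel function axpy_kernel!(y, a, @Const(x))
    i = @index(Global)
    @inbounds y[i] += a * x[i]
end

x = rand(Float32, 1024)
y = zeros(Float32, 1024)

# KA 0.9 style: pick a backend (CPU(), CUDABackend(), ROCBackend(),
# MetalBackend(), ...), instantiate the kernel from it, launch with an
# ndrange, and synchronize the backend afterwards instead of waiting on events.
backend = CPU()
axpy_kernel!(backend)(y, 2f0, x; ndrange = length(y))
KernelAbstractions.synchronize(backend)
```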
Okay, sounds good. I was getting things ready before the submission. All the tests pass now, so it should be a simple change.
@vchuravy, any status?
Ah, I just wasted an hour updating DiffEqGPU only to notice that this otherwise unrelated PR already contains the necessary KA.jl compatibility changes 🤦
@utkarsh530 What's the status of this PR?
I can reproduce the exception, and locally it doesn't happen on -g2 either. Annoying. It seems to come from a literal call to error() somewhere.
It happens here: https://github.com/SciML/DiffEqGPU.jl/blob/dddcb594ce054c0677bc1b18fdabca2fc0c2eaa9/src/perform_step/gpu_vern7_perform_step.jl#L161
thread 2: dt=1.000000
thread 1: dt=1.000000
thread 2: dt=1.940263
thread 1: dt=1.940263
thread 2: dt=3.079406
thread 1: dt=3.079406
thread 2: dt=5.066259
thread 1: dt=5.066259
thread 2: dt=5.989542
thread 1: dt=5.989542
thread 2: dt=0.944200
thread 1: dt=0.944200
thread 2: dt=1.945940
thread 1: dt=1.945940
thread 2: dt=3.109867
thread 1: dt=3.109867
thread 2: dt=0.000000
thread 1: dt=0.000000
FWIW, DiffEqGPU really shouldn't be calling error(); KA.jl probably needs to get an output/error/assertion macro.
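As an illustration of a pattern that avoids error() in device code entirely, here is a hedged sketch (names are made up, not DiffEqGPU's actual kernel): record a per-trajectory failure flag and let the host check it after synchronizing.

```julia
using KernelAbstractions

@kernel function checked_step!(u, dt, failed)
    i = @index(Global)
    if dt[i] <= zero(eltype(dt))
        @inbounds failed[i] = true   # flag the failure instead of calling error()
    else
        @inbounds u[i] += dt[i]      # stand-in for the real perform_step! body
    end
end

u      = zeros(Float32, 8)
dt     = ones(Float32, 8); dt[3] = 0f0
failed = fill(false, 8)

backend = CPU()                      # or CUDABackend(), ROCBackend(), MetalBackend()
checked_step!(backend)(u, dt, failed; ndrange = length(u))
KernelAbstractions.synchronize(backend)
any(failed) && @warn "dt <= 0 in some trajectories" findall(failed)
```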
Looks like there's some divergence within the GPUAV7I step! function:
tmp = (45.000000,0.000000)
btilde1 = 0.002547
k1 = (0.000000,-10.000000)
btilde4 = -0.009658
k4 = (-1.633333,-10.000000)
btilde5 = 0.042065
k5 = (-4.555000,-10.000000)
btilde6 = -0.066682
k6 = (-6.095100,-10.000000)
btilde7 = 0.265010
k7 = (-8.840014,-10.000000)
btilde8 = -0.294303
k8 = (-9.250006,-10.000000)
btilde9 = 0.081317
k9 = (-9.999984,-10.000000)
btilde10 = -0.020295
k10 = (-9.999962,-10.000000)
resulting tmp = (-0.000001,-0.000000)
vs
tmp = (45.000000,0.000000)
btilde1 = 0.002547
k1 = (0.000000,-10.000000)
btilde4 = -0.009658
k4 = (-1.633333,-10.000000)
btilde5 = 0.042065
k5 = (-4.554998,-10.000000)
btilde6 = -0.066682
k6 = (-6.095097,-10.000000)
btilde7 = 0.265010
k7 = (-8.840016,-10.000000)
btilde8 = -0.294303
k8 = (-9.249968,-10.000000)
btilde9 = 0.081317
k9 = (-9.999983,-10.000000)
btilde10 = -0.020295
k10 = (-9.999996,-10.000000)
resulting tmp = (-0.000013,-0.000000)
The generated IR/PTX assembly is identical though. Reduced this to the --device-debug flag to ptxas, which also disables optimizations.
All of the errors are relatively (acceptably?) small though:
julia> -9.999962f0 ≈ -9.999996f0
true
julia> -9.999984f0 ≈ -9.999983f0
true
julia> -9.250006f0 ≈ -9.249968f0
true
julia> -8.840014f0 ≈ -8.840016f0
true
julia> -6.095100f0 ≈ -6.095097f0
true
julia> -4.555000f0 ≈ -4.554998f0
true
I guess this is an instance of catastrophic cancellation? Especially with the generated code chock full of f(mul|add) contract (i.e. fastmath, from calling Base.muladd), I don't think you can expect deterministic results.
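For reference, a minimal sketch (illustrative values, not numbers from the kernel) of how a contracted multiply-add can legitimately change the low-order bits relative to a separate multiply and add; the explicit fma stands in for what the contract flag lets ptxas do:

```julia
x = 1f0 + 2f0^-12

a = x * x - 1f0       # product rounded to Float32, then subtracted
b = fma(x, x, -1f0)   # fused: a single rounding keeps the low-order bits

a == b                # false: the results differ in the last bits
a ≈ b                 # true: well inside isapprox's default rtol of sqrt(eps(Float32))
```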
I am waiting for the tagged release of AMDGPU.jl. BTW, why does the registry fetch Metal.jl 0.2.0 on 1.9-Nightly?
Yes, I think the errors seem acceptable; I'll increase the tolerance and merge it. But what seems a bit strange is that everything passes locally, and I am not sure why that is happening. And why are the values changing now? Did we make some relevant changes in GPUCompiler.jl?
But what seems a bit strange is that everything passes locally, and I am not sure why that is happening.
FWIW, I could reproduce locally using Julia 1.8 and latest master of everything.
And why are the values changing now?
Yeah, not sure. It probably also depends on the CUDA compiler used, which may have gotten bumped by the CUDA.jl update. I also recently upgraded the CI system's CUDA version.
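If it helps narrow that down, a quick way to check which toolkit (and hence which ptxas) a given environment is actually using:

```julia
using CUDA

CUDA.versioninfo()       # full report: driver, runtime/toolkit, and devices
CUDA.runtime_version()   # just the toolkit version CUDA.jl compiles against
```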
Yes, I think the errors seem acceptable; I'll increase the tolerance and merge it.
The issue is an error() thrown from within the kernel; increasing the tolerance won't fix that, right?
BTW, why does the registry fetch Metal.jl 0.2.0 on 1.9-Nightly?
I'm not sure, Metal.jl itself runs 1.9 CI, so it is compatible.
I got this for the latest CI test:
But that's because you're running with -g2, disabling optimizations as per above.
LGTM now. I ran the AMD GPU tests locally, and they all pass now. @ChrisRackauckas, let's merge it?
I would advocate for not having the backend packages in the Project.toml; they are all rather large, and loading them unconditionally is a pain (and not supported on 1.6).
Instead we use Requires on 1.6-1.8 and package extensions on 1.9 (see the sketch below).
This also adds the experimental MetalKernels package + CI.
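For reference, a hedged sketch of that conditional-loading layout (the extension name and file path are illustrative): each backend package gets a [weakdeps]/[extensions] entry in Project.toml, and a Requires-based fallback covers Julia 1.6-1.8, where package extensions don't exist.

```julia
# src/DiffEqGPU.jl (sketch): on Julia versions without package extensions,
# fall back to Requires; on 1.9+ the extension loads automatically.
if !isdefined(Base, :get_extension)
    using Requires
end

@static if !isdefined(Base, :get_extension)
    function __init__()
        # Same UUID as the CUDA entry under [weakdeps] in Project.toml.
        @require CUDA="052768ef-5323-5732-b1bb-66c8b64840ba" include(joinpath(@__DIR__, "..", "ext", "DiffEqGPUCUDAExt.jl"))
    end
end
```

On 1.9, DiffEqGPUCUDAExt is loaded automatically once the user also loads CUDA.jl, so the CUDA-specific code never loads unconditionally.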