SciML / DiffEqGPU.jl

GPU-acceleration routines for DifferentialEquations.jl and the broader SciML scientific machine learning ecosystem
https://docs.sciml.ai/DiffEqGPU/stable/
MIT License

Add MetalKernels #241

Closed · vchuravy closed this 1 year ago

vchuravy commented 1 year ago

I would advocate for not having the backend packages in the Project.toml; they are all rather large, and loading them unconditionally is a pain (and not supported on 1.6).

Instead, we use Requires.jl on 1.6–1.8 and package extensions on 1.9.
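
For illustration only (hypothetical module and file names, not the PR's actual code): the backend glue would be declared as a package extension under [weakdeps]/[extensions] in Project.toml for 1.9+, with a Requires.jl hook in __init__ covering 1.6–1.8, roughly like this:

module DiffEqGPU

using Requires  # only needed for Julia < 1.9

function __init__()
    @static if !isdefined(Base, :get_extension)
        # Pre-1.9 fallback: load the CUDA-specific glue lazily, only once the
        # user has loaded CUDA.jl (identified by its package UUID). On 1.9+
        # the package-extension mechanism does this automatically.
        @require CUDA="052768ef-5323-5732-b1bb-66c8b64840ba" include("../ext/CUDAExt.jl")
    end
end

end # module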

Also adds the experimental MetalKernels package + CI

vchuravy commented 1 year ago

@tgymnich this now shows the failure that you suspected was due to a synchronization error. cc: @maleadt

vchuravy commented 1 year ago

@ChrisRackauckas from my side, the big question is: can we get rid of the hard dependency on CUDA? I am also not sure what needs to happen for the other tests. Presumably we would need lufact! on the other platforms?

ChrisRackauckas commented 1 year ago

I think it's fine to do a breaking v1.0 to make CUDA.jl no longer a hard dep.

ChrisRackauckas commented 1 year ago

@utkarsh530 what's required to finish merging the metal stuff?

ChrisRackauckas commented 1 year ago

I assume from the release notes this may need KA v0.9?

vchuravy commented 1 year ago

Personally I would wait until KA 0.9 and the backend packages are all ready. The change here should be easy and will clean up a couple of things.

utkarsh530 commented 1 year ago

Okay, sounds good. I was getting things ready before the submission. All the tests pass now, so it should be a simple change.

utkarsh530 commented 1 year ago

@vchuravy, any status?

maleadt commented 1 year ago

Ah, I just wasted an hour updating DiffEqGPU only to notice that this otherwise unrelated PR already contains the necessary KA.jl compatibility changes 🤦

@utkarsh530 What's the status of this PR?

maleadt commented 1 year ago

I can reproduce the exception, and locally it doesn't happen on -g2 either. Annoying. It seems to come from a literal call to error() somewhere.

It happens here: https://github.com/SciML/DiffEqGPU.jl/blob/dddcb594ce054c0677bc1b18fdabca2fc0c2eaa9/src/perform_step/gpu_vern7_perform_step.jl#L161

thread 2: dt=1.000000
thread 1: dt=1.000000
thread 2: dt=1.940263
thread 1: dt=1.940263
thread 2: dt=3.079406
thread 1: dt=3.079406
thread 2: dt=5.066259
thread 1: dt=5.066259
thread 2: dt=5.989542
thread 1: dt=5.989542
thread 2: dt=0.944200
thread 1: dt=0.944200
thread 2: dt=1.945940
thread 1: dt=1.945940
thread 2: dt=3.109867
thread 1: dt=3.109867
thread 2: dt=0.000000
thread 1: dt=0.000000

FWIW, DiffEqGPU really shouldn't be calling error; KA.jl probably needs to get an output/error/assertion macro.
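
For illustration, a minimal KernelAbstractions-style sketch (hypothetical kernel and names, not the PR's actual integrator code) of the usual workaround: write a return code to device memory instead of calling error() inside the kernel, and check it on the host after the launch:

using KernelAbstractions

@kernel function step_kernel!(us, retcodes, dts)
    i = @index(Global)
    dt = dts[i]
    if dt <= zero(dt)
        # Record a failure code instead of calling error(), which lowers to a GPU trap.
        retcodes[i] = Int32(1)
    else
        us[i] += dt  # placeholder for the actual step computation
    end
end

# Host side (CPU backend shown; GPU backends follow the same pattern):
backend = CPU()
us = zeros(Float32, 8); dts = rand(Float32, 8); retcodes = zeros(Int32, 8)
step_kernel!(backend, 64)(us, retcodes, dts; ndrange = length(us))
KernelAbstractions.synchronize(backend)
any(!=(0), retcodes) && @warn "some trajectories hit dt <= 0"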

maleadt commented 1 year ago

Looks like there's some divergence within the GPUAV7I step! function:

tmp = (45.000000,0.000000)
btilde1 = 0.002547
k1 = (0.000000,-10.000000)
btilde4 = -0.009658
k4 = (-1.633333,-10.000000)
btilde5 = 0.042065
k5 = (-4.555000,-10.000000)
btilde6 = -0.066682
k6 = (-6.095100,-10.000000)
btilde7 = 0.265010
k7 = (-8.840014,-10.000000)
btilde8 = -0.294303
k8 = (-9.250006,-10.000000)
btilde9 = 0.081317
k9 = (-9.999984,-10.000000)
btilde10 = -0.020295
k10 = (-9.999962,-10.000000)
resulting tmp = (-0.000001,-0.000000)

vs

tmp = (45.000000,0.000000)
btilde1 = 0.002547
k1 = (0.000000,-10.000000)
btilde4 = -0.009658
k4 = (-1.633333,-10.000000)
btilde5 = 0.042065
k5 = (-4.554998,-10.000000)
btilde6 = -0.066682
k6 = (-6.095097,-10.000000)
btilde7 = 0.265010
k7 = (-8.840016,-10.000000)
btilde8 = -0.294303
k8 = (-9.249968,-10.000000)
btilde9 = 0.081317
k9 = (-9.999983,-10.000000)
btilde10 = -0.020295
k10 = (-9.999996,-10.000000)
resulting tmp = (-0.000013,-0.000000)

The generated IR/PTX assembly is identical, though. I reduced this to the --device-debug flag passed to ptxas, which also disables optimizations.

maleadt commented 1 year ago

All of the errors are relatively (acceptably?) small though:

julia> -9.999962f0 ≈ -9.999996f0
true

julia> -9.999984f0 ≈ -9.999983f0
true

julia> -9.250006f0 ≈ -9.249968f0
true

julia> -8.840014f0 ≈ -8.840016f0
true

julia> -6.095100f0 ≈ -6.095097f0
true

julia> -4.555000f0 ≈ -4.554998f0
true

I guess this is an instance of catastrophic cancellation? Especially with the generated code chock full of f(mul|add) contract (i.e. fastmath, from calling Base.muladd), I don't think you can expect deterministic results.
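
As a small illustration (made up here, not taken from the PR's kernels) of why contraction alone can perturb the low-order bits, compare a separately rounded multiply-add with a fused one:

julia> a, b, c = 1.0f0/3, 3.0f0, -1.0f0;

julia> a * b + c     # multiply rounded to Float32 before the add
0.0f0

julia> fma(a, b, c)  # single rounding at the end, as with muladd contraction
2.9802322f-8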

utkarsh530 commented 1 year ago

I am waiting for the tagged release of AMDGPU.jl. BTW, why does the registry fetch Metal.jl 0.2.0 on 1.9-Nightly?

Yes, I think the errors seem acceptable; I'll increase the tolerance and merge it. But what seems a bit strange is that everything passes locally, and I'm not sure why that's happening. And why are the values changing now? Did we make some consequential changes in GPUCompiler.jl?

maleadt commented 1 year ago

But what seems a bit strange is that everything passes locally, and I'm not sure why that's happening.

FWIW, I could reproduce locally using Julia 1.8 and latest master of everything.

And why are the values changing now?

Yeah, not sure. It probably also depends on the CUDA compiler used, which may have gotten bumped by the CUDA.jl update. I also recently upgraded the CI system's CUDA version.

maleadt commented 1 year ago

Yes, I think the errors seem acceptable; I'll increase the tolerance and merge it.

The issue is an error() thrown from within the kernel; increasing the tolerance won't fix that, right?

BTW, why does the registry fetch Metal.jl 0.2.0 on 1.9-Nightly?

I'm not sure; Metal.jl itself runs 1.9 CI, so it is compatible.

utkarsh530 commented 1 year ago

I got this for the latest CI test:

https://buildkite.com/julialang/diffeqgpu-dot-jl/builds/677#018777f6-2cf2-4b31-a11e-3e2d9dcbc127/627-1328

maleadt commented 1 year ago

But that's because you're running with -g2, which disables optimizations as per above.

utkarsh530 commented 1 year ago

LGTM now. I ran the AMDGPU tests locally, and they all pass now. @ChrisRackauckas, let's merge it?