SciML / DiffEqGPU.jl

GPU-acceleration routines for DifferentialEquations.jl and the broader SciML scientific machine learning ecosystem
https://docs.sciml.ai/DiffEqGPU/stable/
MIT License

CompatHelper: bump compat for "CUDAKernels" to "0.2" #104

Closed: github-actions[bot] closed this pull request 3 years ago

github-actions[bot] commented 3 years ago

This pull request changes the compat entry for the CUDAKernels package from 0.1 to 0.1, 0.2.

This keeps the compat entries for earlier versions.

Note: I have not tested your package with this new compat entry. It is your responsibility to make sure that your package tests pass before you merge this pull request.
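
For reference, the change amounts to widening the [compat] bound in DiffEqGPU's Project.toml. A minimal sketch of making the same edit locally (Pkg.compat requires Julia 1.8 or newer, which is later than the Julia versions in use here, so treat it as illustrative):

    # The bot edits the [compat] section of Project.toml so that it reads:
    #   CUDAKernels = "0.1, 0.2"
    # The same edit can be made from the REPL:
    using Pkg
    Pkg.activate(".")                        # activate the package's project environment
    Pkg.compat("CUDAKernels", "0.1, 0.2")    # allow both the 0.1 and 0.2 release series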

ChrisRackauckas commented 3 years ago

@vchuravy do you know what this is? https://buildkite.com/julialang/diffeqgpu-dot-jl/builds/72#6b670c30-fc02-4414-b271-e786faca82ba/284-502

ChrisRackauckas commented 3 years ago

@vchuravy @maleadt can I get some help over the next week finding out what's up with tasks on CUDAKernels v0.2 + CUDA 3.0? I can't figure out which of the two is the issue.

vchuravy commented 3 years ago

For some reason you ended up running GPU code on the CPU. You started running a CPU kernel and then called a GPU function. Very weird.

ChrisRackauckas commented 3 years ago

A percentage of the trajectories are run on the CPU, on a separate task from the GPU (and not using KernelAbstractions.jl).
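
For context, the split being described looks roughly like the following sketch (illustrative only: the problem, the trajectory counts, and the exact EnsembleGPUArray constructor signature are assumptions, and may differ across DiffEqGPU versions):

    using DifferentialEquations, DiffEqGPU

    decay!(du, u, p, t) = (du[1] = -p[1] * u[1]; nothing)
    prob = ODEProblem(decay!, [1.0f0], (0.0f0, 1.0f0), [2.0f0])
    eprob = EnsembleProblem(prob)
    n_cpu, n_gpu = 100, 900   # hypothetical split of the trajectories

    # The CPU share runs as ordinary threaded ensemble code on its own task,
    # without KernelAbstractions, while the GPU share goes through EnsembleGPUArray.
    cpu_task = Threads.@spawn solve(eprob, Tsit5(), EnsembleThreads(); trajectories = n_cpu)
    gpu_sol = solve(eprob, Tsit5(), EnsembleGPUArray(); trajectories = n_gpu)
    cpu_sol = fetch(cpu_task)   # join once the GPU work is done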

vchuravy commented 3 years ago
FATAL ERROR: Symbol "__nv_llabs" not found
#33 at /root/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/KernelAbstractions/8v5HI/src/cpu.jl:22

Yeah, so why are we calling CUDA.abs from the run function for the CPU KA code?

ChrisRackauckas commented 3 years ago

That's what I'm asking you. All it's doing is creating a function that gets run on a task:

https://github.com/SciML/DiffEqGPU.jl/blob/master/src/DiffEqGPU.jl#L181-L188

and then retrieves the result when the GPU stuff is done:

https://github.com/SciML/DiffEqGPU.jl/blob/master/src/DiffEqGPU.jl#L195-L199

It should be the unmodified working CPU-based code that is running there, and KA shouldn't be involved at all.

It used to work fine, though, so I'm not sure what could have changed. Does KA or CUDA.jl modify the global method directly while it's being used for GPUs, so that running a CPU version of the function simultaneously with the GPU version causes this?

maleadt commented 3 years ago

@vchuravy Does KA's CPU back-end use its own abstract interpreter? If so, maybe it doesn't use the correct world bounds, resulting in CUDA.jl's GPU methods being picked up.
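
A generic illustration of what "world bounds" means here (plain Julia using non-public Base internals, not KA or CUDA.jl code):

    f() = 1
    w = Base.get_world_counter()   # capture the current world age
    f() = 2                        # this method exists only in later worlds
    Base.invoke_in_world(w, f)     # returns 1: methods defined after w are invisible
    f()                            # returns 2 when called in the latest world

If a compilation pipeline captures stale or overly wide world bounds, it can end up resolving to methods that should not be visible, which is the kind of mix-up being suggested here.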

ChrisRackauckas commented 3 years ago

@YingboMa @chriselrod I think this has now exposed a FastBroadcast.jl bug?

ChrisRackauckas commented 3 years ago

It looks like the last few tests have now failed because of finalizer issues. Something with GPU GC? We're not doing any manual frees here.
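
If it is GC/finalizer related, one way to probe it between runs is with CUDA.jl's stock memory utilities (a diagnostic sketch, not a fix):

    using CUDA
    GC.gc(true)            # force a full Julia GC so device-array finalizers actually run
    CUDA.reclaim()         # hand freed blocks in the memory pool back to the driver
    CUDA.memory_status()   # print how much device memory is currently in use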

ChrisRackauckas commented 3 years ago

I can't seem to reproduce this GC bug locally?

maleadt commented 3 years ago

Cyclops was being used by other users. I removed it from CI, once again. I'll restart the jobs.

maleadt commented 3 years ago

It seems to throw CUDA_ERROR_LAUNCH_FAILED::cudaError_enum = 0x000002cf now, but on unrelated API calls, so it's hard to trace back where it actually came from. Maybe try running under compute-sanitizer --show-backtrace=no --launch-timeout=0?

ChrisRackauckas commented 3 years ago

I'm not quite sure what you're referring to:

ERROR: unknown option `--compute-sanitizer --show-backtrace=no --launch-timeout=0`

Julia has exited.

Is there a way to modify the arguments of the Buildkite script so this is done on CI?

maleadt commented 3 years ago

compute-sanitizer is a CUDA tool, not a Julia option. I've pushed a job definition.
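
That is, compute-sanitizer wraps the whole Julia process rather than being a flag passed to julia; roughly (project path and test entry point are placeholders):

    # From a shell:
    #   compute-sanitizer --show-backtrace=no --launch-timeout=0 julia --project=. test/runtests.jl
    # or, equivalently, spawned from Julia itself:
    run(`compute-sanitizer --show-backtrace=no --launch-timeout=0 julia --project=. test/runtests.jl`)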

ChrisRackauckas commented 3 years ago

You fixed it, good job.