github-actions[bot] closed this issue 3 years ago
@vchuravy do you know what this is? https://buildkite.com/julialang/diffeqgpu-dot-jl/builds/72#6b670c30-fc02-4414-b271-e786faca82ba/284-502
@vchuravy @maleadt can I get some help over the next week finding out what's up with tasks on CUDAKernels v0.2 + CUDA 3.0? I can't figure out which of the two is the issue.
For some reason you ended up running GPU code on the CPU. You started running a CPU kernel and then called a GPU function. Very weird.
A percentage of the trajectories are run on the CPU on a separate task from the GPU (and not using KernelAbstractions.jl).
```
FATAL ERROR: Symbol "__nv_llabs" not found
#33 at /root/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/KernelAbstractions/8v5HI/src/cpu.jl:22
```
Yeah, so why are we calling `CUDA.abs` from the run function for the CPU KA code?
That's what I'm asking you. All it's doing is creating a function that gets run on a task:
https://github.com/SciML/DiffEqGPU.jl/blob/master/src/DiffEqGPU.jl#L181-L188
and then retrieves the result when the GPU stuff is done:
https://github.com/SciML/DiffEqGPU.jl/blob/master/src/DiffEqGPU.jl#L195-L199
It should be the unmodified working CPU-based code that is running there, and KA shouldn't be involved at all.
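The pattern being described is roughly the following (a minimal sketch with illustrative names, not the actual DiffEqGPU.jl source; `solve_cpu` and `solve_gpu` are hypothetical stand-ins for the real solver calls):

```julia
# Minimal sketch of the CPU-fallback pattern: a fraction of the
# trajectories is solved on the CPU in a separate task while the
# GPU batch runs, and the results are collected afterwards.
function solve_batch_split(prob, cpu_trajectories, gpu_trajectories)
    # Spawn the plain CPU solve on its own task; KernelAbstractions
    # is not involved in this code path at all.
    cpu_task = Threads.@spawn solve_cpu(prob, cpu_trajectories)

    # Run the GPU batch on the current task.
    gpu_sols = solve_gpu(prob, gpu_trajectories)

    # Retrieve the CPU results once the GPU work is done.
    cpu_sols = fetch(cpu_task)
    return vcat(cpu_sols, gpu_sols)
end
```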
It used to work fine though, so I'm not sure what could've changed. Does KA or CUDA.jl change the global method directly while it's being used for GPUs, so that if a CPU version of the function is run simultaneously to the GPU version this will happen?
@vchuravy Does KA's CPU back-end use its own abstract interpreter? If so, maybe it doesn't use the correct world bounds, resulting in CUDA.jl's GPU methods being picked up.
@YingboMa @chriselrod I think this exposed a FastBroadcast.jl bug now?
It looks like the last few tests have now failed because of finalizer issues. Something with GPU GC? We're not doing any manual frees here.
I can't seem to reproduce this GC bug locally?
Cyclops was being used by other users. I removed it from CI, once again. I'll restart the jobs.
It seems to throw `CUDA_ERROR_LAUNCH_FAILED::cudaError_enum = 0x000002cf` now, but on unrelated API calls, so it's hard to trace this back to where it came from. Maybe try running under `compute-sanitizer --show-backtrace=no --launch-timeout=0`?
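To be explicit about the invocation: `compute-sanitizer` wraps the whole process, so the flags go to it rather than to Julia. Something along these lines (the `Pkg.test` part is just an illustrative way to run the test suite):

```shell
# compute-sanitizer is a CUDA tool that wraps the target process;
# the --show-backtrace/--launch-timeout flags belong to it, not to julia.
compute-sanitizer --show-backtrace=no --launch-timeout=0 \
    julia --project -e 'using Pkg; Pkg.test("DiffEqGPU")'
```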
I'm not quite sure what you're referring to:
```
ERROR: unknown option `--compute-sanitizer --show-backtrace=no --launch-timeout=0`
Julia has exited.
```
Is there a way to modify the arguments of the Buildkite script so this is done on CI?
`compute-sanitizer` is a CUDA tool, not a Julia option. I've pushed a job definition.
You fixed it, good job.
This pull request changes the compat entry for the CUDAKernels package from `0.1` to `0.1, 0.2`. This keeps the compat entries for earlier versions.
Note: I have not tested your package with this new compat entry. It is your responsibility to make sure that your package tests pass before you merge this pull request.
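Concretely, the resulting `[compat]` section in `Project.toml` would read something like (other entries elided):

```toml
[compat]
CUDAKernels = "0.1, 0.2"
```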