**Closed** · nikopj closed this issue 1 year ago
Note that the hanging issue described in #142 is still present with `TaskPoolEx`, but at least it runs!
I was very confused at first, but it appears the actual error is masked by the catch-block handling in https://github.com/JuliaGPU/CUDA.jl/blob/v4.4.0/src/pool.jl#L472-L474, which itself errors when trying to print the stacktrace. Can you change that to `rethrow()` instead, or remove the catch block entirely, to see what the root error is?
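To illustrate the suggestion: a `catch` block that tries to do work (like printing a stacktrace) can fail and hide the original exception, whereas `rethrow()` re-raises it intact. A minimal sketch in plain Julia (function names are illustrative, not the actual CUDA.jl source):

```julia
# A failing operation standing in for the real allocation path.
function risky()
    error("root cause")
end

# Instead of catching and printing (which can itself fail and mask
# the error), rethrow() re-raises the current exception with its
# original backtrace.
function call_with_rethrow(f)
    try
        f()
    catch
        rethrow()
    end
end

caught = try
    call_with_rethrow(risky)
    nothing
catch err
    err
end

println(sprint(showerror, caught))
```

With this pattern the caller sees the root `ErrorException` rather than a secondary failure from the error-printing code.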
I did a good deal more digging on this, and after asking around it seems to be an issue on the CUDA.jl side. Will update this issue with more details as I get them.
This appears to be fixed on my end now with the upgraded CUDA version!
There seems to be a bug when using a parallel dataloader and transferring to the GPU. ~~It's a bit difficult to reproduce / not consistent every run (because of multithreading, I suppose). It seems to involve heavy FileIO + CUDA in a for loop.~~ I've narrowed it down to using `eachobsparallel`; the failure is a function of `batchsize` and the number of threads. If the batchsize is not sizeably larger than the thread count (~2x), then the CUDA free error pops up within 1-3 dataloops.

`dl_test.jl` (below) produces an error according to this table:

(table omitted in this copy; rows were indexed by `nthreads`)

This is using a 16-core CPU with 64 GB of memory.
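For context, the producer/consumer pattern behind a parallel loader like `eachobsparallel` can be approximated in plain Julia with worker tasks feeding a buffered `Channel` sized like a batch — which is where the interplay between `batchsize` and thread count comes from. A minimal sketch, with no MLUtils or CUDA dependency and illustrative names only:

```julia
# nworkers tasks push "observations" 1..nobs into a channel whose
# buffer is one batch deep; the consumer drains it into batches.
function parallel_batches(nobs, batchsize, nworkers)
    ch = Channel{Int}(batchsize)          # buffer sized like a batch
    counter = Threads.Atomic{Int}(0)      # shared observation index

    workers = map(1:nworkers) do _
        Threads.@spawn begin
            while true
                i = Threads.atomic_add!(counter, 1) + 1
                i > nobs && break
                put!(ch, i)               # blocks when the buffer is full
            end
        end
    end

    # Close the channel once every worker has finished, so the
    # consumer's iteration below terminates.
    Threads.@spawn begin
        foreach(wait, workers)
        close(ch)
    end

    batches = Vector{Vector{Int}}()
    buf = Int[]
    for x in ch                           # ends when ch is closed and drained
        push!(buf, x)
        if length(buf) == batchsize
            push!(batches, buf)
            buf = Int[]
        end
    end
    isempty(buf) || push!(batches, buf)
    return batches
end

batches = parallel_batches(16, 4, 2)
```

If the channel buffer (batchsize) is small relative to the number of workers, producers spend most of their time blocked on `put!`, which is consistent with the observed sensitivity to the batchsize-to-threads ratio.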
(`dl_test.jl`)

Here's the accompanying error (for example, when I run `julia --project -t 8 dl_test.jl 8`). This same error repeats many, many times.

Here's the output of `CUDA.versioninfo()` for reference:

And the package versions I'm using (`] status`):