Closed white-alistair closed 1 month ago
Use the one in Lux, which is faster; I also fixed this bug there: https://lux.csail.mit.edu/stable/api/Lux/autodiff#Lux.batched_jacobian
To address the issue: ForwardDiff's maximum chunk size is 12, but `NTuple` isn't specialized all the way up to size 12, so you get a dynamic dispatch and the CUDA code fails to compile. Lux forces the maximum chunk size to 8.
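As a minimal sketch of the workaround described above (capping the chunk size rather than letting ForwardDiff pick it), you can pass an explicit `ForwardDiff.Chunk` through a config object. This CPU example just illustrates the chunk-size API; the function `f` and the size `8` are illustrative choices, not taken from the issue:

```julia
using ForwardDiff

f(x) = sum(abs2, x)
x = rand(32)

# By default ForwardDiff would pick a chunk size up to 12 for this input;
# forcing it to 8 mirrors what Lux does to avoid the dynamic dispatch.
cfg = ForwardDiff.GradientConfig(f, x, ForwardDiff.Chunk{8}())
g = ForwardDiff.gradient(f, x, cfg)
```

Here `g` should equal `2 .* x`, the gradient of `sum(abs2, x)`; the cap only changes how many partials are propagated per forward pass, not the result.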
Hi, I appreciate that this package is experimental, but I came across the following strange behaviour on a V100 GPU: above a certain size in the first dimension, the batched Jacobian no longer works on the GPU, while it continues to work on the CPU. The batch dimension doesn't appear to make a difference.
MWE
Stack trace