vincentmolin opened this issue 2 years ago
Let's add these to the tests alongside the activations. I suppose we would want to cover a decent gamut of layers as well, so we can have the tests in Flux as extensions of https://github.com/FluxML/Flux.jl/blob/0b7e1b61addbe245e4a565d522df334ce0d41584/test/cuda/layers.jl#L84
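A rough sketch of what such a test could look like, comparing CPU and GPU gradients across a few activations (the layer choice, activation list, and tolerance below are assumptions, not the existing Flux test harness):

```julia
using Flux, CUDA, Zygote, Test

@testset "activation gradients on GPU" begin
    x = randn(Float32, 3, 4)
    for act in (relu, elu, leakyrelu, tanh)
        m_cpu = Dense(3 => 2, act)
        m_gpu = m_cpu |> gpu
        # Gradient of a simple scalar loss w.r.t. the layer, on CPU and GPU.
        g_cpu = gradient(m -> sum(m(x)), m_cpu)[1]
        g_gpu = gradient(m -> sum(m(x |> gpu)), m_gpu)[1]
        @test Array(g_gpu.weight) ≈ g_cpu.weight atol = 1f-4
    end
end
```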
Thanks for the report, this is an interesting one. The chain points to https://github.com/FluxML/Zygote.jl/blob/v0.6.34/src/lib/broadcast.jl#L241, which, when differentiated through, runs the very GPU-unfriendly https://github.com/FluxML/Zygote.jl/blob/v0.6.34/src/lib/array.jl#L197. I'm not sure why other activations are fine here (I would have to look at the call stack there to be sure). @mcabbott, would replacing https://github.com/FluxML/Zygote.jl/blob/v0.6.34/src/lib/broadcast.jl#L241 with y = ForwardDiff.value.(out) help here?
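A small sketch of what that one-liner computes, assuming out is the array of Duals produced by the forward-mode broadcast (the array below is just a stand-in, not the actual Zygote internals):

```julia
using ForwardDiff: Dual, value

# Stand-in for the array of Duals produced by Zygote's forward-mode broadcast.
out = [Dual{:tag}(Float32(i), 1f0) for i in 1:4]

y_map = map(d -> d.value, out)  # extraction via map; differentiating through this hits Zygote's generic map adjoint
y_bc  = value.(out)             # the suggested broadcast form, y = ForwardDiff.value.(out)

y_map == y_bc                   # true -- both recover the primal values
```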
In general, we would expect to be able to differentiate over higher orders with map (and also differentiate through f too). That line is pretty general, and would be the same for the GPU and CPU cases, iirc.
When running forward, yes, but the map adjoint captures the context along with a bunch of other non-GPU-friendly state in https://github.com/FluxML/Zygote.jl/blob/v0.6.34/src/lib/array.jl#L197. To my knowledge broadcasting does not do this, but whether switching map for broadcast might run into issues with nested Duals, I'm not sure.
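To make the nesting concern concrete, a small illustrative sketch (not from the thread): with second-order forward mode the elements become nested Duals, and value only strips the outermost tag.

```julia
using ForwardDiff: Dual, value

# A nested Dual, as would appear when forward mode is applied twice.
d = Dual{:outer}(Dual{:inner}(1.0, 2.0), Dual{:inner}(3.0, 4.0))

value(d)         # Dual{:inner}(1.0, 2.0) -- only the outer tag is stripped
value(value(d))  # 1.0
```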
Using leakyrelu causes a compilation error when differentiating through the following gradient penalty loss. It works on CPU, and with, for example, elu/relu on GPU. Throws:
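A minimal sketch of the kind of gradient penalty loss described in the report (the model, sizes, and penalty form below are assumptions, not the original reproducer):

```julia
using Flux, CUDA, Zygote

# Hypothetical model and data for illustration only.
model = Chain(Dense(10 => 32, leakyrelu), Dense(32 => 1)) |> gpu
x = CUDA.randn(Float32, 10, 8)

# A simplified gradient penalty: penalise the squared norm of the input gradient.
function gradient_penalty(m, x)
    g = gradient(x̂ -> sum(m(x̂)), x)[1]
    return sum(abs2, g)
end

# Differentiating the penalty w.r.t. the parameters is the second-order step
# that goes through Zygote's forward-mode broadcast rule discussed above.
gs = gradient(() -> gradient_penalty(model, x), Flux.params(model))
```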