Closed: DavidUdell closed this 1 month ago
I am torn on whether this is actually a vanishing-gradients issue or a true observation in the data. A couple of fixes that should have worked for vanishing gradients--moving everything to float64, loss scaling--didn't change the results. But there may be an implementation-level detail in autograd that I'm not fully understanding. For now, I have a quick patch to prevent fatal crashes in this case, working on the assumption that this is a real observation and not a bug.
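For context, here's a minimal sketch of the two mitigations described above, assuming a standard PyTorch training loop (the model, optimizer, and `LOSS_SCALE` value here are hypothetical placeholders, not code from this repo):

```python
import torch

# Hypothetical stand-ins, just to illustrate the two mitigations tried:
# casting everything to float64 and static loss scaling.
model = torch.nn.Linear(512, 512).to(torch.float64)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
LOSS_SCALE = 2.0**16  # assumed scaling factor, not from the repo

def training_step(batch: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad()
    out = model(batch.to(torch.float64))
    loss = torch.nn.functional.mse_loss(out, targets.to(torch.float64))
    # Scale the loss up before backward() so tiny grads don't underflow...
    (loss * LOSS_SCALE).backward()
    # ...then unscale the accumulated grads before stepping.
    for p in model.parameters():
        if p.grad is not None:
            p.grad /= LOSS_SCALE
    optimizer.step()
    return loss.item()
```

If the grads were merely underflowing, float64's much wider dynamic range should have rescued them on its own, which is part of why neither change moving the results is suggestive.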
It's a vanishing-gradients issue, I'm fairly sure. Like, the autoencoders are thinning out the gradients excessively once you're far enough back in the network? And a zero gradient tensor at any point will totally wipe out otherwise-reasonable activation values, of course.
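Not the patch itself, but a hypothetical diagnostic along these lines (`grad_report` is a name I'm making up) would localize where the gradients actually hit zero after a backward pass:

```python
import torch

def grad_report(model: torch.nn.Module, eps: float = 1e-12) -> None:
    """Print each parameter's grad norm, flagging exact zeros.

    In backprop, a zero grad tensor at one layer zeroes the grads of
    every layer earlier in the forward pass, so flagged entries show
    where the signal first dies.
    """
    for name, p in model.named_parameters():
        if p.grad is None:
            print(f"{name}: no grad")
        else:
            norm = p.grad.norm().item()
            flag = "  <-- all zeros" if norm < eps else ""
            print(f"{name}: grad norm {norm:.3e}{flag}")
```

Calling this right after `loss.backward()` on a run that crashes should show whether the zeros start at the autoencoder layers and propagate backward from there, or appear everywhere at once.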