asappresearch / sru

Training RNNs as Fast as CNNs (https://arxiv.org/abs/1709.02755)
MIT License

RuntimeError: SRU_Compute_GPULegacyBackward is not differentiable twice #43

Open desire2020 opened 6 years ago

desire2020 commented 6 years ago

I tried to implement a discriminator of WGAN-GP via SRU but failed with this error.

taoleicn commented 6 years ago

I need more information. Does your WGAN-GP model require high-order gradients?

Are these posts related? https://discuss.pytorch.org/t/avgpool2d-is-not-differentiable-twice/4566 https://discuss.pytorch.org/t/improved-wgan-scatter-is-not-differentiable-twice/9161/5

desire2020 commented 6 years ago

Yes, but they are of little help with this problem. The issue is that WGAN-GP has a term in its loss function which is the norm of the model's gradient w.r.t. an input. During the backward pass this term (which is already a first-order gradient) needs to be differentiated again, i.e. second-order gradients of the original network must be available. In TensorFlow this is not a problem, since tf.gradients handles it well, and the unofficial SRU implementation works there, although it is quite slow (in my experiments it takes about 4x the time of your well-optimized PyTorch version for exactly the same model in both frameworks).

In summary, WGAN-GP (and other gradient-penalty-based methods) requires at least second-order gradient availability from the modules it uses. If possible, could you look into this problem?
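For reference, a minimal sketch of the penalty term being described (`D` and `x_hat` are placeholder names for the discriminator and the interpolated inputs, not code from this repo):

```python
import torch

def gradient_penalty(D, x_hat, lambda_gp=10.0):
    # x_hat: interpolated samples between real and fake data (batch first).
    x_hat = x_hat.detach().requires_grad_(True)
    scores = D(x_hat)
    # create_graph=True keeps the graph of this first-order gradient so the
    # penalty below can itself be backpropagated -- this is exactly where
    # every module inside D must support double backward.
    grads, = torch.autograd.grad(
        outputs=scores.sum(), inputs=x_hat, create_graph=True
    )
    grads = grads.reshape(grads.size(0), -1)
    return lambda_gp * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
```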

BTW, SRU is an impressive piece of work according to my experiments, in terms of convergence, inference speed, and generalization ability compared to LSTM. Thank you for introducing it!

taoleicn commented 6 years ago

@desire2020 Thank you for trying it!

I'd have to look into how second-order gradient computation is implemented in PyTorch, and this may require additional non-trivial implementation work.
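For what it's worth, in recent PyTorch a custom autograd Function supports double backward as long as its backward is expressed with differentiable tensor ops that autograd can record; the hard part here would be doing the same for SRU's hand-written CUDA backward, which is what raises the error above. A toy sketch of the general pattern (not SRU's actual kernel):

```python
import torch

class MyMul(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)
        return x * y

    @staticmethod
    def backward(ctx, grad_out):
        x, y = ctx.saved_tensors
        # Because this is written with differentiable tensor ops, autograd
        # can record it and differentiate it again (double backward).
        return grad_out * y, grad_out * x

x = torch.randn(5, requires_grad=True)
y = torch.randn(5, requires_grad=True)
out = MyMul.apply(x, y).sum()
(gx,) = torch.autograd.grad(out, x, create_graph=True)  # first-order gradient
gx.pow(2).sum().backward()  # second-order gradient flows back into y
```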

I guess there's no obvious solution right now, unfortunately.