Just to leave a response here: we ran into some unrelated issues running your code, and running it is necessary to track down where your NaNs are coming from.
Btw: we highly suggest using tf.keras, for several reasons.
We now provide a very basic MNIST example (without any fancy stuff) to showcase the model training process:
https://github.com/cgtuebingen/Flex-Convolution/blob/master/basic_mnist_3d.py
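For readers who want something even smaller than the linked script, here is a minimal sketch of a plain tf.keras training loop on ordinary 2D MNIST. This is *not* the FlexConvolution 3D example above; the architecture and epoch count are arbitrary placeholders, just to illustrate the tf.keras API:

```python
import tensorflow as tf

# Load and normalize standard 2D MNIST (uint8 -> float in [0, 1]).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A deliberately tiny model; layer sizes here are arbitrary.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=2, validation_data=(x_test, y_test))
```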
Thanks for the example :)
Thanks for releasing your code!
I realise you haven't finished releasing everything yet, but I had a try anyway and noticed an issue. I'm not asking you to fix it, just reporting it in case you were unaware.
When I tried a full model I noticed that I get NaNs if I use more than one layer of pooling or deconv. It sometimes happens only after a few batches rather than immediately. When using CUDA it gives pretty unhelpful errors, but when setting
`CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=`
it's much clearer where the NaNs come from. So it seems these layers might include operations that introduce unstable gradients. I haven't looked at your *.cc files for the culprit though.
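In case it helps reproduce, this is roughly how I force CPU execution and pin down the first bad tensor. It's just a sketch: `checked` is my own helper, `pool1`/`deconv1` are hypothetical stand-ins for the suspect layers, and `tf.debugging.check_numerics` needs a reasonably recent TF:

```python
import os

# Hide all GPUs *before* TensorFlow initializes, so every op runs on CPU
# and a failing op is reported by name instead of an opaque CUDA error.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import tensorflow as tf

def checked(x, name):
    # Raises InvalidArgumentError the moment `x` contains a NaN or Inf,
    # with `name` in the message, so the first bad layer is obvious.
    return tf.debugging.check_numerics(x, message="NaN/Inf after " + name)

# Dummy input standing in for a point-cloud feature tensor.
x = tf.random.normal([4, 1024, 64])
x = checked(x, "input")
# x = checked(pool1(x), "pool1")      # hypothetical suspect layers
# x = checked(deconv1(x), "deconv1")
```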