Just to leave a response here: we ran into some unrelated issues running your code, and running it is necessary to track down where your NaNs are coming from.
Btw: we highly suggest using tf.keras, for several reasons.
We now provide a very basic MNIST example (without any fancy stuff) to showcase the model training process:
https://github.com/cgtuebingen/Flex-Convolution/blob/master/basic_mnist_3d.py
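For readers who want something even smaller than the linked script, here is a minimal sketch of a plain tf.keras training loop on ordinary 2D MNIST. This is *not* the FlexConvolution 3D example above; the architecture and epoch count are arbitrary placeholders, just to illustrate the tf.keras API:

```python
import tensorflow as tf

# Load and normalize standard 2D MNIST (uint8 -> float in [0, 1]).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A deliberately tiny model; layer sizes here are arbitrary.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=2, validation_data=(x_test, y_test))
```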
Thanks for the example :)
Thanks for releasing your code!
I realise you haven't finished releasing everything yet, but I had a try anyway and noticed an issue. I'm not asking you to fix it, just reporting it in case you were unaware.
When I tried a full model I noticed that I get NaNs if I use more than one layer of pooling or deconv. It sometimes happens only after a few batches rather than immediately. When using CUDA it gives pretty unhelpful errors, but when setting
`CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=`
it's much clearer where the NaNs come from. So it seems these layers might include operations that introduce unstable gradients. I haven't looked at your *.cc files for the culprit though.
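In case it helps reproduce, this is roughly how I force CPU execution and pin down the first bad tensor. It's just a sketch: `checked` is my own helper, `pool1`/`deconv1` are hypothetical stand-ins for the suspect layers, and `tf.debugging.check_numerics` needs a reasonably recent TF:

```python
import os

# Hide all GPUs *before* TensorFlow initializes, so every op runs on CPU
# and a failing op is reported by name instead of an opaque CUDA error.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import tensorflow as tf

def checked(x, name):
    # Raises InvalidArgumentError the moment `x` contains a NaN or Inf,
    # with `name` in the message, so the first bad layer is obvious.
    return tf.debugging.check_numerics(x, message="NaN/Inf after " + name)

# Dummy input standing in for a point-cloud feature tensor.
x = tf.random.normal([4, 1024, 64])
x = checked(x, "input")
# x = checked(pool1(x), "pool1")      # hypothetical suspect layers
# x = checked(deconv1(x), "deconv1")
```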