NVIDIA / framework-reproducibility

Providing reproducibility in deep learning frameworks
Apache License 2.0

Nondeterminism from tf.image.crop_and_resize #18

Closed xthyax closed 4 years ago

xthyax commented 4 years ago

Hi, first of all, congrats on your repo and your talk as well. I'm running into a situation where I cannot reproduce the result each time I train the model on a custom dataset.

System information

I'm using matterport's Mask-RCNN repo: https://github.com/matterport/Mask_RCNN

As far as I'm aware, I have locked everything down with a fixed seed at the beginning of my modified code:

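import os
import random

import numpy as np  # needed for np.random.seed below
import tensorflow as tf
from tfdeterminism import patch

patch()  # apply the tensorflow-determinism patch
random.seed(42)  # Python's built-in RNG
np.random.seed(42)  # NumPy's global RNG
tf.set_random_seed(42)  # TensorFlow graph-level seed (TF 1.x API)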

I also set a seed in my DataGenerator, so for runs with the same number of epochs, every iteration of every epoch uses the same image.

But I have noticed something in mrcnn/model.py:
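For readers following along, here is a minimal sketch of what a seeded data order can look like. It is not the Mask_RCNN data generator; the helper name `epoch_order` and the choice to derive each epoch's seed from `seed + epoch` are assumptions made for illustration only.

```python
import random


def epoch_order(num_images, epoch, seed=42):
    """Return image indices for one epoch in a reproducible shuffled order.

    Hypothetical helper: a private random.Random instance is seeded from
    (seed + epoch), so the order depends only on these arguments and not
    on the state of the global RNG.
    """
    rng = random.Random(seed + epoch)
    indices = list(range(num_images))
    rng.shuffle(indices)
    return indices


if __name__ == "__main__":
    # Two calls with the same arguments produce the same order.
    print(epoch_order(8, epoch=0) == epoch_order(8, epoch=0))
```

Using a private generator rather than the global one means other code that draws random numbers cannot shift the image order between runs.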

https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/model.py#L421

the function tf.image.crop_and_resize()

I think it is involved in the non-determinism issue. What did I miss? Please help.

duncanriach commented 4 years ago

Hi @xthyax, thank you and sorry for the delay in getting back to you. Yes, crop_and_resize_op_gpu.cu.cc is currently listed in the Other Possible GPU-Specific Sources of Non-Determinism section as using CUDA atomicAdd. So it's very likely that this op is a source of nondeterminism in your model.

I have not yet confirmed, with my own code, that this op is problematic though.
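As background on why atomicAdd can matter: floating-point addition is not associative, so when the order in which values are accumulated varies between runs, the sum can differ in its last bits. The sketch below shows this in plain Python on the CPU. It is not the CUDA kernel, and the helper name `accumulate` is hypothetical.

```python
def accumulate(values, order):
    """Sum `values` by visiting them in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


if __name__ == "__main__":
    values = [0.1, 0.2, 0.3]

    # Same three numbers, two different accumulation orders.
    forward = accumulate(values, [0, 1, 2])   # (0.1 + 0.2) + 0.3
    backward = accumulate(values, [2, 1, 0])  # (0.3 + 0.2) + 0.1

    print(forward == backward)  # False
    print(forward, backward)
```

On a GPU, the order in which threads reach an atomic add is not fixed, so this kind of reordering can happen between otherwise identical runs.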

If you're able to modify the code that you're running, could you please run that op on the CPU and see whether that gets you to perfect (bit-exact) reproducibility?

with tf.device('/cpu:0'):
  output = tf.image.crop_and_resize(...)

But there are also other steps to obtaining deterministic functionality, including ensuring that your trainable variables are the same each run before training, and that you're seeding everything that needs to be seeded.
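One way to check that trainable variables are the same before training is to hash their raw bytes and compare the digest across runs. The following is a rough sketch under assumptions: the helper name `weights_fingerprint` is hypothetical, and it only requires each item to have a `tobytes()` method, which NumPy arrays (such as those returned by Keras `model.get_weights()`) and `array.array` both provide. The stand-in data uses `array.array` so the sketch runs without TensorFlow.

```python
import hashlib
from array import array


def weights_fingerprint(weights):
    """Return a SHA-256 hex digest over a sequence of weight buffers.

    Hypothetical helper: any bit-level difference in any buffer changes
    the digest, which makes it a check for bit-exact equality.
    """
    digest = hashlib.sha256()
    for w in weights:
        digest.update(w.tobytes())
    return digest.hexdigest()


if __name__ == "__main__":
    # Stand-in "weights" for two runs; in practice these would come from
    # the model right after initialization, and again after training.
    run_a = [array("f", [0.1, 0.2]), array("f", [0.3])]
    run_b = [array("f", [0.1, 0.2]), array("f", [0.3])]
    print(weights_fingerprint(run_a) == weights_fingerprint(run_b))
```

Printing the digest at the start of each run shows whether initialization is reproducible. Printing it again after a fixed number of training steps shows whether training itself stays bit-exact.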

duncanriach commented 4 years ago

I've created TensorFlow issue 42033 for addition of determinism in tf.image.crop_and_resize backprop. Note that variance when running on CPU (of backprop-to-image) is greater than when running on GPU. Closing this issue now. Please re-open and/or continue the discussion here, if you need to.