NVIDIA / framework-reproducibility

Providing reproducibility in deep learning frameworks
Apache License 2.0

Inc. solution for AutoGraph loop conversion #16

Closed gavins13 closed 4 years ago

gavins13 commented 4 years ago

Included solution for AutoGraph's conversion of for loops to while_loops with parallel iterations on the GPU
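
For context, a minimal sketch of the kind of workaround involved, not necessarily the code in this PR: when a Python for loop over a tensor is converted by AutoGraph inside a tf.function, tf.autograph.experimental.set_loop_options can pin the generated tf.while_loop to a single parallel iteration (the function name and values below are illustrative).

```python
import tensorflow as tf

@tf.function
def accumulate(xs):
    total = tf.constant(0.0)
    for x in xs:  # AutoGraph converts this loop to tf.while_loop when xs is a Tensor
        # Illustrative: force the converted loop to run its iterations serially
        tf.autograph.experimental.set_loop_options(parallel_iterations=1)
        total += x
    return total

print(accumulate(tf.constant([1.0, 2.0, 3.0])))  # tf.Tensor(6.0, shape=(), dtype=float32)
```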

duncanriach commented 4 years ago

Hey @gavinlive!

Thanks for tracking down and reporting this source of non-determinism. I want to capture this knowledge in the document, but I don't believe that Confirmed Current GPU-Specific Sources of Non-Determinism (With Solutions) is the correct location. Without thoroughly exploring the underlying cause, it's not clear to me that this is a purely GPU-related source of non-determinism. Do you agree with that assessment? If so, I think that it should be covered in two places:

  1. Additional Ingredients in the Determinism Recipe
  2. Sources of Non-Determinism in TensorFlow Unrelated to GPU

I also think that this information can be made more general. The use of tf.while_loop when parallel_iterations is greater than 1 (noting that 10 is the default) may introduce non-determinism into model functionality. Additionally, the AutoGraph Transformations caused by tf.function may lead to loops being implemented using tf.while_loop and therefore parallelized, introducing the same non-determinism.
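
For example, here is a minimal sketch of pinning an explicit tf.while_loop to serial iteration; the loop body is illustrative, not taken from this PR.

```python
import tensorflow as tf

def sum_of_squares(n):
    i = tf.constant(0)
    total = tf.constant(0)
    # parallel_iterations=1 forces iterations to run sequentially, removing
    # this potential source of non-determinism (the default is 10)
    _, total = tf.while_loop(
        cond=lambda i, total: i < n,
        body=lambda i, total: (i + 1, total + i * i),
        loop_vars=[i, total],
        parallel_iterations=1)
    return total

print(sum_of_squares(tf.constant(5)))  # tf.Tensor(30, shape=(), dtype=int32)
```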

Please will you enhance this PR appropriately?

Finally, I would like to add your name to the credits. I can simply add @gavinlive, if you want, or if you give me your first and last name then I will add that.

gavins13 commented 4 years ago

Hi @duncanriach,

I will enhance this PR. You can add me as Gavin Seegoolam, thank you!

duncanriach commented 4 years ago

Oh, hey Gavin! We interacted by email recently. Did you get my response to your email?

gavins13 commented 4 years ago

Yes, I received your response. I believe I replied, but just in case: I opened a new TensorFlow issue here: https://github.com/tensorflow/tensorflow/issues/39751 and also opened another pull request on this repo which references it.

Hope you're doing well!

duncanriach commented 4 years ago

I didn't receive an email response from you, but I'll get to that other pull request and the TF issue soon.

duncanriach commented 4 years ago

Hey @gavinlive, I want to get this PR merged before it starts conflicting. I'm planning to merge it and then do the enhancements I mentioned above.