NVIDIA / framework-reproducibility

Providing reproducibility in deep learning frameworks
Apache License 2.0

[WIP] Feature/patch/softmax cross entropy with logits #21

Open MFreidank opened 4 years ago

MFreidank commented 4 years ago

Addressing previously reported TensorFlow issue #38185. Also tied to tfdeterminism issues #19, #14, #9. The fused implementation of {sparse_,}softmax_cross_entropy_with_logits exhibits non-determinism in its backprop. This PR provides a patch that keeps the same function interface but performs the underlying computation in a non-fused way that is fully deterministic.
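For illustration, the non-fused sequence of steps can be sketched in NumPy. This is a framework-agnostic sketch of the math, not the PR's actual TensorFlow code; the function name mirrors the TF API but is otherwise hypothetical:

```python
import numpy as np

def softmax_cross_entropy_with_logits(labels, logits):
    """Non-fused softmax cross entropy: each step is a plain,
    elementwise/reduction array op with a fixed evaluation order,
    so repeated runs produce bitwise-identical results."""
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    log_softmax = shifted - log_sum_exp
    # Cross-entropy loss per example.
    loss = -(labels * log_softmax).sum(axis=-1)
    # Gradient w.r.t. logits: softmax(logits) - labels.
    grad = np.exp(log_softmax) - labels
    return loss, grad
```

For one-hot labels the gradient rows sum to zero (softmax sums to 1, labels sum to 1), which is a quick sanity check on the unfused formulation.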

duncanriach commented 4 years ago

Comments above. Heads-up, before you're done, please remember to implement test/test_fused_softmax_cross_entropy.py in the spirit of test/test_patch_bias_add.py. This will use the appropriate forward code from the upstream TensorFlow repo along with the deterministic gradient test code that I provided in TensorFlow Issue 38185 (also testing all the modified op bindings).
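The deterministic gradient test pattern from issue 38185 boils down to: run the backprop repeatedly on identical inputs and require bitwise-identical gradients every time. A minimal, framework-agnostic sketch of that harness (the helper name is hypothetical, not from the repo):

```python
import numpy as np

def assert_deterministic(compute_grad, inputs, trials=5):
    """Run a gradient computation repeatedly on identical inputs
    and require bitwise-identical results each time.

    Sketch of the test pattern only; the real tests run the op's
    backprop on a GPU, where fused-kernel atomics can vary."""
    baseline = compute_grad(*inputs)
    for _ in range(trials):
        repeat = compute_grad(*inputs)
        # Exact comparison: any nondeterministic accumulation
        # order shows up as a bitwise mismatch here.
        assert np.array_equal(baseline, repeat), "nondeterministic gradient"
    return baseline
```

Tolerance-based comparisons (e.g. `allclose`) would defeat the purpose here; determinism tests must demand exact equality.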

I'm going to work on adding enable_determinism, from which _patch_fused_softmax_cross_entropy can be called.

duncanriach commented 4 years ago

I just added enable_determinism with this commit. You should use enable_determinism to apply the patch in test/test_patch_fused_softmax_cross_entropy.py.

I'm going to reorganize the testing quite a lot before release. For the testing part, I would like you to focus on getting the following complete and working:

$ cd test
$ echo "Test TF1 API:"
$ ./container.sh tensorflow/tensorflow:1.14.0-gpu python test_patch_fused_softmax_cross_entropy.py
$ echo "Test TF2 API:"
$ ./container.sh tensorflow/tensorflow:2.2.0-gpu python test_patch_fused_softmax_cross_entropy.py

phqtuyen commented 4 years ago

Hi, my name is Thomas. I work on the Oracle Digital Assistant team, where we make extensive use of TensorFlow and tensorflow-determinism; we really appreciate this wonderful library. I just wanted to ask: is this PR going to be merged anytime soon?

duncanriach commented 4 years ago

Hi Thomas (@phqtuyen),

Thanks for letting us know. These ops seem to be generally high priority for determinism in TensorFlow.

@MFreidank: could you give an update on this and/or move it forward? Most of the rest of the infrastructure is now in place.

MFreidank commented 4 years ago

Hi @phqtuyen, @duncanriach, sorry for the wait; I had been pulled into other things. I've picked up work on this now and will have an update to the PR before the end of this week. I'll base my changes on the new infrastructure recently put in place.

duncanriach commented 4 years ago

Awesome. Just FYI, I'll be offline next week (2020-08-17 through 2020-08-21) and, since I'll need to be involved in reviews and iterations, that might slow things down a bit.

duncanriach commented 4 years ago

Hey @MFreidank, are you still planning on completing this PR?

duncanriach commented 4 years ago

Hi @MFreidank, I'm about to go offline again until 2020-09-16, but I wanted to check in with you again. I think that this op's nondeterminism is near the top of the priority/urgency list for TensorFlow, and I want to get it resolved ASAP. Can you commit to pushing this through from your end?

MFreidank commented 4 years ago

Hi Duncan,

Sorry for the delay on this. Yes, I'll pick it up starting tomorrow my time.

duncanriach commented 4 years ago

Thank you, Moritz.

MFreidank commented 4 years ago

Hi again @duncanriach, I've implemented a first version (on top of the most recent master branch with enable_determinism) that patches only tf.nn.sparse_softmax_cross_entropy_with_logits and tf.nn.softmax_cross_entropy_with_logits, and I've integrated the unit test case you shared into test/test_softmax_cross_entropy_with_logits.py. To verify that this works as intended I'll need access to my GPU, which will only be available to me in about 6 hours. If the test cases pass, I'll update the PR accordingly. What remains to be determined is which other functions need to be patched after that (by scanning through tensorflow/python, as you outlined above). I'll first focus on getting the minimal patch for the tf.nn functions working, including unit tests, and will then broaden it to cover the other functions.

duncanriach commented 4 years ago

Thanks, @MFreidank. Sounds good. Just to reiterate, I'll be offline until 2020-09-16 and I'll review this after that.

duncanriach commented 4 years ago

Hey @MFreidank, please will you submit those changes so that I can review them?

duncanriach commented 3 years ago

@MFreidank, we will soon release version 0.4.0. We intend to include the fused softmax/cross-entropy patch in version 0.5.0, and I want to get started on it ASAP. If I don't hear from you in the next 24 hours, we will proceed to implement it without you.