awslabs / sagemaker-debugger

Amazon SageMaker Debugger provides functionality to save tensors during training of machine learning jobs and analyze those tensors
Apache License 2.0

Fix bug in TF2 error handling #489

Closed: ndodda-amazon closed this 3 years ago

ndodda-amazon commented 3 years ago

Description of changes:

Creating a deep copy of the tape in wrap_tape for error handling introduced a bug that surfaced only in SMDDP distributed training. This change fixes the bug by removing the deep copy, which was ultimately unnecessary.

I also ran the full suite of profiler integration tests; the SMDDP distributed training test passes with this change. https://tiny.amazon.com/19wb5qhx3/IsenLink
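
For readers unfamiliar with the pattern, here is a minimal sketch of how deep-copying an object before wrapping it differs from wrapping it in place. It is an illustration only, not the smdebug code and not a diagnosis of the SMDDP failure: `FakeTape`, `wrap_with_deep_copy`, `wrap_in_place`, and `_guard` are hypothetical names, and `FakeTape` is a plain Python stand-in rather than a TensorFlow gradient tape.

```python
import copy


class FakeTape:
    """Hypothetical stand-in for a gradient tape; not a TensorFlow class."""

    def __init__(self):
        self.watched = []

    def watch(self, name):
        self.watched.append(name)

    def gradient(self):
        return list(self.watched)


def _guard(tape, on_error):
    """Replace tape.gradient with a version that reports errors instead of raising."""
    original = tape.gradient

    def safe_gradient(*args, **kwargs):
        try:
            return original(*args, **kwargs)
        except Exception as exc:  # broad on purpose: this is an error-handling shim
            on_error(exc)
            return None

    tape.gradient = safe_gradient


def wrap_with_deep_copy(tape, on_error):
    # Copy first, then wrap the copy. The returned object is a different
    # instance, so later changes to the caller's tape are not visible on it.
    wrapped = copy.deepcopy(tape)
    _guard(wrapped, on_error)
    return wrapped


def wrap_in_place(tape, on_error):
    # Wrap the object the caller passed in; identity and state are preserved.
    _guard(tape, on_error)
    return tape


errors = []
tape = FakeTape()

copied = wrap_with_deep_copy(tape, errors.append)
tape.watch("w")  # recorded on the original only
copied_result = copied.gradient()

same = wrap_in_place(tape, errors.append)
same_result = same.gradient()
```

In this sketch the copied wrapper never sees the `"w"` that was watched on the original, while the in-place wrapper does. Any framework that holds a reference to the original object, or attaches state to it after wrapping, would diverge from the copy in the same way.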

Style and formatting:

I have run pre-commit install to ensure that auto-formatting happens with every commit.

Issue number, if available

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

codecov-commenter commented 3 years ago

Codecov Report

Merging #489 (28e3b3d) into master (6ee3ec3) will decrease coverage by 0.46%. The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master     #489      +/-   ##
==========================================
- Coverage   64.35%   63.88%   -0.47%     
==========================================
  Files         174      164      -10     
  Lines       13370    13022     -348     
==========================================
- Hits         8604     8319     -285     
+ Misses       4766     4703      -63     
| Impacted Files | Coverage Δ | |
|---|---|---|
| smdebug/tensorflow/keras.py | 64.84% <0.00%> (+0.04%) | :arrow_up: |
| smdebug/xgboost/utils.py | 0.00% <0.00%> (-14.76%) | :arrow_down: |
| smdebug/xgboost/hook.py | 0.00% <0.00%> (-4.35%) | :arrow_down: |
| smdebug/core/access_layer/s3.py | 91.54% <0.00%> (-4.23%) | :arrow_down: |
| smdebug/core/reader.py | 85.18% <0.00%> (-3.71%) | :arrow_down: |
| smdebug/exceptions.py | 65.47% <0.00%> (-2.39%) | :arrow_down: |
| smdebug/core/access_layer/file.py | 96.00% <0.00%> (-2.00%) | :arrow_down: |
| smdebug/core/tfrecord/tensor_reader.py | 95.45% <0.00%> (-1.52%) | :arrow_down: |
| smdebug/core/locations.py | 86.11% <0.00%> (-1.39%) | :arrow_down: |
| smdebug/core/tensor.py | 79.03% <0.00%> (-1.21%) | :arrow_down: |
| ... and 14 more | | |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 6ee3ec3...28e3b3d.