awslabs / sagemaker-debugger

Amazon SageMaker Debugger provides functionality to save tensors during training of machine learning jobs and analyze those tensors
Apache License 2.0

Profiler tf native training #420

Open sophiayue1116 opened 3 years ago

sophiayue1116 commented 3 years ago

Description of changes:

This PR enables the profiler in TensorFlow 2 native training (design doc: https://quip-amazon.com/v0MwAkTizZl9/Profiler-for-TensorFlow2-native-training). The corresponding integration tests for TF 2.2 and TF 2.3 passed.

TF 2.2 integration test: https://console.aws.amazon.com/codesuite/codebuild/072677473360/projects/smprofiler_tf2_integration_tests/build/smprofiler_tf2_integration_tests%3A2bff3f63-b797-4c5e-9992-0fdf17f13bec?region=us-east-1

TF 2.3 integration test: https://console.aws.amazon.com/codesuite/codebuild/072677473360/projects/smprofiler_tf_2_3_integration_tests/build/smprofiler_tf_2_3_integration_tests%3A801a9a02-30b5-4af6-9835-9b01a6ed6ce4/?region=us-east-1

The changes include:

  1. Added profiling_start_batch(), profiling_end_batch(), and profiling_end() to keras.py to enable profiler functionality in the native training loop.
  2. Added python_profiler as a KerasHook attribute, which is cleaner and makes Python profiling easier to test.
  3. Added is_profiler_native_training (defaults to False) as a KerasHook attribute to indicate that the profiler is enabled in TensorFlow 2 native training. It is used to distinguish the three use cases: debugger only, profiler only, and both debugger and profiler enabled.
  4. Added _decrement_step() to decrease the step number when both the profiler and the debugger are enabled. In that case, the step number is first increased by 1 inside profiling_start_batch() and then decreased by 1 inside wrap_tape() before _wrap_tape_push() is called, so that the debugger code inside _wrap_tape_push() stays unchanged.
  5. Added _handle_start_python_profiling(), _handle_end_python_profiling(), _handle_start_detailed_profiling(), _handle_end_detailed_profiling(), _handle_start_dataloader_profiling(), and _handle_end_dataloader_profiling() to keras.py to reduce code duplication.
  6. Updated _increment_step() in hook.py so that increasing the step and writing state can be done separately.
  7. Added unit tests for the profiler-only and profiler + debugger use cases.
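
The step bookkeeping described in item 4 can be sketched roughly as follows. This is a simplified stand-in, not the smdebug code: SketchHook, run_batches, the events list, the argument-free wrap_tape(), and the call order in the loop are all assumptions made for illustration. Only the method names mirror the ones listed above.

```python
class SketchHook:
    """Hypothetical stand-in for illustration only; not smdebug's KerasHook."""

    def __init__(self, debugger_enabled, is_profiler_native_training=False):
        self.debugger_enabled = debugger_enabled
        self.is_profiler_native_training = is_profiler_native_training
        self.step = 0
        self.events = []  # (event name, step seen by that event)

    def _increment_step(self):
        self.step += 1

    def _decrement_step(self):
        self.step -= 1

    def profiling_start_batch(self):
        # Profiler path: advance the step at the start of the batch.
        self._increment_step()
        self.events.append(("start_batch", self.step))

    def _wrap_tape_push(self):
        # Debugger path, left unchanged: it always advances the step itself.
        self._increment_step()
        self.events.append(("tape_push", self.step))

    def wrap_tape(self):
        # If the profiler already advanced the step for this batch, undo that
        # first so the debugger's own increment does not count the batch twice.
        if self.is_profiler_native_training:
            self._decrement_step()
        self._wrap_tape_push()

    def profiling_end_batch(self):
        self.events.append(("end_batch", self.step))


def run_batches(hook, num_batches):
    """Drive the stand-in hook through a toy training loop (assumed call order)."""
    for _ in range(num_batches):
        if hook.is_profiler_native_training:
            hook.profiling_start_batch()
        if hook.debugger_enabled:
            hook.wrap_tape()
        if hook.is_profiler_native_training:
            hook.profiling_end_batch()
    return hook.step


# All three use cases end on the same step count after three batches.
print(run_batches(SketchHook(debugger_enabled=True), 3))
print(run_batches(SketchHook(debugger_enabled=False, is_profiler_native_training=True), 3))
print(run_batches(SketchHook(debugger_enabled=True, is_profiler_native_training=True), 3))
```

In this sketch the step advances exactly once per batch in every configuration, and with both features enabled the profiler and debugger events of a batch see the same step number.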

Style and formatting:

I have run pre-commit install to ensure that auto-formatting happens with every commit.

Issue number, if available

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

codecov-io commented 3 years ago

Codecov Report

Merging #420 (f704e48) into master (6788e32) will decrease coverage by 14.18%. The diff coverage is 5.68%.


@@             Coverage Diff             @@
##           master     #420       +/-   ##
===========================================
- Coverage   76.91%   62.72%   -14.19%     
===========================================
  Files         113      113               
  Lines       10195    10237       +42     
===========================================
- Hits         7841     6421     -1420     
- Misses       2354     3816     +1462     

Impacted Files                            Coverage Δ
smdebug/tensorflow/keras.py               0.00% <0.00%> (-90.10%) :arrow_down:
smdebug/core/hook.py                      89.33% <100.00%> (-4.56%) :arrow_down:
smdebug/tensorflow/__init__.py            0.00% <0.00%> (-100.00%) :arrow_down:
smdebug/tensorflow/constants.py           0.00% <0.00%> (-100.00%) :arrow_down:
smdebug/tensorflow/singleton_utils.py     0.00% <0.00%> (-100.00%) :arrow_down:
smdebug/tensorflow/collection.py          0.00% <0.00%> (-95.88%) :arrow_down:
smdebug/tensorflow/session.py             0.00% <0.00%> (-91.83%) :arrow_down:
smdebug/tensorflow/tensor_ref.py          0.00% <0.00%> (-88.71%) :arrow_down:
smdebug/tensorflow/utils.py               0.00% <0.00%> (-87.62%) :arrow_down:
... and 30 more
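
The headline percentages in the coverage diff are consistent with hits divided by lines. The check below is a small sketch of that arithmetic using only the numbers from the report. The helper name coverage_pct is made up for illustration, and how Codecov itself rounds its figures is not stated in the report.

```python
def coverage_pct(hits, lines):
    """Line coverage as a percentage (hypothetical helper for this check)."""
    return 100.0 * hits / lines


before = coverage_pct(7841, 10195)  # master (6788e32)
after = coverage_pct(6421, 10237)   # this PR (f704e48)
delta = after - before

# Hits plus misses should account for every line in both columns.
assert 7841 + 2354 == 10195
assert 6421 + 3816 == 10237

print(f"before={before:.2f}% after={after:.2f}% delta={delta:.2f}%")
```

The unrounded difference is about 14.187 points, which rounds to the 14.19 shown in the diff table and truncates to the 14.18 quoted in the summary sentence.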

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 6788e32...f704e48.