awslabs / sagemaker-debugger

Amazon SageMaker Debugger provides functionality to save tensors during training of machine learning jobs and analyze those tensors
Apache License 2.0

list training jobs improvements #349

Closed Vikas-kum closed 4 years ago

Vikas-kum commented 4 years ago

Description of changes:

Previously, listing training jobs always made 50 attempts, regardless of the outcome, which could generate unnecessary traffic. This change adds two early exits from the retry loop: 1) if training jobs with the prefix are found, we break; 2) if exceptions are caught more than 5 times, we break.
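The two break conditions can be sketched roughly as follows. This is an illustrative sketch, not the code from this PR: the helper name `list_jobs_with_prefix` and its parameters are hypothetical, and `client` is assumed to be any object exposing `list_training_jobs` with the response shape of the boto3 SageMaker client.

```python
def list_jobs_with_prefix(client, prefix, max_attempts=50, max_exceptions=5):
    """Hypothetical helper: return names of training jobs that start with `prefix`."""
    found = []
    exceptions_caught = 0
    for _ in range(max_attempts):
        try:
            response = client.list_training_jobs(NameContains=prefix)
        except Exception:
            exceptions_caught += 1
            if exceptions_caught > max_exceptions:
                break  # condition 2: exceptions caught more than 5 times
            continue
        # NameContains matches substrings, so filter to a true prefix match here.
        found = [
            summary["TrainingJobName"]
            for summary in response.get("TrainingJobSummaries", [])
            if summary["TrainingJobName"].startswith(prefix)
        ]
        if found:
            break  # condition 1: jobs with the prefix were found
    return found
```

With these defaults, a client that fails on every call is queried 6 times instead of 50, and a client that returns a matching job immediately is queried once.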

Style and formatting:

I have run pre-commit install to ensure that auto-formatting happens with every commit.

Issue number, if available

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

codecov-commenter commented 4 years ago

Codecov Report

Merging #349 into master will increase coverage by 3.54%. The diff coverage is 33.33%.

```diff
@@            Coverage Diff             @@
##           master     #349      +/-   ##
==========================================
+ Coverage   81.96%   85.51%   +3.54%
==========================================
  Files          86       86
  Lines        6460     6468       +8
==========================================
+ Hits         5295     5531     +236
+ Misses       1165      937     -228
```

| Impacted Files | Coverage Δ |
|---|---|
| smdebug/rules/action/stop_training_action.py | 77.41% <33.33%> (-0.36%) :arrow_down: |
| smdebug/core/utils.py | 84.03% <0.00%> (+0.46%) :arrow_up: |
| smdebug/core/json_config.py | 96.74% <0.00%> (+0.81%) :arrow_up: |
| smdebug/core/hook.py | 93.45% <0.00%> (+1.78%) :arrow_up: |
| smdebug/tensorflow/collection.py | 95.87% <0.00%> (+2.06%) :arrow_up: |
| smdebug/core/reductions.py | 93.47% <0.00%> (+2.17%) :arrow_up: |
| smdebug/tensorflow/tensor_ref.py | 88.70% <0.00%> (+3.22%) :arrow_up: |
| smdebug/tensorflow/base_hook.py | 80.55% <0.00%> (+5.15%) :arrow_up: |
| smdebug/core/singleton_utils.py | 85.29% <0.00%> (+5.88%) :arrow_up: |
| ... and 2 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 40582f5...d5f7ec2.

NihalHarish commented 4 years ago
  1. if there are training jobs found with prefix, we break
  2. if there are exceptions caught more than 5 times we break.

Can we add tests for both these behaviors?
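One possible shape for such tests, sketched with `unittest.mock`. Because the function under test is not shown in this thread, `poll_for_jobs` below is a hypothetical stand-in that only mirrors the two break conditions from the description; real tests would import the actual function and mock its SageMaker client instead.

```python
from unittest import mock


def poll_for_jobs(list_jobs, max_attempts=50, max_exceptions=5):
    """Hypothetical stand-in for the retry loop under test."""
    exceptions_caught = 0
    for _ in range(max_attempts):
        try:
            jobs = list_jobs()
        except Exception:
            exceptions_caught += 1
            if exceptions_caught > max_exceptions:
                break  # give up once exceptions exceed the limit
            continue
        if jobs:
            return jobs  # stop as soon as matching jobs are found
    return []


def test_breaks_when_jobs_found():
    # First call finds nothing, second call finds a job; no further calls expected.
    list_jobs = mock.Mock(side_effect=[[], ["demo-job-1"]])
    assert poll_for_jobs(list_jobs) == ["demo-job-1"]
    assert list_jobs.call_count == 2


def test_breaks_after_more_than_five_exceptions():
    # Every call raises; the loop should stop on the 6th exception, not run 50 times.
    list_jobs = mock.Mock(side_effect=RuntimeError("throttled"))
    assert poll_for_jobs(list_jobs) == []
    assert list_jobs.call_count == 6
```

Asserting on `call_count` is what distinguishes the new behavior from the old fixed 50 attempts; the return value alone would not catch a regression.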