awslabs / sagemaker-debugger

Amazon SageMaker Debugger provides functionality to save tensors during training of machine learning jobs and analyze those tensors
Apache License 2.0

Use SMP rank and size when applicable #411

Closed: rahul003 closed this 3 years ago

rahul003 commented 3 years ago

Description of changes:

When SMP is used together with Horovod, there will be multiple Horovod 'groups'. In such cases, rank and size need to be queried from SMP rather than from Horovod.

This is not a problem for PyTorch, where there is a single torch.distributed group, or for MXNet, which SMP doesn't support.
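
The selection described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `resolve_rank_and_size` is a hypothetical helper, and it assumes the SMP and Horovod modules each expose `rank()` and `size()` callables. Stand-in objects are used here so the snippet runs without either library installed.

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple


def resolve_rank_and_size(
    smp: Optional[Any] = None, hvd: Optional[Any] = None
) -> Tuple[int, int]:
    """Return (rank, size), preferring SMP over Horovod.

    Hypothetical helper: `smp` and `hvd` are assumed to be module-like
    objects exposing rank() and size(), or None when not in use.
    """
    if smp is not None:
        # With SMP active there may be several Horovod groups,
        # so SMP is asked for the rank and size.
        return smp.rank(), smp.size()
    if hvd is not None:
        # Horovod alone: a single group, so its own view is used.
        return hvd.rank(), hvd.size()
    # No distributed framework detected: single-process defaults.
    return 0, 1


# Stand-ins for the real modules (assumed values, for illustration only).
fake_smp = SimpleNamespace(rank=lambda: 5, size=lambda: 8)
fake_hvd = SimpleNamespace(rank=lambda: 1, size=lambda: 4)

print(resolve_rank_and_size(smp=fake_smp, hvd=fake_hvd))
print(resolve_rank_and_size(hvd=fake_hvd))
print(resolve_rank_and_size())
```

Passing the modules in as arguments, rather than importing them inside the helper, keeps the precedence rule easy to test in isolation.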

codecov-io commented 3 years ago

Codecov Report

Merging #411 (42d300b) into master (9d2d0c3) will decrease coverage by 1.57%. The diff coverage is 62.96%.


@@            Coverage Diff             @@
##           master     #411      +/-   ##
==========================================
- Coverage   77.70%   76.13%   -1.58%     
==========================================
  Files         113      113              
  Lines       10139    10165      +26     
==========================================
- Hits         7879     7739     -140     
- Misses       2260     2426     +166     
Impacted Files                                  Coverage Δ
smdebug/tensorflow/base_hook.py                 70.28% <18.18%> (-5.75%)
smdebug/core/utils.py                           79.72% <74.41%> (+1.98%)
smdebug/tensorflow/callable_cache.py            52.17% <0.00%> (-26.09%)
smdebug/tensorflow/utils.py                     64.59% <0.00%> (-23.45%)
smdebug/tensorflow/singleton_utils.py           83.33% <0.00%> (-16.67%)
smdebug/profiler/tf_profiler_parser.py          54.54% <0.00%> (-11.58%)
smdebug/tensorflow/collection.py                84.53% <0.00%> (-11.35%)
smdebug/tensorflow/keras.py                     79.21% <0.00%> (-11.00%)
smdebug/rules/action/stop_training_action.py    56.45% <0.00%> (-9.68%)
smdebug/core/logger.py                          66.12% <0.00%> (-8.07%)
... and 31 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 9d2d0c3...42d300b.

rahul003 commented 3 years ago

ok, will do