awslabs / sagemaker-debugger

Amazon SageMaker Debugger provides functionality to save tensors during training of machine learning jobs and analyze those tensors
Apache License 2.0
161 stars 83 forks

SMDDP should use size() and rank() for TF jobs #451

Closed ndodda-amazon closed 3 years ago

ndodda-amazon commented 3 years ago

Description of changes:

Fixes a bug introduced by #425, where get_size() and get_rank() were used for all jobs. The correct API for TF jobs is size() and rank().

This bug was missed before because we do not have any tests that run TF SMDDP jobs.
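A minimal sketch of the distinction described above, not the actual change in this PR: the helper name `get_distributed_info`, the `is_tf_job` flag, and the stand-in module objects are all hypothetical and exist only for illustration. The only details taken from the description are the accessor names: size() and rank() for TF jobs, get_size() and get_rank() otherwise.

```python
from types import SimpleNamespace


def get_distributed_info(dist_module, is_tf_job):
    """Return (rank, size) using the accessor names from the PR description.

    dist_module is any object exposing the relevant accessors; in a real job
    it would be the framework-specific distributed module (assumption).
    """
    if is_tf_job:
        # TF jobs expose rank() and size().
        return dist_module.rank(), dist_module.size()
    # Other jobs expose get_rank() and get_size().
    return dist_module.get_rank(), dist_module.get_size()


# Hypothetical stand-ins so the sketch runs without the real library.
fake_tf_module = SimpleNamespace(rank=lambda: 1, size=lambda: 4)
fake_other_module = SimpleNamespace(get_rank=lambda: 0, get_size=lambda: 4)

tf_info = get_distributed_info(fake_tf_module, is_tf_job=True)
other_info = get_distributed_info(fake_other_module, is_tf_job=False)

# Calling the non-TF accessors on a TF-style module fails, which is the
# kind of error a test running a TF job would have surfaced.
try:
    get_distributed_info(fake_tf_module, is_tf_job=False)
    wrong_api_raised = False
except AttributeError:
    wrong_api_raised = True
```

A test along these lines, run against a TF-style module, is one way the mismatch could have been caught before merge.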

Style and formatting:

I have run pre-commit install to ensure that auto-formatting happens with every commit.

Issue number, if available

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

codecov-io commented 3 years ago

Codecov Report

Merging #451 (d9598e1) into master (0131559) will decrease coverage by 0.57%. The diff coverage is 25.00%.

@@            Coverage Diff             @@
##           master     #451      +/-   ##
==========================================
- Coverage   65.86%   65.28%   -0.58%     
==========================================
  Files         161      162       +1     
  Lines       12747    12889     +142     
==========================================
+ Hits         8396     8415      +19     
- Misses       4351     4474     +123     
| Impacted Files | Coverage Δ |
| --- | --- |
| smdebug/core/utils.py | 78.49% <25.00%> (ø) |
| smdebug_rules/generic/create_xgboost_report.py | 17.70% <0.00%> (-8.39%) |
| smdebug/pytorch/hook.py | 76.03% <0.00%> (-0.64%) |
| smdebug/core/writer.py | 90.09% <0.00%> (ø) |
| smdebug_rules/generic/__init__.py | 100.00% <0.00%> (ø) |
| smdebug_rules/generic/constants.py | 100.00% <0.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 0131559...d9598e1.

ChoiByungWook commented 3 years ago

User is reporting a similar issue: https://github.com/aws/sagemaker-python-sdk/issues/2170