awslabs / sagemaker-debugger

Amazon SageMaker Debugger provides functionality to save tensors during training of machine learning jobs and analyze those tensors
Apache License 2.0

Updated heatmap visualizations #481

Closed NRauschmayr closed 3 years ago

NRauschmayr commented 3 years ago

Description of changes:

I updated the heatmap to provide better visualizations for large-scale distributed training jobs. For instance, users can now aggregate metrics per worker node. Previously, aggregation per worker node was not supported, so the visualization became overwhelming for large jobs. The following image shows utilization per GPU and CPU core:

[image]

With the updated heatmap, users can aggregate utilization per worker node, as in the following image:

[image: Screen Shot 2021-04-13 at 12 24 43 PM]
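To illustrate the idea of per-node aggregation, here is a minimal sketch in plain Python. It is not the code in heatmap.py; the function name `aggregate_per_node`, the sample tuple layout, and the node and device names are all assumptions made for this example. It collapses per-device utilization samples into one mean value per worker node and time step, which is the shape a node-level heatmap row would need.

```python
from collections import defaultdict
from statistics import mean


def aggregate_per_node(samples):
    """Average utilization over all devices of a node, per time step.

    samples: iterable of (timestamp, node, device, utilization_percent).
    Returns {node: {timestamp: mean_utilization}}.
    Hypothetical helper for illustration, not part of smdebug.
    """
    buckets = defaultdict(list)
    for timestamp, node, _device, utilization in samples:
        buckets[(node, timestamp)].append(utilization)

    per_node = defaultdict(dict)
    for (node, timestamp), values in buckets.items():
        per_node[node][timestamp] = mean(values)
    return dict(per_node)


# Hypothetical samples: two worker nodes, several devices each, two time steps.
samples = [
    (0, "algo-1", "gpu0", 90.0),
    (0, "algo-1", "gpu1", 70.0),
    (0, "algo-1", "cpu0", 20.0),
    (0, "algo-2", "gpu0", 50.0),
    (0, "algo-2", "cpu0", 30.0),
    (1, "algo-1", "gpu0", 100.0),
    (1, "algo-1", "gpu1", 80.0),
    (1, "algo-2", "gpu0", 10.0),
]

node_rows = aggregate_per_node(samples)
```

Each entry of `node_rows` corresponds to one heatmap row, so a job with many GPUs and CPU cores per node is reduced to one row per worker node.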

Style and formatting:

I have run pre-commit install to ensure that auto-formatting happens with every commit.

Issue number, if available

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

codecov-io commented 3 years ago

Codecov Report

Merging #481 (a2d627b) into master (a7c697f) will decrease coverage by 1.49%. The diff coverage is 0.00%.


@@            Coverage Diff             @@
##           master     #481      +/-   ##
==========================================
- Coverage   67.06%   65.57%   -1.50%     
==========================================
  Files         173      163      -10     
  Lines       13280    12917     -363     
==========================================
- Hits         8906     8470     -436     
- Misses       4374     4447      +73     
| Impacted Files | Coverage Δ |
|---|---|
| ...mdebug/profiler/analysis/notebook_utils/heatmap.py | 0.00% <0.00%> (ø) |
| smdebug/mxnet/graph.py | 25.80% <0.00%> (-32.26%) |
| ...profiler/analysis/utils/profiler_data_to_pandas.py | 36.07% <0.00%> (-28.77%) |
| smdebug/mxnet/utils.py | 59.37% <0.00%> (-28.13%) |
| ...ler/analysis/notebook_utils/step_timeline_chart.py | 0.00% <0.00%> (-21.32%) |
| smdebug/core/logger.py | 70.83% <0.00%> (-12.50%) |
| smdebug/core/tfevent/summary.py | 81.35% <0.00%> (-11.87%) |
| smdebug/core/reduction_config.py | 88.31% <0.00%> (-7.80%) |
| smdebug/core/utils.py | 78.15% <0.00%> (-7.17%) |
| smdebug/core/singleton_utils.py | 85.29% <0.00%> (-5.89%) |

... and 22 more
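As a rough check on the totals in the coverage diff, the percentages can be recomputed from the hit and line counts. The sketch below assumes coverage is simply hits divided by lines; the helper name `coverage_percent` is made up for this example and is not a Codecov API.

```python
def coverage_percent(hits, lines):
    """Line coverage as a percentage (assumed definition: hits / lines)."""
    return 100.0 * hits / lines


# Totals taken from the coverage diff in this report.
master = coverage_percent(8906, 13280)
branch = coverage_percent(8470, 12917)

# Misses are the lines that were not hit.
master_misses = 13280 - 8906
branch_misses = 12917 - 8470

# Difference computed from the unrounded percentages.
drop = round(master - branch, 2)
```

Under that assumption the recomputed values match the reported 67.06% and 65.57%, and the unrounded difference rounds to the 1.49% quoted in the summary sentence; the -1.50% shown in the diff row presumably reflects a different rounding step.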


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update a7c697f...a2d627b.

NihalHarish commented 3 years ago

We need to add a way to test visualizations going forward; for now I will approve these changes.

atqy commented 2 years ago

Just want to ask: why was the update_data function deleted? What is the alternative?