facebookresearch / HolisticTraceAnalysis

A library to analyze PyTorch traces.
http://hta.readthedocs.io
MIT License
293 stars 40 forks source link

[critical path] Add graph validation checks and fix 0 duration stack issue. #116

Closed briancoutinho closed 7 months ago

briancoutinho commented 7 months ago

What does this PR do?

This PR does two things

  1. Zero duration cudaEventRecord events were actually causing a regression because of the stacks being incorrect. On landing this previous fix #114 that rolls back to older callstack code, it led to the stacks being generated wrong. Edges had negative weight and thus critical path analysis was showing up negative numbers. We are ok filtering 0 duration events since we now compute the cudaEventRecord dataframe on the full trace dataframe.
  2. Considering the above, I added a _validate_graph() function that can sanity check the cp graph. It finds negative weights and cycles right now.

Testing

This was the issue we saw with negative weights, Trace link https://www.internalfb.com/manifold/explorer/gpu_traces/tree/critical_path_tests/cmf30x

Screenshot 2024-03-21 at 6 18 27 PM Screenshot 2024-03-21 at 6 38 11 PM

After the fix we have good values and no negative weights. Screenshot 2024-03-21 at 6 37 58 PM

Screenshot 2024-03-21 at 6 20 54 PM

Unit Test

If i re-introduce the bug with 0 duration events and the validation immediately fails on unit tes Screenshot 2024-03-21 at 6 42 36 PM

Before submitting

facebook-github-bot commented 7 months ago

@briancoutinho has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot commented 7 months ago

This pull request was exported from Phabricator. Differential Revision: D55254019

facebook-github-bot commented 7 months ago

@briancoutinho merged this pull request in facebookresearch/HolisticTraceAnalysis@4c0ad7ea03db0dbdb602f830f46f5d4cf14e7a38.