Closed briancoutinho closed 7 months ago
@briancoutinho has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This pull request was exported from Phabricator. Differential Revision: D55254019
@briancoutinho merged this pull request in facebookresearch/HolisticTraceAnalysis@4c0ad7ea03db0dbdb602f830f46f5d4cf14e7a38.
What does this PR do?
This PR does two things:
1. `cudaEventRecord` events were causing a regression because the stacks were incorrect. Landing the previous fix #114, which rolls back to the older callstack code, led to the stacks being generated incorrectly. Edges had negative weights, so critical path analysis was showing negative numbers. We are OK filtering 0-duration events since we now compute the `cudaEventRecord` dataframe on the full trace dataframe.
2. Adds a `_validate_graph()` function that can sanity-check the critical path (CP) graph. It currently finds negative weights and cycles.
Testing
This was the issue we saw with negative weights. Trace link: https://www.internalfb.com/manifold/explorer/gpu_traces/tree/critical_path_tests/cmf30x
After the fix we have good values and no negative weights.
Unit Test
If I re-introduce the bug with 0-duration events, the validation immediately fails in the unit test.
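For readers unfamiliar with this kind of check, here is a minimal sketch of a graph validation that flags negative edge weights and cycles. It is not the code in this PR: the function name `validate_graph`, the `(src, dst, weight)` edge-tuple format, and the return shape are all assumptions made for illustration, and it uses only the standard library rather than the graph type the critical path analysis actually uses.

```python
from collections import defaultdict, deque


def validate_graph(edges):
    """Return (negative_edges, has_cycle) for a weighted directed graph.

    Hypothetical helper for illustration only.
    edges: iterable of (src, dst, weight) tuples.
    """
    edges = list(edges)

    # Any edge with a negative weight would produce negative critical
    # path numbers, so collect them all for reporting.
    negative_edges = [(u, v, w) for u, v, w in edges if w < 0]

    # Kahn's algorithm: a directed graph is acyclic if and only if every
    # node can be removed in topological order.
    successors = defaultdict(list)
    in_degree = defaultdict(int)
    nodes = set()
    for u, v, _ in edges:
        successors[u].append(v)
        in_degree[v] += 1
        nodes.update((u, v))

    ready = deque(n for n in nodes if in_degree[n] == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for nxt in successors[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)

    # Nodes that were never removed sit on (or downstream of) a cycle.
    has_cycle = visited != len(nodes)
    return negative_edges, has_cycle


good = [("a", "b", 5), ("b", "c", 3)]
bad = [("a", "b", 5), ("b", "c", -2), ("c", "a", 1)]
print(validate_graph(good))  # ([], False)
print(validate_graph(bad))   # ([('b', 'c', -2)], True)
```

A unit test can then assert that the negative-edge list is empty and that no cycle is reported, which is the kind of assertion that would fail as soon as a bad edge is re-introduced.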
Before submitting