Introduction
Use PyTorch's autograd framework to get the execution graphs and traces.
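For example, the execution graph is reachable from any result tensor through its grad_fn chain (a tiny illustrative snippet, not project code):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = (x * 2).sum()
print(y.grad_fn)                 # <SumBackward0 object at ...>
print(y.grad_fn.next_functions)  # ((<MulBackward0 object at ...>, 0),)
```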
How to use profiler
Add the --trace argument to enable trace mode.
Use nsys to profile.
Your command should look like this: nsys profile -w true -t cuda,nvtx -s none -o test_report -f true -x true python mlora.py --base_model /data/Llama-2-7b-hf/ --load_8bit --device "cuda:0" --config ./config/dummy.json --trace
Use NVIDIA Nsight Systems to analyze the profile file (I will later provide a CLI version that automatically generates some important summaries).
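As a rough illustration of what trace mode can add to the timeline: the command above enables -t cuda,nvtx, so any NVTX ranges emitted from Python show up in Nsight Systems. torch.cuda.nvtx is a standard PyTorch API; the wrapper function here is only illustrative:

```python
import torch
import torch.cuda.nvtx as nvtx

def forward_with_range(model, batch):
    # Open a named range; it appears on the NVTX row of the nsys timeline
    # because the profiling command above passes -t cuda,nvtx.
    nvtx.range_push("forward")
    output = model(batch)
    nvtx.range_pop()
    return output
```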
How to use traceviz
NOTE: this is for debugging only, so you need to add the function yourself (I prefer to add it at recompute.py:70, as in the sketch below).
Use the 'trace' function to get the dot file (enabling trace mode also produces the range information for the profiler).
Use dot to convert the text file to a graph: dot -Tsvg AddBackward0 -o graph.svg.
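A minimal sketch of what such a trace function could look like, assuming it simply walks grad_fn.next_functions and writes Graphviz text (the real traceviz code may differ):

```python
import torch

def trace(tensor: torch.Tensor, path: str) -> None:
    """Dump the autograd graph behind `tensor` as a Graphviz dot file."""
    nodes, edges, seen = {}, [], set()

    def walk(fn):
        if fn is None or fn in seen:
            return
        seen.add(fn)
        nodes[id(fn)] = type(fn).__name__  # e.g. AddBackward0, AccumulateGrad
        for child, _ in fn.next_functions:
            if child is not None:
                edges.append((id(child), id(fn)))
                walk(child)

    walk(tensor.grad_fn)
    with open(path, "w") as f:
        f.write("digraph g {\n")
        for nid, name in nodes.items():
            f.write(f'  n{nid} [label="{name}"];\n')
        for src, dst in edges:
            f.write(f"  n{src} -> n{dst};\n")
        f.write("}\n")

# e.g. at recompute.py:70:  trace(loss, "AddBackward0")
# then convert it:          dot -Tsvg AddBackward0 -o graph.svg
```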
Implementation Details
It's difficult to use PyTorch's profiler tool for performance measurement because it lacks an effective tracepoint function to distinguish different latencies.
So we must tag the computational graph during the forward pass (and also tag the corresponding backward pass).
Due to PyTorch's usage of JIT (Just-In-Time) technology to generate operators, there are currently no hook functions available to tag these operators during execution. Therefore, we need to add some global information in the backward propagation hook function to track the executed operators.
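A rough sketch of that idea (the helper name is hypothetical, not the project's actual code): tag each forward result, and let its backward hook update a global record and emit an NVTX event so the backward operators can be attributed in the nsys timeline:

```python
import torch
import torch.cuda.nvtx as nvtx

executed = []  # global record of which tagged stages ran during backward

def tag(name: str, tensor: torch.Tensor) -> torch.Tensor:
    """Tag a forward result so its backward computation can be attributed."""
    def backward_hook(grad):
        executed.append(name)          # update the global information
        nvtx.mark(f"backward:{name}")  # emit an event into the nsys timeline
        return grad

    if tensor.requires_grad:
        tensor.register_hook(backward_hook)
    return tensor

# usage during the forward pass (block names are illustrative):
# h = tag("attention", attention_block(h))
# h = tag("mlp", mlp_block(h))
```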