TUDB-Labs / mLoRA

An Efficient "Factory" to Build Multiple LoRA Adapters
Apache License 2.0

feature: add the profiler #189

Closed · yezhengmao1 closed this 8 months ago

yezhengmao1 commented 8 months ago

Introduction

  1. Use CUDA NVTX to profile mLoRA's performance.
  2. Use PyTorch's autograd framework to capture the execution graphs and traces.

How to use profiler

  1. Add the --trace argument to enable trace mode.
  2. Use nsys to profile. Your command will look like this: `nsys profile -w true -t cuda,nvtx -s none -o test_report -f true -x true python mlora.py --base_model /data/Llama-2-7b-hf/ --load_8bit --device "cuda:0" --config ./config/dummy.json --trace`
  3. Use NVIDIA Nsight Systems to analyze the report file (I will later provide a CLI version that automatically generates some important summaries).
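For context, the named ranges that appear on the Nsight Systems timeline come from NVTX markers. Below is a minimal sketch of how such markers are emitted with PyTorch's `torch.cuda.nvtx` API; the wrapper function and range name are illustrative, not mLoRA's actual code:

```python
import torch
import torch.cuda.nvtx as nvtx


def forward_with_range(model: torch.nn.Module, batch: torch.Tensor) -> torch.Tensor:
    # Every kernel launched between range_push and range_pop is attributed
    # to the named range in the nsys / Nsight Systems timeline.
    nvtx.range_push("mlora:forward")  # hypothetical range name
    try:
        return model(batch)
    finally:
        nvtx.range_pop()
```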

How to use traceviz

NOTE: use this only for debugging, so you need to add the function call yourself (I prefer to add it at recompute.py:70; see the sketch after the list below).

  1. Use the trace function to generate the dot file (enabling trace mode also captures the profiler's range information).
  2. Use dot to convert the text file to a graph: `dot -Tsvg AddBackward0 -o graph.svg`.
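The PR's trace function itself is not shown in this issue. As a rough approximation of the idea, a dot file can be produced by walking the autograd graph through `grad_fn.next_functions`; the function name and DOT layout below are assumptions, not the PR's implementation:

```python
import torch


def trace(root: torch.Tensor, path: str = "graph.dot") -> None:
    # Walk the autograd graph backwards from `root` and emit one DOT edge
    # per (node -> next node) link, labeling nodes by grad_fn class name.
    seen = set()
    lines = ["digraph autograd {"]

    def walk(fn) -> None:
        if fn is None or fn in seen:
            return
        seen.add(fn)
        for next_fn, _ in fn.next_functions:
            if next_fn is not None:
                lines.append(
                    f'  "{type(fn).__name__}_{id(fn)}" -> '
                    f'"{type(next_fn).__name__}_{id(next_fn)}";'
                )
                walk(next_fn)

    walk(root.grad_fn)
    lines.append("}")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

The resulting file can then be rendered exactly as in step 2, e.g. `dot -Tsvg graph.dot -o graph.svg`.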

Implementation Details

It is difficult to use PyTorch's profiler tool for this performance measurement because it lacks an effective tracepoint function for distinguishing the different latencies.

So we must tag the computational graph during the forward pass (and also tag the corresponding backward pass).

Because PyTorch uses JIT (Just-In-Time) compilation to generate operators, there are currently no hook functions available to tag these operators during execution. Therefore, we need to add some global state in the backward-propagation hook functions to track which operators have executed.
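A minimal sketch of this idea using tensor gradient hooks (a long-standing PyTorch API): during backward, the gradient of a layer's output is produced before the gradient of its input, so a push hook on the output and a pop hook on the input bracket that layer's backward work with an NVTX range. The helper name is hypothetical, and this is only an approximation of the approach, not the PR's code:

```python
import torch
import torch.cuda.nvtx as nvtx


def tag_backward(name: str, inp: torch.Tensor, out: torch.Tensor) -> None:
    # Gradient hooks fire when the gradient w.r.t. the hooked tensor is
    # computed. out's gradient is ready before inp's, so the push/pop pair
    # brackets the backward computation of everything between inp and out.
    def push(grad: torch.Tensor):
        nvtx.range_push(f"backward:{name}")  # hypothetical range name
        return None  # leave the gradient unchanged

    def pop(grad: torch.Tensor):
        nvtx.range_pop()
        return None

    out.register_hook(push)
    inp.register_hook(pop)


# Usage during the forward pass (both tensors must require grad):
#   y = layer(x)
#   tag_backward("layer0", x, y)
```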