TUDB-Labs / mLoRA

An Efficient "Factory" to Build Multiple LoRA Adapters
Apache License 2.0

feature: add the profiler #189

Closed · yezhengmao1 closed this 8 months ago

yezhengmao1 commented 8 months ago

Introduction

  1. Use CUDA NVTX to profile mLoRA's performance.
  2. Use PyTorch's autograd framework to capture the execution graphs and traces.

How to use profiler

  1. Add the --trace argument to enable trace mode.
  2. Use nsys to profile. Your command will look like this: `nsys profile -w true -t cuda,nvtx -s none -o test_report -f true -x true python mlora.py --base_model /data/Llama-2-7b-hf/ --load_8bit --device "cuda:0" --config ./config/dummy.json --trace`
  3. Use NVIDIA Nsight Systems to analyze the report file (I will later provide a CLI version that automatically generates some important summaries).
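For context, the named ranges that appear on the Nsight Systems timeline come from NVTX markers. Below is a minimal sketch of how such markers are emitted with PyTorch's `torch.cuda.nvtx` API; the wrapper function and range name are illustrative, not mLoRA's actual code:

```python
import torch
import torch.cuda.nvtx as nvtx


def forward_with_range(model: torch.nn.Module, batch: torch.Tensor) -> torch.Tensor:
    # Every kernel launched between range_push and range_pop is attributed
    # to the named range in the nsys / Nsight Systems timeline.
    nvtx.range_push("mlora:forward")  # hypothetical range name
    try:
        return model(batch)
    finally:
        nvtx.range_pop()
```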

How to use traceviz

NOTE: use this only for debugging, so you need to add the function call yourself (I prefer to add it at recompute.py:70; see the sketch after the list below).

  1. Use the trace function to generate the dot file (enabling trace mode also captures the profiler's range information).
  2. Use dot to convert the text file to a graph: `dot -Tsvg AddBackward0 -o graph.svg`.
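The PR's trace function itself is not shown in this issue. As a rough approximation of the idea, a dot file can be produced by walking the autograd graph through `grad_fn.next_functions`; the function name and DOT layout below are assumptions, not the PR's implementation:

```python
import torch


def trace(root: torch.Tensor, path: str = "graph.dot") -> None:
    # Walk the autograd graph backwards from `root` and emit one DOT edge
    # per (node -> next node) link, labeling nodes by grad_fn class name.
    seen = set()
    lines = ["digraph autograd {"]

    def walk(fn) -> None:
        if fn is None or fn in seen:
            return
        seen.add(fn)
        for next_fn, _ in fn.next_functions:
            if next_fn is not None:
                lines.append(
                    f'  "{type(fn).__name__}_{id(fn)}" -> '
                    f'"{type(next_fn).__name__}_{id(next_fn)}";'
                )
                walk(next_fn)

    walk(root.grad_fn)
    lines.append("}")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

The resulting file can then be rendered exactly as in step 2, e.g. `dot -Tsvg graph.dot -o graph.svg`.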

Implementation Details

It is difficult to use PyTorch's profiler tool for this performance measurement because it lacks an effective tracepoint function for distinguishing the different latencies.

So we must tag the computational graph during the forward pass (and also tag the corresponding backward pass).

Because PyTorch uses JIT (Just-In-Time) compilation to generate operators, there are currently no hook functions available to tag these operators during execution. Therefore, we need to add some global state in the backward-propagation hook functions to track which operators have executed.
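A minimal sketch of this idea using tensor gradient hooks (a long-standing PyTorch API): during backward, the gradient of a layer's output is produced before the gradient of its input, so a push hook on the output and a pop hook on the input bracket that layer's backward work with an NVTX range. The helper name is hypothetical, and this is only an approximation of the approach, not the PR's code:

```python
import torch
import torch.cuda.nvtx as nvtx


def tag_backward(name: str, inp: torch.Tensor, out: torch.Tensor) -> None:
    # Gradient hooks fire when the gradient w.r.t. the hooked tensor is
    # computed. out's gradient is ready before inp's, so the push/pop pair
    # brackets the backward computation of everything between inp and out.
    def push(grad: torch.Tensor):
        nvtx.range_push(f"backward:{name}")  # hypothetical range name
        return None  # leave the gradient unchanged

    def pop(grad: torch.Tensor):
        nvtx.range_pop()
        return None

    out.register_hook(push)
    inp.register_hook(pop)


# Usage during the forward pass (both tensors must require grad):
#   y = layer(x)
#   tag_backward("layer0", x, y)
```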