foundation-model-stack / fms-acceleration

🚀 Collection of libraries used with fms-hf-tuning to accelerate fine-tuning and training of large models.
Apache License 2.0

Provide Memory Benchmarking Feature to Benchmarking Code #14

Closed achew010 closed 1 month ago

achew010 commented 1 month ago

Description

This PR adds GPU memory logging features to the benchmark script according to #8 and an updated benchmark README for usage instructions.

There are two approaches to logging memory, described under Usage below.

Note: Issue #19 was created to address grouping the memory values under a common prefix; this will be addressed in the future.

Usage

1. Nvidia's SMI CLI tool

Set environment variable MEMORY_LOGGING=nvidia to use run_benchmarks.sh with nvidia logging

For each experiment, the experiment directory will contain a GPU log with the fields <Timestamp>, <GPU Name>, <GPU ID>, <GPU Memory Used>.

The memory readings are reflected in the results raw_summary.csv under the column 'nvidia_mem_reserved', with raw values reported in MiB.
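For illustration, here is a minimal sketch of how such a per-experiment GPU log could be collected by polling nvidia-smi (the query fields mirror the log format above; the actual mechanism used by run_benchmarks.sh may differ):

```python
# Illustrative sketch only (not necessarily what run_benchmarks.sh does):
# periodically poll nvidia-smi and append readings with the fields
# <Timestamp>, <GPU Name>, <GPU ID>, <GPU Memory Used> (memory.used is in MiB).
import subprocess
import time

def poll_gpu_memory(log_path: str, interval_s: float = 5.0, duration_s: float = 300.0) -> None:
    end_time = time.time() + duration_s
    with open(log_path, "a") as log_file:
        while time.time() < end_time:
            reading = subprocess.check_output(
                [
                    "nvidia-smi",
                    "--query-gpu=timestamp,name,index,memory.used",
                    "--format=csv,noheader",
                ],
                text=True,
            )
            log_file.write(reading)
            log_file.flush()
            time.sleep(interval_s)

# Example: poll_gpu_memory("experiment_dir/gpu_memory.log")
```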

2. Torch CUDA through Huggingface's HFTrainer's API

Set environment variable MEMORY_LOGGING=huggingface to use run_benchmarks.sh with huggingface logging (default)

HFTrainer has a feature to log memory through the skip_memory_metrics=False training argument. Its documentation mentions that setting this argument to False will affect training speed; in our tests so far (below), we do not see a significant difference in throughput (tokens/sec) when using it.

A set of fine-grained GPU readings will appear as additional columns in the results raw_summary.csv, with raw values reported in bytes.

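For reference, here is a minimal sketch of enabling these metrics directly through the Trainer API; the function and argument values are illustrative (this is not the benchmark script itself), and the model and dataset are assumed to be prepared by the caller:

```python
# Minimal sketch: enable HFTrainer's memory probes via skip_memory_metrics=False
# and return only the memory-related metrics (values are reported in bytes).
from transformers import Trainer, TrainingArguments

def train_with_memory_metrics(model, train_dataset, output_dir: str = "./out") -> dict:
    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=4,
        max_steps=50,
        skip_memory_metrics=False,  # default is True, i.e. no memory logging
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    result = trainer.train()
    # Keys such as init_mem_gpu_alloc_delta, train_mem_gpu_peaked_delta and
    # before_init_mem_gpu appear alongside the usual speed metrics.
    return {k: v for k, v in result.metrics.items() if "_mem_" in k}
```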

3. Log Both

Set environment variable MEMORY_LOGGING=all to use run_benchmarks.sh with both logging methods

4. Difference between Nvidia-SMI Utility and Torch CUDA through HFTrainer API

1. The Nvidia-SMI utility is a coarse measurement tool that captures anything that takes up GPU memory. It is simple and non-intrusive, as it does not involve probing the trainer. It uses the NVML library to fetch the reserved memory for each device ID.

Note: To get accurate measurements, no other processes should be running on the device apart from the target process itself.

2. The HFTrainer API is a more precise tool that logs memory usage for a few operations inside HFTrainer.

It uses torch.cuda.memory_allocated to probe the trainer, taking snapshots of allocated memory and storing the difference between the values before and after each probed stage (e.g. the init and train stages visible in the metric keys below); a simplified sketch follows the note below.

Note: Any GPU memory accessed and used outside these stages, or by code that is not part of HFTrainer, will not be tracked. If the training script does not use the Huggingface trainer, this API will not work either.
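A simplified sketch of this snapshot/delta idea (HFTrainer's internal memory tracker is more elaborate; this only illustrates the mechanism):

```python
# Simplified illustration of the snapshot/delta approach described above.
import torch

def measure_stage(stage_fn) -> tuple:
    """Run one probed stage and return (alloc_delta, peaked_delta) in bytes."""
    torch.cuda.reset_peak_memory_stats()
    mem_before = torch.cuda.memory_allocated()
    stage_fn()  # e.g. trainer initialization, or the training loop
    mem_after = torch.cuda.memory_allocated()
    alloc_delta = mem_after - mem_before  # memory still held after the stage
    # memory that was only transiently allocated during the stage
    peaked_delta = max(0, torch.cuda.max_memory_allocated() - mem_after)
    return alloc_delta, peaked_delta
```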

Note: Details on Memory Calculations from HFTrainer for GPTQ-LoRA + FSDP

This is an example of the memory values that HFTrainer will produce in the outputs of train()

output_metrics = {
    'train_runtime': 191.2491, 
    'train_samples_per_second': 0.209, 
    'train_steps_per_second': 0.052, 
    'train_tokens_per_second': 428.342, 
    'train_loss': 1.0627506256103516, 
    'init_mem_cpu_alloc_delta': 4096, 
    'init_mem_gpu_alloc_delta': 0, 
    'init_mem_cpu_peaked_delta': 0, 
    'init_mem_gpu_peaked_delta': 0, 
    'train_mem_cpu_alloc_delta': 839086080, 
    'train_mem_gpu_alloc_delta': -17491768832, 
    'train_mem_cpu_peaked_delta': 0, 
    'train_mem_gpu_peaked_delta': 26747825664, 
    'before_init_mem_cpu': 5513297920, 
    'before_init_mem_gpu': 36141687296, 
    'epoch': 0.01
}

We refer to the memory metric keys by their stage (before_init, init, train), in that order.

We currently compute the memory values in the report by taking the largest of sums. For example:

For allocated memory value

max([
  stage0_mem + stage1_allocated_delta, 
  stage0_mem + stage1_allocated_delta + stage2_allocated_delta,
  ...
])

For peak memory value

max([
  stage0_mem + stage1_allocated_delta + stage1_peaked_delta, 
  stage0_mem + stage1_allocated_delta + stage2_allocated_delta + stage2_peaked_delta,
  ...
])

Notice that we do not include stage0_mem alone when computing the max value. This is to avoid misleading comparisons between GPTQ-LoRA and other approaches that support low-memory mode; GPTQ-LoRA + FSDP currently does not support low-memory mode, as mentioned in #18.

The stage0_mem value of GPTQ-LoRA + FSDP will be larger, since the model is loaded fully before the trainer is initialized and is only sharded afterwards, internally in trainer.prepare.

This could lead to misleading comparisons when other variants are loaded in low-memory mode and show smaller stage0_mem consumption than GPTQ-LoRA + FSDP before its sharding. Once low-memory mode is supported for GPTQ-LoRA, we will include stage0_mem back in the max computation.
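As a sketch, the "largest of sums" report values can be computed from the metric keys above, assuming stage0_mem corresponds to before_init_mem_gpu and that init and train are the probed stages (the returned key names are illustrative):

```python
# Sketch of the "largest of sums" calculation using the HFTrainer metric keys
# shown in the example output_metrics above.
def summarize_gpu_memory(metrics: dict) -> dict:
    stage0_mem = metrics["before_init_mem_gpu"]
    stages = ["init", "train"]  # probed stages, per the example output

    alloc_candidates, peak_candidates = [], []
    running = stage0_mem
    for stage in stages:
        running += metrics[f"{stage}_mem_gpu_alloc_delta"]
        alloc_candidates.append(running)
        peak_candidates.append(running + metrics[f"{stage}_mem_gpu_peaked_delta"])

    return {
        "torch_mem_alloc": max(alloc_candidates),      # allocated memory value
        "peak_torch_mem_alloc": max(peak_candidates),  # peak memory value
    }

# e.g. summarize_gpu_memory(output_metrics)
```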

Tests

Memory Measurement Accuracy and Potential Side Effects

1. No Significant Slowdown From Using HFTrainer Memory Probes API on QLoRA Training

For both the Mistral 7B and Mixtral models, introducing the memory probes does not show a significant impact on the throughput of the training run (50 steps). Generally, with larger batch sizes and models, the overhead of memory logging becomes insignificant.

A. <100 toks/sec slowdown after introducing the memory probes for Mistral:

| model_name_or_path | num gpus | per device batch size | throughput with no mem probe (toks/sec) | throughput with mem probe (toks/sec) |
|---|---|---|---|---|
| mistralai/Mistral-7B-v0.1 | 1 | 4 | 3465 | 3432 |
| mistralai/Mistral-7B-v0.1 | 2 | 2 | 2973 | 2931 |
| mistralai/Mistral-7B-v0.1 | 1 | 8 | 3489 | 3508 |
| mistralai/Mistral-7B-v0.1 | 2 | 4 | 3383 | 3298 |

B. <100 toks/sec slowdown after introducing the memory probes for Mixtral:

| model_name_or_path | num gpus | per device batch size | throughput with no mem probe (toks/sec) | throughput with mem probe (toks/sec) |
|---|---|---|---|---|
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 1 | 4 | 1785 | 1776 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 | 2 | 1518 | 1442 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 1 | 8 | 1938 | 1933 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 | 4 | 1757 | 1724 |

2. Torch/HF provides more granular memory readings, reporting both peak memory and actual allocated memory, whereas Nvidia only reports reserved memory. This is more helpful when analyzing the actual memory allocated for each model.

We compare the two memory tracking methods (Nvidia vs Torch/HF) on single devices for both GPTQ-LoRA and QLoRA. Nvidia's peak mem reserved reports larger values than Torch/HF's peak torch mem alloc, while torch mem alloc shows that the actual allocated memory is lower still.

| model_name_or_path | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) |
|---|---|---|---|---|---|
| mistralai/Mistral-7B-v0.1 | 1 | 4 | 19.46 | 15.86 | 4.84 |
| TheBloke/Mistral-7B-v0.1-GPTQ | 1 | 4 | 19.97 | 15.89 | 4.87 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 1 | 4 | 37.49 | 36.22 | 25.2 |
| TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | 1 | 4 | 36.59 | 35.53 | 24.51 |
| NousResearch/Llama-2-70b-hf | 1 | 4 | 71.12 | 68.16 | 37.35 |
| TheBloke/Llama-2-70B-GPTQ | 1 | 4 | 70.51 | 65.9 | 36.29 |

3. Memory Usage Decreases on Distributed Finetuning

When running large models on multiple devices, torch mem alloc shows that memory usage decreases as the models are sharded (compare with the table above).

| model_name_or_path | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) |
|---|---|---|---|---|---|
| mistralai/Mistral-7B-v0.1 | 2 | 4 | 20.97 | 16.59 | 2.73 |
| TheBloke/Mistral-7B-v0.1-GPTQ | 2 | 4 | 23.75 | 16.26 | 3.01 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 | 4 | 32.59 | 29.33 | 13.22 |
| TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | 2 | 4 | 51.1 | 27.74 | 12.79 |

We verified that torch mem alloc for GPTQ-LoRA on Llama2-70B hovers around 19 GiB once sharded after trainer.prepare and during training. The values are similar to the manually probed values from #15.

| model_name_or_path | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) |
|---|---|---|---|---|---|
| NousResearch/Llama-2-70b-hf | 2 | 2 | 51.49 | 46.52 | 19.17 |
| TheBloke/Llama-2-70B-GPTQ | 2 | 2 | 78.69 | 45.4 | 18.65 |

Benchmarks

Run tox -e run_benches to produce benchmarks. Full benchmark details can be referenced here

4. For small models, LoRA runs faster than the quantized PEFT methods, likely because it does not require an additional dequantization operation before the base layer + LoRA matmuls. That said, we also observe that it consumes significantly more memory than the quantized PEFT methods.

| model_name_or_path | Training Type | Accel. Config Type | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|---|
| mistralai/Mistral-7B-v0.1 | lora | none | 1 | 4 | 29.03 | 26.11 | 15.12 | 3597 |
| mistralai/Mistral-7B-v0.1 | lora | accelerated-peft-bnb | 1 | 4 | 19.46 | 15.86 | 4.84 | 3428 |
| TheBloke/Mistral-7B-v0.1-GPTQ | lora | accelerated-peft-autogptq | 1 | 4 | 19.97 | 15.89 | 4.87 | 3254 |

5. We observe that in single-device finetuning of larger models (e.g. Mixtral 8x7B), plain PEFT begins to run out of memory, while the quantized PEFT methods continue to maintain low memory consumption.

| model_name_or_path | Training Type | Accel. Config Type | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|---|
| mistralai/Mixtral-8x7B-Instruct-v0.1 | none | none | 1 | 4 | 79.14 | OOM | OOM | OOM |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | lora | none | 1 | 4 | 79.06 | OOM | OOM | OOM |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | lora | baseline-peft-bnb | 1 | 4 | 47.18 | 46.42 | 25.73 | 1396 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | lora | accelerated-peft-bnb | 1 | 4 | 37.5 | 36.22 | 25.2 | 1764 |
| TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | lora | accelerated-peft-autogptq | 1 | 4 | 36.58 | 35.53 | 24.51 | 1864 |

6. In distributed finetuning of large models like Llama2-70B, GPTQ-LoRA shows the lowest memory consumption at comparable throughput.

| model_name_or_path | Training Type | Accel. Config Type | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|---|
| NousResearch/Llama-2-70b-hf | lora | none | 2 | 2 | 79 | OOM | OOM | OOM |
| NousResearch/Llama-2-70b-hf | lora | accelerated-peft-bnb | 2 | 2 | 51.4 | 46.52 | 19.17 | 418 |
| TheBloke/Llama-2-70B-GPTQ | lora | accelerated-peft-autogptq | 2 | 2 | 78.5 | 45.4 | 18.65 | 426 |

7. When the batch size is increased, GPTQ-LoRA is the only experiment that does not run out of memory.

| model_name_or_path | Training Type | Accel. Config Type | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|---|
| NousResearch/Llama-2-70b-hf | lora | none | 2 | 4 | OOM | OOM | OOM | OOM |
| NousResearch/Llama-2-70b-hf | lora | accelerated-peft-bnb | 2 | 4 | OOM | OOM | OOM | OOM |
| TheBloke/Llama-2-70B-GPTQ | lora | accelerated-peft-autogptq | 2 | 4 | 78.48 | 70.67 | 18.65 | 451 |
fabianlim commented 1 month ago

Can you run tox -e lint from the top level directory? The linting is not automated yet (#7). Also, do we plan to activate the memory logging by default in run_benchmarks.sh?

achew010 commented 1 month ago

> Can you run tox -e lint from the top level directory? The linting is not automated yet (#7). Also, do we plan to activate the memory logging by default in run_benchmarks.sh?

@fabianlim okay linted. Do we want to set nvidia-smi as the default memory logging approach, or should we use the HF memory logging API instead once we establish that its speed degradation is insignificant?

fabianlim commented 1 month ago

> Okay linted. Do we want to set nvidia-smi as the default memory logging approach, or should we use the HF memory logging API instead once we establish that its speed degradation is insignificant?

@achew010 let's merge that commit on top of this one and let me review.

fabianlim commented 1 month ago

@achew010 I approved. After we update the csv we can merge. Also, can you run tox -e lint?

fabianlim commented 1 month ago

We have noted that the memory keys should be renamed; this will be addressed later in https://github.com/foundation-model-stack/fms-acceleration/issues/19

fabianlim commented 1 month ago

@achew010 can we move all the memory computation logic out of write_result into gather_report? That way results.json only holds the raw data, and gather_report holds all the logic to preprocess the data for human consumption.

fabianlim commented 1 month ago

@achew010 one more consideration: we should only have the huggingface memory probes in benchmark.csv. ~This is because command.sh cannot easily replay the nvidia-smi measurements.~ Actually there may be more issues, because results.json is not even properly populated by command.sh.

Or unless we have the tool do a proper replay and start nvidia-smi properly. Update: addressed in the commit below.