foundation-model-stack / fms-acceleration

🚀 Collection of libraries used with fms-hf-tuning to accelerate fine-tuning and training of large models.
Apache License 2.0

Provide Memory Benchmarking Feature to Benchmarking Code #14

Closed achew010 closed 1 month ago

achew010 commented 1 month ago

Description

This PR adds GPU memory logging features to the benchmark script according to #8 and an updated benchmark README for usage instructions.

There are two approaches to logging memory, described under Usage below.

Note: Issue #19 was created to address grouping the memory values under a common prefix; this will be addressed in the future.

Usage

1. Nvidia's SMI CLI tool

Set environment variable MEMORY_LOGGING=nvidia to use run_benchmarks.sh with nvidia logging

For each experiment, the experiment directory will contain a GPU log with the fields <Timestamp>, <GPU Name>, <GPU ID>, <GPU Memory Used>.

The memory readings are reflected in the results raw_summary.csv under the column 'nvidia_mem_reserved', with raw values reported in MiB.
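For illustration, here is a minimal sketch of how such a per-experiment GPU log could be collected by polling nvidia-smi (the query fields mirror the log format above; the actual mechanism used by run_benchmarks.sh may differ):

```python
# Illustrative sketch only (not necessarily what run_benchmarks.sh does):
# periodically poll nvidia-smi and append readings with the fields
# <Timestamp>, <GPU Name>, <GPU ID>, <GPU Memory Used> (memory.used is in MiB).
import subprocess
import time

def poll_gpu_memory(log_path: str, interval_s: float = 5.0, duration_s: float = 300.0) -> None:
    end_time = time.time() + duration_s
    with open(log_path, "a") as log_file:
        while time.time() < end_time:
            reading = subprocess.check_output(
                [
                    "nvidia-smi",
                    "--query-gpu=timestamp,name,index,memory.used",
                    "--format=csv,noheader",
                ],
                text=True,
            )
            log_file.write(reading)
            log_file.flush()
            time.sleep(interval_s)

# Example: poll_gpu_memory("experiment_dir/gpu_memory.log")
```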

2. Torch CUDA through Huggingface's HFTrainer's API

Set environment variable MEMORY_LOGGING=huggingface to use run_benchmarks.sh with huggingface logging (default)

HFTrainer has a feature to log memory through the skip_memory_metrics=False training argument. Its documentation mentions that setting this argument to False will affect training speed; in our tests so far (below), we do not see a significant difference in throughput (tokens/sec) when using it.

A set of fine-grained GPU readings will appear as additional columns in the results raw_summary.csv, with raw values reported in bytes.

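For reference, here is a minimal sketch of enabling these metrics directly through the Trainer API; the function and argument values are illustrative (this is not the benchmark script itself), and the model and dataset are assumed to be prepared by the caller:

```python
# Minimal sketch: enable HFTrainer's memory probes via skip_memory_metrics=False
# and return only the memory-related metrics (values are reported in bytes).
from transformers import Trainer, TrainingArguments

def train_with_memory_metrics(model, train_dataset, output_dir: str = "./out") -> dict:
    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=4,
        max_steps=50,
        skip_memory_metrics=False,  # default is True, i.e. no memory logging
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    result = trainer.train()
    # Keys such as init_mem_gpu_alloc_delta, train_mem_gpu_peaked_delta and
    # before_init_mem_gpu appear alongside the usual speed metrics.
    return {k: v for k, v in result.metrics.items() if "_mem_" in k}
```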

3. Log Both

Set environment variable MEMORY_LOGGING=all to use run_benchmarks.sh with both logging methods

4. Difference between Nvidia-SMI Utility and Torch CUDA through HFTrainer API

1. The Nvidia-SMI utility is a coarse measurement tool that captures anything that takes up GPU memory. It is simple and non-intrusive, as it does not involve probing the trainer. It uses the NVML library to fetch the reserved memory for each device ID.

Note: To get accurate measurements, no other processes should be running on the device apart from the target process itself.

2. The HFTrainer API is a more precise tool that logs memory usage for a few operations inside HFTrainer.

It uses torch.cuda.memory_allocated to probe the trainer, taking snapshots of allocated memory and storing the difference between the values before and after each probed stage (e.g. the init and train stages visible in the metric keys below); a simplified sketch follows the note below.

Note: Any GPU memory accessed and used outside these stages, or by code that is not part of HFTrainer, will not be tracked. If the training script does not use the Huggingface trainer, this API will not work either.
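A simplified sketch of this snapshot/delta idea (HFTrainer's internal memory tracker is more elaborate; this only illustrates the mechanism):

```python
# Simplified illustration of the snapshot/delta approach described above.
import torch

def measure_stage(stage_fn) -> tuple:
    """Run one probed stage and return (alloc_delta, peaked_delta) in bytes."""
    torch.cuda.reset_peak_memory_stats()
    mem_before = torch.cuda.memory_allocated()
    stage_fn()  # e.g. trainer initialization, or the training loop
    mem_after = torch.cuda.memory_allocated()
    alloc_delta = mem_after - mem_before  # memory still held after the stage
    # memory that was only transiently allocated during the stage
    peaked_delta = max(0, torch.cuda.max_memory_allocated() - mem_after)
    return alloc_delta, peaked_delta
```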

Note: Details on Memory Calculations from HFTrainer for GPTQ-LoRA + FSDP

This is an example of the memory values that HFTrainer will produce in the outputs of train()

output_metrics = {
    'train_runtime': 191.2491, 
    'train_samples_per_second': 0.209, 
    'train_steps_per_second': 0.052, 
    'train_tokens_per_second': 428.342, 
    'train_loss': 1.0627506256103516, 
    'init_mem_cpu_alloc_delta': 4096, 
    'init_mem_gpu_alloc_delta': 0, 
    'init_mem_cpu_peaked_delta': 0, 
    'init_mem_gpu_peaked_delta': 0, 
    'train_mem_cpu_alloc_delta': 839086080, 
    'train_mem_gpu_alloc_delta': -17491768832, 
    'train_mem_cpu_peaked_delta': 0, 
    'train_mem_gpu_peaked_delta': 26747825664, 
    'before_init_mem_cpu': 5513297920, 
    'before_init_mem_gpu': 36141687296, 
    'epoch': 0.01
}

We refer to the memory metric keys by their stage (before_init, init, train), in that order.

We currently compute the memory values in the report by taking the largest of sums. For example:

For allocated memory value

max([
  stage0_mem + stage1_allocated_delta, 
  stage0_mem + stage1_allocated_delta + stage2_allocated_delta,
  ...
])

For peak memory value

max([
  stage0_mem + stage1_allocated_delta + stage1_peaked_delta, 
  stage0_mem + stage1_allocated_delta + stage2_allocated_delta + stage2_peaked_delta,
  ...
])

Notice that we do not include stage0_mem alone when computing the max value. This is to avoid misleading comparisons between GPTQ-LoRA and other approaches that support low-memory mode; GPTQ-LoRA + FSDP currently does not support low-memory mode, as mentioned in #18.

The stage0_mem value of GPTQ-LoRA + FSDP will be larger, since the model is loaded fully before the trainer is initialized and is only sharded afterwards, internally in trainer.prepare.

This could lead to misleading comparisons when other variants are loaded in low-memory mode and show smaller stage0_mem consumption than GPTQ-LoRA + FSDP before its sharding. Once low-memory mode is supported for GPTQ-LoRA, we will include stage0_mem back in the max computation.
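As a sketch, the "largest of sums" report values can be computed from the metric keys above, assuming stage0_mem corresponds to before_init_mem_gpu and that init and train are the probed stages (the returned key names are illustrative):

```python
# Sketch of the "largest of sums" calculation using the HFTrainer metric keys
# shown in the example output_metrics above.
def summarize_gpu_memory(metrics: dict) -> dict:
    stage0_mem = metrics["before_init_mem_gpu"]
    stages = ["init", "train"]  # probed stages, per the example output

    alloc_candidates, peak_candidates = [], []
    running = stage0_mem
    for stage in stages:
        running += metrics[f"{stage}_mem_gpu_alloc_delta"]
        alloc_candidates.append(running)
        peak_candidates.append(running + metrics[f"{stage}_mem_gpu_peaked_delta"])

    return {
        "torch_mem_alloc": max(alloc_candidates),      # allocated memory value
        "peak_torch_mem_alloc": max(peak_candidates),  # peak memory value
    }

# e.g. summarize_gpu_memory(output_metrics)
```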

Tests

Memory Measurement Accuracy and Potential Side Effects

1. No Significant Slowdown From Using HFTrainer Memory Probes API on QLoRA Training

For both the Mistral 7B and Mixtral models, introducing the memory probes does not show a significant impact on the throughput of the training run (50 steps). Generally, with larger batch sizes and models, the overhead of memory logging becomes insignificant.

A. <100 toks/sec slowdown after introducing the memory probes for Mistral:

| model_name_or_path | num gpus | per device batch size | throughput with no mem probe (toks/sec) | throughput with mem probe (toks/sec) |
|---|---|---|---|---|
| mistralai/Mistral-7B-v0.1 | 1 | 4 | 3465 | 3432 |
| mistralai/Mistral-7B-v0.1 | 2 | 2 | 2973 | 2931 |
| mistralai/Mistral-7B-v0.1 | 1 | 8 | 3489 | 3508 |
| mistralai/Mistral-7B-v0.1 | 2 | 4 | 3383 | 3298 |

B. <100 toks/sec slowdown after introducing the memory probes for Mixtral:

| model_name_or_path | num gpus | per device batch size | throughput with no mem probe (toks/sec) | throughput with mem probe (toks/sec) |
|---|---|---|---|---|
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 1 | 4 | 1785 | 1776 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 | 2 | 1518 | 1442 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 1 | 8 | 1938 | 1933 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 | 4 | 1757 | 1724 |

2. Torch/HF provides more granular memory readings, reporting both peak memory and actual allocated memory, whereas Nvidia only reports reserved memory. This is more helpful when analyzing the actual memory allocated for each model.

We compare the two memory tracking methods (Nvidia vs Torch/HF) on single devices for both GPTQ-LoRA and QLoRA. Nvidia's peak mem reserved reports larger values than Torch/HF's peak torch mem alloc, while torch mem alloc shows that the actual allocated memory is lower still.

| model_name_or_path | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) |
|---|---|---|---|---|---|
| mistralai/Mistral-7B-v0.1 | 1 | 4 | 19.46 | 15.86 | 4.84 |
| TheBloke/Mistral-7B-v0.1-GPTQ | 1 | 4 | 19.97 | 15.89 | 4.87 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 1 | 4 | 37.49 | 36.22 | 25.2 |
| TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | 1 | 4 | 36.59 | 35.53 | 24.51 |
| NousResearch/Llama-2-70b-hf | 1 | 4 | 71.12 | 68.16 | 37.35 |
| TheBloke/Llama-2-70B-GPTQ | 1 | 4 | 70.51 | 65.9 | 36.29 |

3. Memory Usage Decreases on Distributed Finetuning

When running large models on multiple devices, torch mem alloc shows that memory usage decreases as the models are sharded (compare with the table above).

| model_name_or_path | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) |
|---|---|---|---|---|---|
| mistralai/Mistral-7B-v0.1 | 2 | 4 | 20.97 | 16.59 | 2.73 |
| TheBloke/Mistral-7B-v0.1-GPTQ | 2 | 4 | 23.75 | 16.26 | 3.01 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 | 4 | 32.59 | 29.33 | 13.22 |
| TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | 2 | 4 | 51.1 | 27.74 | 12.79 |

We verified that torch mem alloc for GPTQ-LoRA on Llama2-70B hovers around 19 GiB once sharded after trainer.prepare and during training. The values are similar to the manually probed values from #15.

| model_name_or_path | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) |
|---|---|---|---|---|---|
| NousResearch/Llama-2-70b-hf | 2 | 2 | 51.49 | 46.52 | 19.17 |
| TheBloke/Llama-2-70B-GPTQ | 2 | 2 | 78.69 | 45.4 | 18.65 |

Benchmarks

Run tox -e run_benches to produce benchmarks. Full benchmark details can be referenced here

4. For small models, LoRA runs faster than the quantized PEFT methods, likely because it does not require an additional dequantization operation before the base layer + LoRA matmuls. That said, we also observe that it consumes significantly more memory than the quantized PEFT methods.

| model_name_or_path | Training Type | Accel. Config Type | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|---|
| mistralai/Mistral-7B-v0.1 | lora | none | 1 | 4 | 29.03 | 26.11 | 15.12 | 3597 |
| mistralai/Mistral-7B-v0.1 | lora | accelerated-peft-bnb | 1 | 4 | 19.46 | 15.86 | 4.84 | 3428 |
| TheBloke/Mistral-7B-v0.1-GPTQ | lora | accelerated-peft-autogptq | 1 | 4 | 19.97 | 15.89 | 4.87 | 3254 |

5. We observe that in single-device finetuning of larger models (e.g. Mixtral 8x7B), plain PEFT begins to run out of memory, while the quantized PEFT methods continue to maintain low memory consumption.

| model_name_or_path | Training Type | Accel. Config Type | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|---|
| mistralai/Mixtral-8x7B-Instruct-v0.1 | none | none | 1 | 4 | 79.14 | OOM | OOM | OOM |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | lora | none | 1 | 4 | 79.06 | OOM | OOM | OOM |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | lora | baseline-peft-bnb | 1 | 4 | 47.18 | 46.42 | 25.73 | 1396 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | lora | accelerated-peft-bnb | 1 | 4 | 37.5 | 36.22 | 25.2 | 1764 |
| TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | lora | accelerated-peft-autogptq | 1 | 4 | 36.58 | 35.53 | 24.51 | 1864 |

6. In distributed finetuning of large models like Llama2-70B, GPTQ-LoRA shows the lowest memory consumption at comparable throughput.

| model_name_or_path | Training Type | Accel. Config Type | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|---|
| NousResearch/Llama-2-70b-hf | lora | none | 2 | 2 | 79 | OOM | OOM | OOM |
| NousResearch/Llama-2-70b-hf | lora | accelerated-peft-bnb | 2 | 2 | 51.4 | 46.52 | 19.17 | 418 |
| TheBloke/Llama-2-70B-GPTQ | lora | accelerated-peft-autogptq | 2 | 2 | 78.5 | 45.4 | 18.65 | 426 |

7. When the batch size is increased, GPTQ-LoRA is the only experiment that does not run out of memory.

| model_name_or_path | Training Type | Accel. Config Type | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|---|
| NousResearch/Llama-2-70b-hf | lora | none | 2 | 4 | OOM | OOM | OOM | OOM |
| NousResearch/Llama-2-70b-hf | lora | accelerated-peft-bnb | 2 | 4 | OOM | OOM | OOM | OOM |
| TheBloke/Llama-2-70B-GPTQ | lora | accelerated-peft-autogptq | 2 | 4 | 78.48 | 70.67 | 18.65 | 451 |
fabianlim commented 1 month ago

Can you run tox -e lint from the top level directory? The linting is not automated yet (#7). Also, do we plan to activate the memory logging by default in run_benchmarks.sh?

achew010 commented 1 month ago

> Can you run tox -e lint from the top level directory? The linting is not automated yet (#7). Also, do we plan to activate the memory logging by default in run_benchmarks.sh?

@fabianlim okay linted. Do we want to set nvidia-smi as the default memory logging approach, or should we use the HF memory logging API instead once we establish that its speed degradation is insignificant?

fabianlim commented 1 month ago

> Okay linted. Do we want to set nvidia-smi as the default memory logging approach, or should we use the HF memory logging API instead once we establish that its speed degradation is insignificant?

@achew010 let's merge that commit on top of this one and let me review.

fabianlim commented 1 month ago

@achew010 I approved. After we update the csv we can merge. Also, can you run tox -e lint?

fabianlim commented 1 month ago

We have noted that the memory keys should be renamed; this will be addressed later in https://github.com/foundation-model-stack/fms-acceleration/issues/19

fabianlim commented 1 month ago

@achew010 can we move all the memory computation logic out of write_result into gather_report? That way results.json only holds the raw data, and gather_report holds all the logic to preprocess the data for human consumption.

fabianlim commented 1 month ago

@achew010 one more consideration: we should only have the huggingface memory probes in benchmark.csv. ~This is because command.sh cannot easily replay the nvidia-smi measurements.~ Actually there may be more issues, because results.json is not even properly populated by command.sh.

Or unless we have the tool do a proper replay and start nvidia-smi properly. Update: addressed in the commit below.