Closed achew010 closed 1 month ago
can you run `tox -e lint` from the top level directory, the linting is not automated yet #7. Also do we plan to activate the memory logging by default in `run_benchmarks.sh`?
@fabianlim okay linted. Do we want to set `nvidia-smi` to be the default memory logging approach, or, once we establish that the speed degradation is insignificant, use the HF memory logging API instead?
@achew010 let's merge that commit on top of this one and let me review.
@achew010 I approved. After we update the csv we can merge. Also can you run a `tox -e lint`.
We have noted that the memory keys should be renamed; this will be addressed later in https://github.com/foundation-model-stack/fms-acceleration/issues/19
@achew010 can we move all the memory computation logic out of `write_result` into `gather_report`? That way `results.json` only holds the raw data, and `gather_report` can hold all the logic to preprocess the data for human consumption.
@achew010 also one more consideration: we should only have `huggingface` mem probes in the `benchmark.csv`. ~~This is because `command.sh` cannot easily replay the `nvidia-smi` measurements.~~ Actually maybe there are more issues, because `results.json` is not even properly populated by `command.sh`. Or unless we have the tool do a proper replay and start `nvidia-smi` properly. Update: addressed in the commit below.
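To illustrate the split suggested above, here is a hedged sketch in which `write_result` dumps only raw metrics to `results.json` and all human-facing preprocessing lives in `gather_report`. The function and key names are illustrative, not the repo's actual code.

```python
# Hypothetical sketch of the suggested refactor: results.json stays raw,
# gather_report derives the human-readable columns. Names are assumptions.
import json
import os
import tempfile

def write_result(path, raw_metrics):
    # results.json holds only raw data, no derived values
    with open(path, "w") as f:
        json.dump(raw_metrics, f)

def gather_report(paths):
    # all preprocessing for human consumption lives here
    rows = []
    for p in paths:
        with open(p) as f:
            raw = json.load(f)
        rows.append({
            "name": raw["name"],
            # derived column: convert raw MiB reading to GiB for the report
            "nvidia_mem_reserved_gib": raw["nvidia_mem_reserved_mib"] / 1024,
        })
    return rows

# demo: one experiment directory with a raw results.json
save_dir = tempfile.mkdtemp()
path = os.path.join(save_dir, "results.json")
write_result(path, {"name": "exp1", "nvidia_mem_reserved_mib": 2048})
print(gather_report([path]))  # -> [{'name': 'exp1', 'nvidia_mem_reserved_gib': 2.0}]
```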
Description
This PR adds GPU memory logging features to the benchmark script according to #8 and an updated benchmark README for usage instructions.
There are 2 approaches to logging memory.

Note: Issue #19 is created to address the grouping of memory values using a common prefix and will be addressed in the future.
Usage
1. Nvidia's SMI CLI tool

Set the environment variable `MEMORY_LOGGING=nvidia` to use `run_benchmarks.sh` with nvidia logging.

- For each experiment's `subprocess.run`, it will open an async `nvidia-smi` process to monitor only the GPU indices in `$CUDA_VISIBLE_DEVICES` and log to `FILE_MEM` inside `Experiment.save_dir`
- Once the `subprocess` call is completed, it terminates the async process
- `gpu_mem` is recorded in the main result logging function `Experiment.write_results`

Each experiment directory will have a gpu log that contains
`<Timestamp>, <GPU Name>, <GPU ID>, <GPU Memory Used>`
The memory readings will be reflected in the results `raw_summary.csv` under the column `'nvidia_mem_reserved'`, where the raw values are reported in MiB.
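As a concrete illustration of consuming this log, here is a hedged sketch that computes the peak reading for a column like `nvidia_mem_reserved`. The function name and the exact `"1234 MiB"` value format are assumptions, not the repo's code; only the four-field log layout is from this PR.

```python
# Hypothetical parser for the per-experiment GPU log described above.
# Log lines look like: <Timestamp>, <GPU Name>, <GPU ID>, <GPU Memory Used>
import csv
import io

def nvidia_mem_reserved(log_text: str) -> float:
    """Return the peak 'memory used' reading (MiB) across all log lines."""
    peak = 0.0
    for row in csv.reader(io.StringIO(log_text)):
        if len(row) < 4:
            continue  # skip malformed or empty lines
        mem_field = row[3].strip()           # e.g. "20480 MiB" (format assumed)
        value = float(mem_field.split()[0])  # drop the "MiB" unit suffix
        peak = max(peak, value)
    return peak

sample_log = (
    "2024/05/01 10:00:00.000, NVIDIA A100-SXM4-80GB, 0, 1024 MiB\n"
    "2024/05/01 10:00:01.000, NVIDIA A100-SXM4-80GB, 0, 20480 MiB\n"
    "2024/05/01 10:00:02.000, NVIDIA A100-SXM4-80GB, 0, 19456 MiB\n"
)
print(nvidia_mem_reserved(sample_log))  # -> 20480.0
```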
2. Torch CUDA through Huggingface's HFTrainer's API

Set the environment variable `MEMORY_LOGGING=huggingface` to use `run_benchmarks.sh` with huggingface logging (default).

HFTrainer has a feature to log memory through the `skip_memory_metrics=False` training argument. In their documentation, it is mentioned that setting this argument to `False` will affect training speed. In our tests so far (below), we do not see a significant difference in throughput (tokens/sec) when using this argument.

A set of fine-grained GPU readings will show as additional columns in the results `raw_summary.csv`, where the raw values are reported in bytes.
3. Log Both

Set the environment variable `MEMORY_LOGGING=all` to use `run_benchmarks.sh` with both logging methods.

4. Difference between Nvidia-SMI Utility and Torch CUDA through HFTrainer API
1. The Nvidia-SMI Utility is a coarse measurement tool that captures anything that takes up GPU memory. It is simple and non-intrusive, as it doesn't involve probing the trainer. It uses the NVML library to fetch reserved memory for each device ID -
Note: To get accurate measurements, no other processes should be running on the device apart from the target process itself.
2. The HFTrainer API is a more precise tool that logs memory usage for a couple of operations inside HFTrainer.

It uses `torch.cuda.memory_allocated` to probe the trainer by taking snapshots of allocated memory and storing the differences between the before and after of each stage. The following stages are probed: `Trainer.__init__`, `Trainer.train`, `Trainer.evaluate`, `Trainer.predict`.

Note: Any GPU memory accessed and used outside of these stages, or not part of HFTrainer, will not be tracked. If the train script does not use the Huggingface trainer, this API will not work either.
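The snapshot-and-delta idea behind these probes can be sketched as follows. The real HFTrainer reads `torch.cuda.memory_allocated()`; here the reader is injected so the sketch runs without a GPU, and the class and key names (mirroring the `before_init_mem_X` / `init_mem_X` / `train_mem_X` pattern) are assumptions, not HFTrainer's actual internals.

```python
# Minimal sketch of per-stage memory deltas, as done by HFTrainer when
# skip_memory_metrics=False. The memory reader is pluggable for testability.
class StageMemoryProbe:
    def __init__(self, read_allocated):
        self.read_allocated = read_allocated  # e.g. torch.cuda.memory_allocated
        self.metrics = {}

    def start(self, stage):
        # snapshot allocated memory before the stage runs
        self._stage = stage
        self._before = self.read_allocated()
        if "before_init_mem_gpu" not in self.metrics:
            self.metrics["before_init_mem_gpu"] = self._before

    def stop(self):
        # store the before/after difference for this stage
        delta = self.read_allocated() - self._before
        self.metrics[f"{self._stage}_mem_gpu_alloc_delta"] = delta

# Simulate allocations with a mutable counter standing in for the GPU.
allocated = {"bytes": 0}
probe = StageMemoryProbe(lambda: allocated["bytes"])

probe.start("init")
allocated["bytes"] += 500_000_000   # pretend model weights were allocated
probe.stop()

probe.start("train")
allocated["bytes"] += 200_000_000   # pretend optimizer state was allocated
probe.stop()

print(probe.metrics["init_mem_gpu_alloc_delta"])   # -> 500000000
print(probe.metrics["train_mem_gpu_alloc_delta"])  # -> 200000000
```

Memory allocated outside a `start`/`stop` pair goes unrecorded, which is exactly the caveat noted above.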
Note: Details on Memory Calculations from HFTrainer for GPTQ-LoRA + FSDP

This is an example of the memory values that HFTrainer will produce in the outputs of `train()`.

We refer to the keys of the memory metrics in this order:
- `before_init_mem_X` as stage0
- `init_mem_X` as stage1
- `train_mem_X` as stage2

We currently compute the memory values in the report by taking the largest of sums. For example:

For allocated memory value
For peak memory value
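The exact expressions are not reproduced in this description, so the following is a hedged sketch of the "largest of sums" idea, assuming cumulative sums over the stages (never stage0 alone); the function name is illustrative and the key names follow HFTrainer's `before_init_mem_gpu` / `*_alloc_delta` pattern.

```python
# Hedged sketch of the report's "largest of sums" allocated-memory value.
# The cumulative-sum form below is an assumption based on the surrounding text.
def reported_alloc_mem(m):
    stage0 = m["before_init_mem_gpu"]          # before trainer init
    stage1 = m["init_mem_gpu_alloc_delta"]     # Trainer.__init__ delta
    stage2 = m["train_mem_gpu_alloc_delta"]    # Trainer.train delta
    # largest of the cumulative sums; stage0 alone is deliberately excluded
    return max(stage0 + stage1, stage0 + stage1 + stage2)

metrics = {
    "before_init_mem_gpu": 40_000_000_000,        # e.g. full model loaded pre-shard
    "init_mem_gpu_alloc_delta": -20_000_000_000,  # sharding frees memory
    "train_mem_gpu_alloc_delta": 2_000_000_000,
}
print(reported_alloc_mem(metrics))  # -> 22000000000
```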
Notice that we do not include `stage0_mem` alone when computing the max value. This is to avoid misleading comparisons between GPTQ-LoRA and other approaches that support low-memory mode. GPTQ-LoRA + FSDP currently does not support low-memory mode, as mentioned in #18.

The `stage0_mem` value of GPTQ-LoRA + FSDP will reflect a larger value, as the model is loaded fully before the trainer is initialized and only subsequently sharded internally in `trainer.prepare`. This might cause misleading comparisons when other variants are loaded in low-memory mode and show smaller `stage0_mem` consumption than GPTQ-LoRA + FSDP before its sharding. Once low-memory mode is supported for GPTQ-LoRA, we will include `stage0_mem` back in the max computation.

Tests
Memory Measurement Accuracy and Potential Side Effects
1. No Significant Slowdown From Using HFTrainer Memory Probes API on QLoRA Training
In both the Mistral 7B and Mixtral models, introducing the memory probes does not show a significant impact on the throughput of the training run (50 steps). Generally, with larger batch sizes and models, the overhead of memory logging becomes insignificant.
A. <100 toks/sec slowdown after introducing the memory probes for Mistral:

(table: gpus · device batch size · throughput with no mem probe (toks/sec) · throughput with mem probe (toks/sec))
B. <100 toks/sec slowdown after introducing the memory probes for Mixtral:

(table: gpus · device batch size · throughput with no mem probe (toks/sec) · throughput with mem probe (toks/sec))
2. Torch/HF gives more granular memory usage, reporting both peak and actual allocated memory, than Nvidia's reserved memory. This is more helpful when analyzing the actual memory allocated for each model.
We compare the 2 memory tracking methods (Nvidia vs Torch/HF) on single devices for both GPTQ-LoRA and QLoRA. Nvidia's `peak mem reserved` reports larger values than Torch/HF's `peak mem alloc`, while `torch mem alloc` shows that the actual memory usage is lower.

(table: gpus · device batch size · peak mem reserved (GiB) · peak mem alloc (GiB) · torch mem alloc (GiB))
3. Memory Usage Decreases on Distributed Finetuning
`torch mem alloc` shows that memory usage decreases as the models are sharded (compared to the table above).

(table: gpus · device batch size · peak mem reserved (GiB) · peak mem alloc (GiB) · torch mem alloc (GiB))
Verified that `torch mem alloc` for GPTQ-LoRA on Llama2-70B hovers at 19 GiB when sharded after `trainer.prepare` and during training. The values are similar to the manually probed values from #15.

(table: gpus · device batch size · peak mem reserved (GiB) · peak mem alloc (GiB) · torch mem alloc (GiB))
Benchmarks
Run `tox -e run_benches` to produce benchmarks. Full benchmark details can be referenced here.

4. For small models, LoRA runs faster than the Quantized PEFT methods. One likely reason is that it doesn't require an additional dequantization operation before the base layer + LoRA matmuls. That said, we also observe that it consumes significantly more memory than the Quantized PEFT methods.

(table: type · config type · gpus · device batch size · memory (GiB) · throughput (toks/sec))
5. We observe that in single-device finetuning for larger models (e.g. 49B Mixtral), PEFT begins to run out of memory, while the Quantized PEFT methods continue to maintain low memory consumption.

(table: type · config type · gpus · device batch size · memory (GiB) · throughput (toks/sec))
6. In distributed finetuning for large models like Llama2-70B, GPTQ-LoRA shows the lowest memory consumption with the same throughput.

(table: type · config type · gpus · device batch size · memory (GiB) · throughput (toks/sec))
7. Increasing the batch size, GPTQ-LoRA is the only experiment that doesn't run out of memory.

(table: type · config type · gpus · device batch size · memory (GiB) · throughput (toks/sec))