Closed: Echozqn closed this issue 11 months ago
@Echozqn, the HBM memory efficiency is only used for latency estimation; the memory usage calculation does not depend on it.
In my case, I used nvidia-smi to monitor usage and it showed about 30 GB of VRAM, but the prediction came out to only 20 GB. Is there such a thing as a memory efficiency factor for VRAM?
@Echozqn The HBM memory efficiency is defined as the percentage of the theoretical memory bandwidth one can achieve, so it is not relevant to the memory size.
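To illustrate the distinction, here is a minimal sketch (not llm-analysis code; the function names and the peak-bandwidth constant for the A100-PCIe-40GB are my own illustrative choices): the efficiency factor scales the achievable bandwidth in a memory-bound latency estimate, while the memory footprint is just a sum of sizes in which the factor never appears.

```python
# Sketch only: shows why hbm_memory_efficiency changes latency estimates
# but not memory-size estimates.

A100_PCIE_PEAK_BW_GBPS = 1555  # A100-PCIe-40GB peak HBM bandwidth, GB/s

def memory_bound_latency_s(bytes_moved: float, hbm_memory_efficiency: float) -> float:
    # Achievable bandwidth = peak bandwidth * efficiency; lower efficiency
    # means the same bytes take longer to move.
    achievable_bw_bytes_per_s = A100_PCIE_PEAK_BW_GBPS * 1e9 * hbm_memory_efficiency
    return bytes_moved / achievable_bw_bytes_per_s

def memory_needed_bytes(weight: float, op_state: float, grad: float, act: float) -> float:
    # The footprint is a sum of component sizes; no efficiency term appears,
    # which is why 0.6 and 1.0 give identical memory_per_gpu numbers.
    return weight + op_state + grad + act
```

So running the tool with hbm_memory_efficiency 0.6 versus 1 should change latency outputs, and only those.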
Do you use flash attention in your run? The flash_attn flag defaults to True. This might cause the difference between the estimated memory size and the nvidia-smi output.
Also, the memory estimates aim to be close to torch.cuda.max_memory_allocated.
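A gap between the two readings is expected even without flash attention, because nvidia-smi reports the whole process footprint, not just live tensor allocations. The sketch below uses made-up component numbers purely for illustration (the function name and all figures are hypothetical, not measurements):

```python
# Illustrative accounting: why torch.cuda.max_memory_allocated can read
# ~20 GB while nvidia-smi shows ~30 GB for the same process.

def process_footprint_gb(allocated_gb: float,
                         allocator_cache_gb: float,
                         cuda_context_gb: float) -> float:
    # nvidia-smi sees: live tensor allocations, plus blocks the PyTorch
    # caching allocator holds but has not returned to the driver, plus
    # the CUDA context itself.
    return allocated_gb + allocator_cache_gb + cuda_context_gb

# With a peak allocation of 20.14 GB, and (hypothetically) 9 GB of cached
# allocator blocks and a ~1 GB CUDA context, the process footprint lands
# near the ~30 GB that nvidia-smi reports.
```

This is why the estimates target torch.cuda.max_memory_allocated rather than the nvidia-smi reading.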
Yes, I found that nvidia-smi measured inaccurately when flash_attn was enabled. Thanks to the author for the answer.
Describe the bug I tried two different values of hbm_memory_efficiency, 1 and 0.6, but ended up with the same value for (weight+op_state+grad+act)_memory_per_gpu. Is it possible that hbm_memory_efficiency is not being used in the code?
To Reproduce Steps to reproduce the behavior:
python -m llm_analysis.analysis train --model_name /hdd/echozhou/llm-analysis/examples/llama --gpu_name a100-pcie-40gb --activation_recomputation 1 --tp_size 1 --pp_size 3 --sp_size 1 --dp_size 1 --gradient_accumulation_steps 4 -b 16 --seq_len 1400 --total_num_gpus 3 --total_num_tokens 1e12 --activation_recomputation 2 --flops_efficiency 1 --hbm_memory_efficiency 0.6 --output_dir /hdd/echozhou/llm-analysis/examples/llama/test
python -m llm_analysis.analysis train --model_name /hdd/echozhou/llm-analysis/examples/llama --gpu_name a100-pcie-40gb --activation_recomputation 1 --tp_size 1 --pp_size 3 --sp_size 1 --dp_size 1 --gradient_accumulation_steps 4 -b 16 --seq_len 1400 --total_num_gpus 3 --total_num_tokens 1e12 --activation_recomputation 2 --flops_efficiency 1 --hbm_memory_efficiency 1 --output_dir /hdd/echozhou/llm-analysis/examples/llama/test
Expected behavior Both commands report the same final memory consumption, (weight+op_state+grad+act)_memory_per_gpu: 20.14 GB