cli99 / llm-analysis

Latency and Memory Analysis of Transformer Models for Training and Inference
Apache License 2.0

[BUG] Is it possible that hbm_memory_efficiency is not working in the code? #12

Closed. Echozqn closed this issue 11 months ago.

Echozqn commented 11 months ago

Describe the bug I tried two different values of hbm_memory_efficiency, 1 and 0.6, but ended up with the same (weight+op_state+grad+act)_memory_per_gpu. Is it possible that hbm_memory_efficiency is not working in the code?

To Reproduce Steps to reproduce the behavior:

  1. python -m llm_analysis.analysis train --model_name /hdd/echozhou/llm-analysis/examples/llama --gpu_name a100-pcie-40gb --activation_recomputation 1 --tp_size 1 --pp_size 3 --sp_size 1 --dp_size 1 --gradient_accumulation_steps 4 -b 16 --seq_len 1400 --total_num_gpus 3 --total_num_tokens 1e12 --activation_recomputation 2 --flops_efficiency 1 --hbm_memory_efficiency 0.6 --output_dir /hdd/echozhou/llm-analysis/examples/llama/test
  2. python -m llm_analysis.analysis train --model_name /hdd/echozhou/llm-analysis/examples/llama --gpu_name a100-pcie-40gb --activation_recomputation 1 --tp_size 1 --pp_size 3 --sp_size 1 --dp_size 1 --gradient_accumulation_steps 4 -b 16 --seq_len 1400 --total_num_gpus 3 --total_num_tokens 1e12 --activation_recomputation 2 --flops_efficiency 1 --hbm_memory_efficiency 1 --output_dir /hdd/echozhou/llm-analysis/examples/llama/test

Expected behavior Both runs report the same memory consumption, (weight+op_state+grad+act)_memory_per_gpu: 20.14 GB.

cli99 commented 11 months ago

@Echozqn , the hbm memory efficiency is only needed for latency estimation. The memory usage calculation does not need this value.
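A minimal sketch of how this efficiency factor enters a memory-bound latency estimate (illustrative only, not the exact formulas used in llm-analysis; the bandwidth figure is the nominal A100-PCIe-40GB spec and the byte count is just an example):

```python
# Minimal sketch: hbm_memory_efficiency scales achievable bandwidth, and
# therefore the latency of memory-bound ops; it never changes how many
# bytes need to be stored, so the memory-size estimates are unaffected.
GB = 1024**3

peak_hbm_bandwidth = 1555e9        # bytes/s, nominal A100-PCIe-40GB HBM bandwidth (assumed)
hbm_memory_efficiency = 0.6        # fraction of peak bandwidth actually achieved
bytes_accessed = 20.14 * GB        # illustrative data volume for a memory-bound op

achieved_bandwidth = peak_hbm_bandwidth * hbm_memory_efficiency
latency_s = bytes_accessed / achieved_bandwidth
print(f"memory-bound latency: {latency_s * 1e3:.2f} ms")   # ~23 ms at 0.6, ~14 ms at 1.0
```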

Echozqn commented 11 months ago

In my case, I monitored the run with nvidia-smi and it showed about 30 GB of VRAM in use, but the prediction came out to only 20 GB. Is there such a thing as a memory efficiency factor for VRAM capacity?

cli99 commented 11 months ago

@Echozqn The hbm memory efficiency is defined as the percentage of the theoretical memory bandwidth one can achieve, so it has no bearing on the memory size.

Do you use flash attention in your run? The default for the flash_attn flag is True. This might cause the difference between the estimated memory size and the nvidia-smi output.
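For a rough sense of scale, here is a back-of-the-envelope sketch (the head count and dtype below are assumptions for illustration, not values read from your config):

```python
# Without flash attention, the (batch, heads, seq, seq) attention-score matrix
# is materialized per layer; flash attention avoids that buffer entirely, so
# the two settings can show noticeably different memory usage at runtime.
batch, heads, seq = 16, 32, 1400   # heads is an assumed value for illustration
bytes_per_elem = 2                 # fp16/bf16

score_matrix_bytes = batch * heads * seq * seq * bytes_per_elem
print(f"attention scores per layer: {score_matrix_bytes / 1024**3:.2f} GB")  # ~1.87 GB
```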

Also, the memory estimates aim to be close to torch.cuda.max_memory_allocated; nvidia-smi additionally counts the CUDA context and memory reserved by the caching allocator, so it typically reads higher.
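A quick way to compare like with like is to log the allocator peaks inside the training script, e.g. after a training step (a minimal sketch):

```python
import torch

# Peak memory actually allocated by tensors -- this is what the estimate targets.
peak_alloc = torch.cuda.max_memory_allocated() / 1024**3
# Peak memory reserved by the caching allocator -- closer to what nvidia-smi shows,
# which additionally includes the CUDA context and any other processes.
peak_reserved = torch.cuda.max_memory_reserved() / 1024**3

print(f"max_memory_allocated: {peak_alloc:.2f} GB")
print(f"max_memory_reserved:  {peak_reserved:.2f} GB")
```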

Echozqn commented 11 months ago

Yes, I found that nvidia-smi did not measure the memory accurately when flash_attn was enabled. Thanks to the author for the answer.