cli99 / llm-analysis

Latency and Memory Analysis of Transformer Models for Training and Inference
Apache License 2.0

A question about layernorm activation memory. #18

Closed LinHanyueEsar closed 9 months ago

LinHanyueEsar commented 10 months ago

Hi,

The function get_memory_activation_per_layer_layernorm() returns seq_len * batch_size * hidden_dim / sp_size * dtype_bytes, which in fp16 is 2sbh/sp (with sp the sequence-parallel size).

However, the paper Reducing Activation Recomputation in Large Transformer Models states that the activation memory of LayerNorm is 4sbh.

Unfortunately, I'm not very familiar with LLM memory consumption. Since the other activation memory results match the paper, I wonder whether there is a mistake in the paper or a bug in this function?
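
To make sure I'm reading it right, here is how I understand the computation (a sketch with my own variable names, not the actual signature in analysis.py):

```python
def layernorm_act_bytes(seq_len, batch_size, hidden_dim, sp_size=1, dtype_bytes=2):
    # A single LayerNorm stores its input: seq_len * batch_size * hidden_dim elements,
    # sharded across the sequence-parallel group, at dtype_bytes per element.
    return seq_len * batch_size * hidden_dim * dtype_bytes / sp_size

# fp16, no sequence parallelism: this is the 2sbh value I get from the function.
print(layernorm_act_bytes(seq_len=2048, batch_size=1, hidden_dim=4096))  # 2*2048*1*4096 = 16 MiB
```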

Thanks, Esar

cli99 commented 9 months ago

@LinHanyueEsar, get_memory_activation_per_layer_layernorm is per LayerNorm. I think the paper means that the two LayerNorms in a transformer layer take 4sbh in total. https://github.com/cli99/llm-analysis/blob/main/llm_analysis/analysis.py#L863 multiplies the per-LayerNorm value by 2.
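
In other words (a rough sketch, not the exact code in analysis.py):

```python
def per_layernorm_act_bytes(s, b, h, dtype_bytes=2):
    # One LayerNorm keeps its s*b*h input, i.e. 2sbh bytes in fp16.
    return dtype_bytes * s * b * h

def per_layer_layernorm_act_bytes(s, b, h):
    # Two LayerNorms per transformer layer -> 4sbh in total, matching the paper.
    return 2 * per_layernorm_act_bytes(s, b, h)
```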

LinHanyueEsar commented 9 months ago

Thanks for your reply! I missed the factor of 2 in get_memory_activation_per_layer.

By the way, I have another question about the formula from this paper. I have used it to calculate the memory usage of some models, but it doesn't seem to match the cases mentioned in this issue. The formula 34sbh + 5abs^2 fails to reproduce the 60GB activation memory reported for GPT-2 (in DeepZero it is 12sbhL) or the 45.61GB reported for Llama-7b.

Have you seen these statistics before? Are there conflicts between these statistics and the formula?
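
For reference, this is how I'm applying the formula (per layer, in bytes, summed over layers; the Llama-7B-like config and batch size below are my own assumptions, since the papers don't state them):

```python
def per_layer_act_bytes(s, b, h, a):
    # 34sbh + 5abs^2 from "Reducing Activation Recomputation in Large Transformer Models"
    # (fp16 training, no recomputation, no tensor/sequence parallelism).
    return 34 * s * b * h + 5 * a * b * s**2

def total_act_gib(s, b, h, a, num_layers):
    return per_layer_act_bytes(s, b, h, a) * num_layers / 1024**3

# Llama-7B-like config (my guess): h=4096, a=32, 32 layers, s=2048, b=1.
print(total_act_gib(s=2048, b=1, h=4096, a=32, num_layers=32))  # ~28.5 GiB, not 45.61 GB
```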

cli99 commented 9 months ago

The LOMO paper mentions that the activation memory usage was obtained by profiling, but does not detail how it was measured. 34sbh + 5abs^2 aligns well when a memory-optimized attention implementation (e.g. FlashAttention) is not used.
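
If I understand correctly, the quadratic term is the part a FlashAttention-style kernel avoids, since the s x s attention scores are never materialized; a rough sketch under that assumption:

```python
def per_layer_act_bytes(s, b, h, a, fused_attention=False):
    # 34sbh: activations that scale linearly with sequence length
    # (QKV/MLP inputs and outputs, LayerNorms, dropout masks).
    # 5abs^2: attention scores, softmax output and its dropout mask, which a
    # fused/memory-optimized attention kernel does not keep around.
    quadratic = 0 if fused_attention else 5 * a * b * s**2
    return 34 * s * b * h + quadratic
```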