cli99 / llm-analysis

Latency and Memory Analysis of Transformer Models for Training and Inference
Apache License 2.0

BUG Fix #6

Closed · 9tong closed this 1 year ago

9tong commented 1 year ago

`decode_activation_memory_per_gpu` should be equal to `decode_activation_memory_per_layer * num_layers_per_gpu`

cli99 commented 1 year ago

@9tong thanks for the PR. I was out for a few weeks. For inference, we want to reuse the tensor memory across layers, so the per-layer activation memory is not multiplied by the number of layers. Let me know if this does not make sense.
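
For context, here is a minimal sketch contrasting the two accounting approaches discussed above. This is not the actual llm-analysis code; the function signature and the `is_inference` flag are hypothetical, chosen only to mirror the identifiers in this thread.

```python
def decode_activation_memory_per_gpu(
    decode_activation_memory_per_layer: float,
    num_layers_per_gpu: int,
    is_inference: bool = True,
) -> float:
    """Return an estimate of peak decode activation memory per GPU (bytes)."""
    if is_inference:
        # No backward pass during decoding, so each layer's activation
        # buffer can be freed (or reused) before the next layer runs.
        # Peak usage is therefore roughly one layer's worth.
        return decode_activation_memory_per_layer
    # For training, every layer's activations must be kept for the
    # backward pass, so memory scales with the number of layers.
    return decode_activation_memory_per_layer * num_layers_per_gpu
```

Under this reuse assumption, multiplying by `num_layers_per_gpu` as the PR proposed would overestimate decode activation memory by a factor of the number of layers per GPU.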