FMInference / FlexLLMGen

Running large language models on a single GPU for throughput-oriented scenarios.
Apache License 2.0

Peak GPU memory use does not scale linearly with the weight GPU percentage #108

Open frankxyy opened 1 year ago

frankxyy commented 1 year ago

Command 1:

```
python -m flexgen.flex_opt --model facebook/opt-30b --path _DUMMY_ --prompt-len 20 --gen-len 15 --percent 25 75 60 40 0 100 --gpu-batch-size 1 --num-gpu-batches 2 --cpu-cache-compute --debug fewer_batch
```

Result: peak gpu mem: 6.0679 GB

Command 2:

```
python -m flexgen.flex_opt --model facebook/opt-30b --path _DUMMY_ --prompt-len 20 --gen-len 15 --percent 30 70 60 40 0 100 --gpu-batch-size 1 --num-gpu-batches 2 --cpu-cache-compute --debug fewer_batch
```

Result: GPU OOM

The only difference between command 2 and command 1 is the percentage of the weights placed on the GPU (the first of the six `--percent` values, which set the GPU/CPU placement percentages for the weights, the attention cache, and the activations), increased from 25% to 30%.

My GPU has 24 GB of memory.
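For context, here is a minimal back-of-envelope sketch of why this looks non-linear. It assumes facebook/opt-30b holds roughly 30B fp16 parameters and that the first `--percent` value is simply the fraction of weight bytes kept on the GPU; it ignores the KV cache, activations, and allocator overhead, and the names below (`PARAMS`, `weight_bytes_on_gpu`) are illustrative, not FlexGen APIs:

```python
# Back-of-envelope estimate (assumptions, not FlexGen internals):
# facebook/opt-30b has roughly 30e9 parameters stored as fp16 (2 bytes),
# and the first --percent value is the fraction of weights kept on GPU.

PARAMS = 30e9        # approximate parameter count, an assumption
BYTES_PER_PARAM = 2  # fp16
GIB = 1024 ** 3

def weight_bytes_on_gpu(percent: float) -> float:
    """Bytes of model weights expected to reside on the GPU."""
    return PARAMS * BYTES_PER_PARAM * percent / 100

extra = (weight_bytes_on_gpu(30) - weight_bytes_on_gpu(25)) / GIB
print(f"expected extra GPU weight memory for 25% -> 30%: ~{extra:.1f} GiB")
# ~2.8 GiB. Starting from the observed 6.07 GB peak at 25%, a linear
# model predicts roughly 9 GB at 30%, well under 24 GB -- yet command 2 OOMs.
```

Under that linear model, 30% should peak around 9 GB, so an OOM on a 24 GB card suggests the allocator reserves far more than the extra ~3 GiB of weights.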