FMInference / H2O

[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.

H2O is slower than full #10

Open haiasd opened 8 months ago

haiasd commented 8 months ago

I ran `bash scripts/streaming/eval.sh full` and `bash scripts/streaming/eval.sh h2o` on one A100 80G GPU; full took 489s while h2o took 7200s.

haiasd commented 8 months ago

I ran `python -m flexgen.flex_opt --gpu-batch-size 1 --overlap false --model facebook/opt-6.7b --path _DUMMY_ --prompt-len 512 --gen-len 512` and `python flex_opt.py --gpu-batch-size 1 --overlap false --hh-ratio 0.2 --hh-all --model facebook/opt-6.7b --path _DUMMY_ --prompt-len 512 --gen-len 512` on one A100 80G GPU. H2O still seems slower than the baseline. (baseline and h2o timing screenshots omitted)
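For context on where the extra time can come from: an H2O-style policy keeps the most recent tokens plus the "heavy hitters" with the largest accumulated attention scores, which means a per-step ranking and gather over the KV cache that the dense baseline never performs. A minimal sketch of such a selection step (the function name `h2o_evict` and its signature are hypothetical, not the repo's actual code):

```python
import numpy as np

def h2o_evict(acc_scores, recent_window, budget):
    """Return sorted indices of KV-cache entries to keep under an
    H2O-style policy: always keep the last `recent_window` tokens,
    and fill the rest of `budget` with the heaviest hitters
    (largest accumulated attention scores) among older tokens."""
    n = len(acc_scores)
    if n <= budget:
        # Cache still fits; nothing to evict.
        return np.arange(n)
    # Most recent tokens are always retained.
    recent = np.arange(max(0, n - recent_window), n)
    n_heavy = budget - len(recent)
    # Older tokens compete for the remaining slots by accumulated score.
    candidates = np.arange(n - recent_window) if n > recent_window else np.array([], dtype=int)
    if n_heavy > 0 and len(candidates) > 0:
        heavy = candidates[np.argsort(acc_scores[candidates])[-n_heavy:]]
    else:
        heavy = np.array([], dtype=int)
    return np.sort(np.concatenate([heavy, recent]))
```

If an implementation runs this ranking every decoding step and then materializes masks or copies the cache, that bookkeeping can easily outweigh the attention FLOPs saved at these prompt/generation lengths, which is one plausible reason the h2o runs above come out slower than full.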

chaos318 commented 8 months ago

> I ran `bash scripts/streaming/eval.sh full` and `bash scripts/streaming/eval.sh h2o` on one A100 80G GPU; full took 489s while h2o took 7200s.

I tried it on an A30 and reached the same conclusion: full is also much faster than h2o.