FMInference / FlexLLMGen

Running large language models on a single GPU for throughput-oriented scenarios.
Apache License 2.0

Update documentation to explicitly describe compatibility/performance with early Pascal cards #55

Open tensiondriven opened 1 year ago

tensiondriven commented 1 year ago

This was originally a question I wanted to ask, but in the interest of not abusing GitHub Issues, I'm disguising it as a feature request for documentation :)

There are a couple of very inexpensive cards with large VRAM: the Tesla M40 24GB (Maxwell) and the Tesla P40 24GB (Pascal). Neither of these seems to have Tensor cores, which makes them pretty useless for FP16 math - and maybe equally useless for int8/int4, I'm not sure.

What is confusing to a lot of people interested in running LLMs on commodity hardware is that the Tesla P40 is listed as part of the "Pascal" family, and a feature of Pascal is the inclusion of FP16 processing. However, the Tesla P40 specifically lacks fast FP16 support and thus runs FP16 at 1/64th the performance of other Tesla Pascal-series cards.

Question 1: Do you know if FlexGen will run on a P40 24GB with reasonable performance, given that it is using 8-bit or 4-bit math? Is it comparable to other Pascal cards in terms of performance?

Question 2: Do you know if FlexGen can split a model across multiple Tesla P40 cards? Something I read suggested that splitting the model was not possible using bitsandbytes on older cards, but I'm not clear on the reason.

For context: if it turns out that the Tesla P40, or two to three Tesla P40s, can give reasonable performance in the < 1 second/token range for inference on large models, it would open up a new world of possibility for individuals looking to run LLMs at home.

Ying1123 commented 1 year ago

Q1: I think FlexGen should be able to run on a P40 as long as you can find a compatible PyTorch version.

Q2: Yes. We do not use bitsandbytes. FlexGen supports distributed GPU execution: https://github.com/FMInference/FlexGen#scaling-to-distributed-gpus
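As a quick sanity check (a sketch, not from the FlexGen docs), you can confirm that the installed PyTorch build actually supports the P40's compute capability (sm_61) before trying FlexGen:

# Print the PyTorch and CUDA versions, then the detected GPU and its compute capability.
# A P40 should report (6, 1); if the CUDA wheel was not built with sm_61 kernels,
# FlexGen will fail at runtime even though the card is visible.
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
python3 -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"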

Ph0rk0z commented 1 year ago

Forget Maxwell. It's too slow.

With Pascal (P6000) I get:

Output generated in 27.44 seconds (0.36 it/s, 80 tokens)
Output generated in 26.59 seconds (0.38 it/s, 80 tokens)
Output generated in 30.43 seconds (0.33 it/s, 80 tokens)

on OPT-13B with --percent 85 15 0 100 0 100

It doesn't run out of memory while generating, and performance approaches usable.
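For reference, the equivalent standalone invocation presumably looks something like this (a sketch: the model name is an assumption, and per the FlexGen README the six --percent values are the weight GPU/CPU, attention-cache GPU/CPU, and activation GPU/CPU percentages):

# 85% of weights on GPU, 15% on CPU; KV cache and activations entirely on CPU.
python3 -m flexgen.flex_opt --model facebook/opt-13b --percent 85 15 0 100 0 100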

Unfortunately, it seems the stop command in the API doesn't actually stop generation, so the token limit is always reached.

merrymercy commented 1 year ago

Hi @Ph0rk0z, the stop argument was fixed yesterday in this commit: https://github.com/FMInference/FlexGen/commit/cf90920349109205378e5253fd5e8da4fa2740c1

Could you try it again?
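If you want to pick up that fix in an existing clone without waiting for a release, something like this should work (a sketch; adjust the remote name to your setup):

# Fetch and check out the commit linked above inside an existing FlexGen checkout.
git fetch origin
git checkout cf90920349109205378e5253fd5e8da4fa2740c1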

Ph0rk0z commented 1 year ago

I tried it, and I think it now pads the tokens but does stop? When I did a git pull on FlexGen, ooba completely broke, so I undid it.

merrymercy commented 1 year ago

The stop argument is implemented in https://github.com/FMInference/FlexGen/blob/c33d8e0114d6b5e1e21db75a5837d86b47ea40b0/flexgen/flex_opt.py#L1030-L1031 and https://github.com/FMInference/FlexGen/blob/c33d8e0114d6b5e1e21db75a5837d86b47ea40b0/flexgen/flex_opt.py#L772-L775.

Yeah, we changed the API to make it simpler. I suggest you do a fresh clone and port your old examples.
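A fresh setup along those lines might look like this (a sketch; it assumes an editable install from source works in your environment):

# Clone a clean copy, install it in editable mode, then port old scripts to the new API.
git clone https://github.com/FMInference/FlexGen.git
cd FlexGen
pip install -e .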

Ph0rk0z commented 1 year ago

Can confirm it is working now.

More real-world performance: Quadro P6000, AMD 1700X, 96 GB RAM

13B compressed, all on GPU: --compress-weight --percent 100 0 100 0 100 0

Output generated in 40.12 seconds (0.62 it/s, 200 tokens)
Output generated in 13.13 seconds (0.76 it/s, 80 tokens)
Output generated in 14.85 seconds (0.67 it/s, 80 tokens)

13B compressed, KV cache and activations on CPU: --compress-weight --percent 100 0 0 100 0 100

Output generated in 24.34 seconds (0.41 it/s, 80 tokens)
Output generated in 27.17 seconds (0.37 it/s, 80 tokens)
Output generated in 35.32 seconds (0.28 it/s, 80 tokens)

30B compressed, all on GPU: --compress-weight --percent 100 0 100 0 100 0

Output generated in 205.92 seconds (0.05 it/s, 80 tokens)
Output generated in 42.36 seconds (0.24 it/s, 80 tokens)
Output generated in 60.85 seconds (0.16 it/s, 80 tokens)

30B compressed, KV cache and activations on CPU: --compress-weight --percent 100 0 0 100 0 100

Output generated in 62.28 seconds (0.16 it/s, 80 tokens)
Output generated in 39.61 seconds (0.25 it/s, 80 tokens)
Output generated in 30.47 seconds (0.33 it/s, 80 tokens)

13B compressed, weights on CPU: --compress-weight --percent 0 100 100 0 100 0

Output generated in 12.53 seconds (0.80 it/s, 80 tokens)
Output generated in 10.57 seconds (0.95 it/s, 80 tokens)
Output generated in 10.50 seconds (0.95 it/s, 80 tokens)
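For anyone trying to reproduce these numbers with FlexGen directly (the figures above appear to come through the ooba web UI, judging from the output format), the standalone equivalents presumably look something like this - a sketch, with the model names assumed from the sizes quoted:

# 13B with weight compression, everything kept on the GPU.
python3 -m flexgen.flex_opt --model facebook/opt-13b --compress-weight --percent 100 0 100 0 100 0

# 30B with weight compression, KV cache and activations offloaded to CPU RAM.
python3 -m flexgen.flex_opt --model facebook/opt-30b --compress-weight --percent 100 0 0 100 0 100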