baileytec-labs / llama-on-lambda

Deploy llama.cpp compatible Generative AI LLMs on AWS Lambda!

What is the latency? #1

Closed by philschmid 1 year ago

philschmid commented 1 year ago

What latency are you getting?

DoctorSlimm commented 1 year ago

bump

sean-bailey commented 1 year ago

That can depend entirely on the number of tokens you'd like to generate, the specific model you've selected, etc. With a cold start, you can add ~15-30s to the response time. Here are some quick tests from the defaults of the install, with a warm start:

Prompt: "Write an Edgar Allen Poe style poem about a rabbit shaped cloud.", Penalty: 1.1, seed: 1111

25 Tokens: Billed Duration: 751607 ms

50 Tokens: Billed Duration: 759135 ms

100 Tokens: Billed Duration: 775215 ms

200 Tokens: Billed Duration: 809698 ms

400 Tokens: Billed Duration: 205096 ms
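
For anyone who wants to sanity-check these numbers, here's a minimal sketch, assuming boto3 credentials and a hypothetical log group name, that aggregates the `@billedDuration` field from the function's REPORT lines with CloudWatch Logs Insights:

```python
import time
import boto3

logs = boto3.client("logs")

# Hypothetical log group name; replace with your deployed function's log group.
LOG_GROUP = "/aws/lambda/llama-on-lambda-function"

# Logs Insights query over the REPORT line Lambda emits for every invocation.
query_id = logs.start_query(
    logGroupName=LOG_GROUP,
    startTime=int(time.time()) - 3600,  # last hour
    endTime=int(time.time()),
    queryString=(
        'filter @type = "REPORT" '
        "| stats avg(@billedDuration), max(@billedDuration), count(*)"
    ),
)["queryId"]

# Poll until the query finishes, then print the aggregated billed durations (ms).
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] == "Complete":
        for row in result["results"]:
            print({field["field"]: field["value"] for field in row})
        break
    time.sleep(1)
```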

bnusunny commented 1 year ago

Setting up provisioned concurrency and auto-scaling is a good way to reduce the cold start latency.
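
For reference, a minimal sketch of what that could look like with boto3; the function and alias names here are hypothetical, and provisioned concurrency has to target an alias or a published version rather than $LATEST:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical function and alias names.
FUNCTION = "llama-on-lambda-function"
ALIAS = "live"

# Let Application Auto Scaling keep between 1 and 10 warm environments ready.
autoscaling.register_scalable_target(
    ServiceNamespace="lambda",
    ResourceId=f"function:{FUNCTION}:{ALIAS}",
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    MinCapacity=1,
    MaxCapacity=10,
)

# Scale the provisioned concurrency on its utilization (70% target).
autoscaling.put_scaling_policy(
    PolicyName="llama-pc-target-tracking",
    ServiceNamespace="lambda",
    ResourceId=f"function:{FUNCTION}:{ALIAS}",
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 0.7,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
        },
    },
)
```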

philschmid commented 1 year ago

If the numbers @sean-bailey shared are correct, with 751607 ms for 25 tokens, I think cold starts don't matter, since a single request takes 12.5 minutes to complete.

bnusunny commented 1 year ago

That should be the whole request. We can stream the response from the LLM to the client to get a shorter TTFB (time to first byte).

In my tests, I got ~250 ms/token inference speed, about 4 tokens per second, using ARM64 and the WizardLM-7B-uncensored.ggml.q5_1.bin model.
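
As a rough illustration of why streaming helps, here's a minimal sketch using the llama-cpp-python bindings (not necessarily how this project invokes llama.cpp; the model path is hypothetical). At ~250 ms/token, the client sees the first token almost immediately instead of waiting for the whole completion:

```python
from llama_cpp import Llama  # llama-cpp-python bindings; an assumption, not necessarily this repo's code path

# Hypothetical model path; use whatever model file your llama.cpp build supports.
llm = Llama(model_path="/opt/models/WizardLM-7B-uncensored.ggml.q5_1.bin")

# stream=True yields tokens as they are generated rather than one final string,
# so the time to first byte is roughly one token's latency instead of the full run.
for chunk in llm(
    "Write an Edgar Allen Poe style poem about a rabbit shaped cloud.",
    max_tokens=200,
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```

Getting those streamed tokens out of Lambda itself would still need response streaming support (for example, a function URL with the RESPONSE_STREAM invoke mode), which is separate from the model-side streaming shown here.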

sean-bailey commented 1 year ago

> If the numbers @sean-bailey shared are correct, with 751607 ms for 25 tokens, I think cold starts don't matter, since a single request takes 12.5 minutes to complete.

I actually think there is a copy-paste issue here, because that's definitely an order of magnitude off. Dropped a decimal? Remember, the function is configured to stop at 900s, so there is no way I could even have gotten to the ~2000 seconds that the 400 token inference would have needed at those rates. Looking a bit closer, the 400 token time is actually shorter than the 200 token time, which is definitely not reflective of reality.

Running again with a stopwatch, the tests' real inference times (from request to response) were ~75s, ~75s, ~80s, ~85s, and ~180s for the 25, 50, 100, 200, and 400 token inferences, respectively.
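
If anyone wants to reproduce these stopwatch timings, here's a minimal harness sketch; the endpoint URL and payload keys are hypothetical and should be adjusted to however your deployment exposes the function:

```python
import time
import requests  # assumes the function is fronted by an HTTP endpoint

# Hypothetical endpoint URL; replace with your API Gateway / function URL.
ENDPOINT = "https://example.execute-api.us-east-1.amazonaws.com/prod/generate"

for tokens in (25, 50, 100, 200, 400):
    start = time.perf_counter()
    resp = requests.post(
        ENDPOINT,
        json={
            "prompt": "Write an Edgar Allen Poe style poem about a rabbit shaped cloud.",
            "tokens": tokens,   # hypothetical payload keys
            "penalty": 1.1,
            "seed": 1111,
        },
        timeout=900,  # match the function's 900s limit
    )
    elapsed = time.perf_counter() - start
    print(f"{tokens} tokens: {elapsed:.1f}s (HTTP {resp.status_code})")
```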