bump
That depends entirely on the number of tokens you'd like to generate, the specific model you've selected, etc. A cold start can add ~15-30s to the response time. Here are some quick tests with the install defaults, from a warm start:
Prompt: "Write an Edgar Allen Poe style poem about a rabbit shaped cloud.", Penalty: 1.1, seed: 1111
25 Tokens: Billed Duration: 751607 ms
50 Tokens: Billed Duration: 759135 ms
100 Tokens: Billed Duration: 775215 ms
200 Tokens: Billed Duration: 809698 ms
400 Tokens: Billed Duration: 205096 ms
Setting up provisioned concurrency and auto-scaling is a good way to reduce the cold start latency.
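For reference, something like this boto3 sketch could wire that up. The function name, alias, and capacity numbers below are placeholders, not taken from this repo:

```python
# Hypothetical sketch: provisioned concurrency plus target-tracking
# auto-scaling on a Lambda alias. Names and capacities are assumptions.
import boto3

FUNCTION_NAME = "llama-cpp-inference"   # assumed function name
ALIAS = "live"                          # provisioned concurrency needs a version/alias

lambda_client = boto3.client("lambda")
autoscaling = boto3.client("application-autoscaling")

# Keep a baseline of warm execution environments on the alias.
lambda_client.put_provisioned_concurrency_config(
    FunctionName=FUNCTION_NAME,
    Qualifier=ALIAS,
    ProvisionedConcurrentExecutions=1,
)

# Register the alias with Application Auto Scaling so provisioned
# concurrency scales between 1 and 5 based on utilization.
resource_id = f"function:{FUNCTION_NAME}:{ALIAS}"
autoscaling.register_scalable_target(
    ServiceNamespace="lambda",
    ResourceId=resource_id,
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    MinCapacity=1,
    MaxCapacity=5,
)
autoscaling.put_scaling_policy(
    PolicyName="llm-provisioned-concurrency-tracking",
    ServiceNamespace="lambda",
    ResourceId=resource_id,
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 0.7,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
        },
    },
)
```

Note that keeping instances warm this way is billed continuously, so it mainly makes sense if you have steady traffic.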
If the numbers @sean-bailey shared are correct, with 751607ms for 25 tokens, I think cold start doesn't matter, since one request takes ~12.5 minutes to complete.
This should be the whole request. We can stream the response from the LLM to the client to get a shorter TTFB (time to first byte).
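As a rough sketch of the client side, assuming the function is set up for Lambda response streaming (the function name and payload shape here are made up):

```python
# Hypothetical sketch: consume a streamed Lambda response chunk by chunk
# with boto3, so the first tokens arrive long before the full completion.
# Requires the function to actually stream its response; names/payload are
# assumptions, not from this repo.
import json
import boto3

lambda_client = boto3.client("lambda")

response = lambda_client.invoke_with_response_stream(
    FunctionName="llama-cpp-inference",   # assumed function name
    Payload=json.dumps({"prompt": "Write a poem about a rabbit shaped cloud.",
                        "max_tokens": 100}),
)

# The EventStream yields payload chunks as the model generates them.
for event in response["EventStream"]:
    if "PayloadChunk" in event:
        print(event["PayloadChunk"]["Payload"].decode("utf-8"), end="", flush=True)
    elif "InvokeComplete" in event:
        break
```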
In my tests, I got ~250ms/token inference speed, about 4 tokens per second, using ARM64 and the WizardLM-7B-uncensored.ggml.q5_1.bin model.
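Back-of-the-envelope, that per-token speed plus some fixed per-request overhead gives a rough latency estimate per request. A toy calculation (the overhead value is an assumption for illustration, not a measurement):

```python
# Toy estimate: end-to-end latency from per-token decode speed plus a fixed
# per-request overhead (model load + prompt evaluation). Overhead is assumed.
MS_PER_TOKEN = 250   # ~4 tokens/s, as measured above
OVERHEAD_S = 60      # assumed warm-start model load + prompt eval

for tokens in (25, 50, 100, 200, 400):
    total_s = OVERHEAD_S + tokens * MS_PER_TOKEN / 1000
    print(f"{tokens:4d} tokens -> ~{total_s:.0f}s end to end")
```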
> If the numbers @sean-bailey shared are correct, with 751607ms for 25 tokens, I think cold start doesn't matter, since one request takes ~12.5 minutes to complete.
I actually think there is a copy-paste issue here, because those numbers are definitely an order of magnitude off. Dropped a decimal? Remember, the function is configured to stop at 900s, so there is no way I could even get to the "2000" seconds from the 400-token inference. Looking a bit closer, the 400-token time is actually shorter than the 200-token time, which is definitely not reflective of reality.
Running again with a stopwatch, the tests' real inference times (from request to response) were ~75s, ~75s, ~80s, ~85s, and ~180s for the 25, 50, 100, 200, and 400 token inferences, respectively.
What is the latency you get?