EricLBuehler / mistral.rs

Blazingly fast LLM inference.
MIT License

Streaming in Tokens is significantly slower than not streaming them in #99

Closed · LLukas22 closed this issue 2 months ago

LLukas22 commented 2 months ago

If streaming is enabled, generation slows down significantly.

The following script compares a plain (non-streaming) completion request with a streaming one:

import openai
from datetime import datetime

# OpenAI-compatible endpoint of the server under test (placeholder; adjust to your setup).
ENDPOINT = "http://localhost:1234/v1"

Runs = 4

def request(stream: bool):
    # Issue a chat completion against the server, optionally streaming the response.
    client = openai.Client(api_key="foobar", base_url=ENDPOINT)
    return client.chat.completions.create(
            model="mistral",
            messages=[{"role": "user", "content": "What is the meaning of life?"}],
            stream=stream,
            max_tokens=200,
            temperature=0.0
        )

def run():
    # Compare a blocking completion against a streamed one, Runs times.
    for _ in range(Runs):
        print("\nCompletion: ")
        print("="*15)

        now = datetime.now()
        answer = request(stream=False)
        finished = datetime.now()

        usage = answer.usage.completion_tokens      
        token_per_sek = usage / (finished-now).total_seconds()
        print(f"Duration: {finished-now}")
        print(f"Tokens per second: {token_per_sek}")

        print("\nStreaming: ")
        print("="*15)

        now = datetime.now()
        counter = 0
        stream = request(stream=True)
        for message in stream:
            counter += 1 
        finished = datetime.now()

        print(f"Duration: {finished-now}")
        token_per_sek = counter / (finished-now).total_seconds()
        print(f"Tokens per second: {token_per_sek}") 

if __name__ == "__main__":
    run()

Running it against a server hosted on a single L40 produces the following output:

Completion:
===============
Duration: 0:00:01.335570
Tokens per second: 67.38695837732205

Streaming:
===============
Duration: 0:00:02.141581
Tokens per second: 42.02502730459413

These are the results from an ollama server running on the same hardware:

Completion:
===============
Duration: 0:00:01.338007
Tokens per second: 84.45396772961577

Streaming:
===============
Duration: 0:00:01.380666
Tokens per second: 81.84455907511303
EricLBuehler commented 2 months ago

I was getting 93% of llama.cpp T/s last night (72 vs 79) on an A10 for a completion request. When I was benchmarking the sampling code, the finish_seq function adds 7ms to the normal sampling time of 2ms (finish_seq is called in the sampling process). I think that this may be caused by the use of a Responder to send the data across threads, which takes significantly longer. I think the only other way to avoid having 2 threads (one for the server, one for the engine) is to rework the server code significantly. I would assume that the ollama/llama.cpp server does not rely on multithreading to do streaming.

If streaming adds 7ms per token, that would explain the precipitous drop in performance! I wonder if we could use a faster method to send the response? As a side note, we provide the avg_compl_tok_per_sec field in the usage for a completion response for a more precise reading.
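For illustration, here is a rough, self-contained sketch (not the actual mistral.rs code; the Event type and message shape are made up) of the two-thread layout described above: the engine thread samples tokens and pushes each one through a channel, and the server thread receives them and forwards them to the client. Every streamed token pays the cost of that cross-thread hop.

use std::sync::mpsc;
use std::thread;

// Hypothetical per-token message; the real Responder payload differs.
enum Event {
    Token(String),
    Done,
}

fn main() {
    let (tx, rx) = mpsc::channel::<Event>();

    // Engine thread: samples tokens and sends each one across the channel.
    let engine = thread::spawn(move || {
        for i in 0..5 {
            // In the real engine this is where sampling + finish_seq happen;
            // the channel send is the per-token cross-thread hop.
            tx.send(Event::Token(format!("tok{i} "))).unwrap();
        }
        tx.send(Event::Done).unwrap();
    });

    // Server side (here: main thread) forwards each event to the SSE client.
    for event in rx {
        match event {
            Event::Token(t) => print!("{t}"),
            Event::Done => break,
        }
    }
    engine.join().unwrap();
}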

LLukas22 commented 2 months ago

> I was getting 93% of llama.cpp T/s last night (72 vs 79) on an A10 for a completion request. When I was benchmarking the sampling code, the finish_seq function adds 7ms to the normal sampling time of 2ms (finish_seq is called in the sampling process). I think that this may be caused by the use of a Responder to send the data across threads, which takes significantly longer. I think the only other way to avoid having 2 threads (one for the server, one for the engine) is to rework the server code significantly. I would assume that the ollama/llama.cpp server does not rely on multithreading to do streaming.

llama.cpp is probably using a single thread for its streamed responses, as it is strictly concurrent. I also ran some tests against TGI, and it doesn't show the same performance drop when streaming is active. They also use a Rust frontend for their batching and tokenization and distribute work across multiple threads. They seem to use tokio's async streams to move their predictions back to the SSE thread. Maybe there is another way to improve the latency between the engine and server threads; if I find some time I will look further into this.
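For reference, a minimal sketch of that tokio pattern: an unbounded mpsc channel whose receiver is wrapped into an async stream that the SSE handler polls. The message type and dependency features here are assumptions, not TGI's actual code.

// Assumed deps: tokio = { version = "1", features = ["full"] }, tokio-stream = "0.1"
use tokio_stream::{wrappers::UnboundedReceiverStream, StreamExt};

#[tokio::main]
async fn main() {
    // Unbounded channel from the generation task to the SSE handler.
    let (tx, rx) = tokio::sync::mpsc::unbounded_channel::<String>();

    // Generation task: sends each predicted token without blocking or awaiting.
    tokio::spawn(async move {
        for i in 0..5 {
            let _ = tx.send(format!("tok{i} "));
        }
        // Dropping `tx` ends the stream below.
    });

    // SSE side: wrap the receiver in a Stream and forward items as events.
    let mut stream = UnboundedReceiverStream::new(rx);
    while let Some(token) = stream.next().await {
        print!("{token}");
    }
}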

TGI results:

Completion:
===============
Duration: 0:00:03.730542
Tokens per second: 40.2086345630206

Streaming:
===============
Duration: 0:00:03.437420
Tokens per second: 40.146388861413506

> As a side note, we provide the avg_compl_tok_per_sec field in the usage for a completion response for a more precise reading.

Good to know 👍

EricLBuehler commented 2 months ago

Thanks; if you find any points that could be causing the slowness, I would be very curious. One of the reasons llama.cpp/ollama is so fast over the entire request (for completion) is that they implement prompt batching while we do not. So while their T/s for the entire request looks much higher, we are actually faster than it seems. I think prompt batching is an important thing to add, so I will take a look at it.
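To make the prompt-batching point concrete, here is a toy sketch (the forward function is a stand-in, not mistral.rs's API): without prompt batching the prompt is fed through one token per forward pass; with it, the whole prompt is processed in a single pass and only generation proceeds token by token.

// Stand-in for a model forward pass over `tokens`; in reality this would run
// the transformer and return logits for the next position.
fn forward(tokens: &[u32]) -> u32 {
    tokens.last().copied().unwrap_or(0) + 1
}

fn main() {
    let prompt: Vec<u32> = vec![10, 11, 12, 13];

    // Without prompt batching: one forward pass per prompt token, so the
    // prompt alone costs prompt.len() passes before the first generated token.
    let mut next = 0;
    for i in 0..prompt.len() {
        next = forward(&prompt[..=i]);
    }
    println!("without batching, first generated token: {next}");

    // With prompt batching: the whole prompt goes through in a single pass,
    // which shrinks time-to-first-token and raises whole-request T/s.
    let next = forward(&prompt);
    println!("with batching, first generated token: {next}");
}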

EricLBuehler commented 2 months ago

I was reading over the code again, and it looks like we decode multiple times during a streaming pass:

First, during sampling:

https://github.com/EricLBuehler/mistral.rs/blob/4b7b2639d12091bdbfe124c22957449d51a0b513/mistralrs-core/src/sampler.rs#L198-L201

And then again in get_delta:

https://github.com/EricLBuehler/mistral.rs/blob/4b7b2639d12091bdbfe124c22957449d51a0b513/mistralrs-core/src/sequence.rs#L294
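One possible way to avoid the double decode, sketched below with a made-up tokenizer stand-in (real BPE decoding is not always prefix-stable, which is presumably why get_delta re-decodes), would be to decode once per new token and cache the result on the sequence so both call sites reuse it:

// Hypothetical tokenizer stand-in; the real code uses the `tokenizers` crate.
fn decode(tokens: &[u32]) -> String {
    tokens.iter().map(|t| format!("<{t}>")).collect()
}

// Cache the decoded text on the sequence so each new token triggers at most
// one decode, shared by sampling and delta computation.
struct Sequence {
    tokens: Vec<u32>,
    decoded: String,
}

impl Sequence {
    fn push_token(&mut self, tok: u32) -> &str {
        self.tokens.push(tok);
        let full = decode(&self.tokens);
        let delta_start = self.decoded.len();
        self.decoded = full;
        // The streamed delta is whatever text was appended by this token.
        &self.decoded[delta_start..]
    }
}

fn main() {
    let mut seq = Sequence { tokens: Vec::new(), decoded: String::new() };
    for tok in [1u32, 2, 3] {
        let delta = seq.push_token(tok);
        println!("delta: {delta}");
    }
}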

lucasavila00 commented 2 months ago

@EricLBuehler I think it's this clone here https://github.com/EricLBuehler/mistral.rs/blob/master/mistralrs-core/src/engine/mod.rs#L157
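If that clone is the culprit, one option would be to make the cross-thread payload cheap to hand off, e.g. by wrapping it in an Arc so only a reference count is bumped on the hot path. A toy sketch with a hypothetical response type, not the actual engine code:

use std::sync::{mpsc, Arc};
use std::thread;

// Hypothetical stand-in for the response type sent to the Responder.
struct ChunkResponse {
    text: String,
}

fn main() {
    let (tx, rx) = mpsc::channel::<Arc<ChunkResponse>>();

    thread::spawn(move || {
        let chunk = Arc::new(ChunkResponse { text: "hello".into() });
        // Cloning an Arc bumps a reference count instead of deep-copying the
        // whole response, so the per-token send stays cheap.
        tx.send(Arc::clone(&chunk)).unwrap();
    });

    for chunk in rx {
        println!("{}", chunk.text);
    }
}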

lucasavila00 commented 2 months ago

Completion:
===============
Duration: 0:00:02.520320
Tokens per second: 55.94527679024886

Streaming: 
===============
Duration: 0:00:03.065591
Tokens per second: 45.994393903165815

After the recent changes we went from ~62% to ~82% efficiency (streaming T/s relative to the non-streaming completion). Ollama is at ~97%.

I wonder if they send an event for every token as we do. I don't think we need to send an event for every token...

I'm trying to profile the code too, to see if there's any low-hanging fruit.
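One possible shape for the "not every token" idea, purely as a sketch (the flush policy, names, and thresholds below are made up): buffer tokens and only emit an event every N tokens or every few milliseconds, whichever comes first. The trade-off is slightly higher perceived per-token latency in exchange for far fewer cross-thread sends and SSE events.

use std::time::{Duration, Instant};

// Hypothetical coalescer: buffer generated tokens and only emit an SSE event
// every `max_tokens` tokens or every `max_delay`, whichever comes first.
struct Coalescer {
    buf: String,
    buffered: usize,
    last_flush: Instant,
    max_tokens: usize,
    max_delay: Duration,
}

impl Coalescer {
    fn new(max_tokens: usize, max_delay: Duration) -> Self {
        Self {
            buf: String::new(),
            buffered: 0,
            last_flush: Instant::now(),
            max_tokens,
            max_delay,
        }
    }

    // Returns Some(chunk) whenever an event should actually be sent.
    fn push(&mut self, token: &str) -> Option<String> {
        self.buf.push_str(token);
        self.buffered += 1;
        if self.buffered >= self.max_tokens || self.last_flush.elapsed() >= self.max_delay {
            self.buffered = 0;
            self.last_flush = Instant::now();
            Some(std::mem::take(&mut self.buf))
        } else {
            None
        }
    }
}

fn main() {
    let mut c = Coalescer::new(4, Duration::from_millis(50));
    for i in 0..10 {
        if let Some(chunk) = c.push(&format!("tok{i} ")) {
            println!("send event: {chunk}");
        }
    }
    // Flush whatever is left when generation ends.
    if !c.buf.is_empty() {
        println!("send event: {}", c.buf);
    }
}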