EricLBuehler / mistral.rs


Crash if decoding parallel messages #123

Closed lucasavila00 closed 6 months ago

lucasavila00 commented 6 months ago

Manually running concurrent requests makes it crash too.

The following script also triggers it:

import OpenAI from "openai";
const openai = new OpenAI({
  baseURL: "http://localhost:1234/v1/",
  apiKey: "ignore",
});

const runOnce = async () => {
  const response = await openai.chat.completions.create({
    messages: [
      {
        role: "user",
        content: "Tell me a joke.",
      },
    ],
    model: "mistral",
  });

  console.log(response.choices[0].message.content);
};

const main = async () => {
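  // Fire three identical requests concurrently; overlapping decodes trigger the crash.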
  await Promise.all([runOnce(), runOnce(), runOnce()]);
};

main().catch((e) => {
  console.error(e);
  //@ts-ignore
  process.exit(1);
});
The server then panics with:

thread '<unnamed>' panicked at mistralrs-core/src/engine/mod.rs:293:44:
called `Result::unwrap()` on an `Err` value: ShapeMismatchCat { dim: 2, first_shape: [1, 8, 14, 128], n: 2, nth_shape: [1, 8, 13, 128] }
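The `ShapeMismatchCat` error means two KV-cache tensors being batched together disagree on the sequence dimension (14 vs. 13 cached positions). A minimal, purely illustrative PyTorch sketch of the same class of failure (mistral.rs itself uses candle, but the concatenation constraint is the same):

import torch

# KV-cache slices for two sequences: (batch, heads, seq_len, head_dim).
kv_a = torch.zeros(1, 8, 14, 128)  # first sequence: 14 cached positions
kv_b = torch.zeros(1, 8, 13, 128)  # second sequence: 13 -- one short

# Batching means concatenating along the batch dim (0); every other dim
# must match, but dim 2 differs (14 vs 13), so this raises a RuntimeError,
# the PyTorch analogue of candle's ShapeMismatchCat.
torch.cat([kv_a, kv_b], dim=0)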
EricLBuehler commented 6 months ago

@lucasavila00, can you please provide an example with the Python openai package? I have not yet been able to reproduce it.

lucasavila00 commented 6 months ago

@EricLBuehler I could also reproduce it from the shell by calling a script in parallel, for instance the regex script:

python3 regex.py & python3 regex.py & python3 regex.py && fg && fg

A different Python script instead of regex also reproduces it, e.g.:

import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:1234/v1/"

completion = openai.chat.completions.create(
    model="mistral",
    messages=[
        {
            "role": "user",
            "content": "Write a list of jokes. Return a markdown list where each item is a joke.",
        }
    ],
)

print(completion.choices[0].message.content)

For me, it fails about 50% of the time with 2 parallel requests; with 3 parallel requests it fails every time.
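For reference, the same parallel reproduction can be driven from a single Python process. A sketch assuming the local server and the openai 1.x client configuration used above:

import concurrent.futures

import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:1234/v1/"

def run_once(_: int) -> str:
    completion = openai.chat.completions.create(
        model="mistral",
        messages=[{"role": "user", "content": "Tell me a joke."}],
    )
    return completion.choices[0].message.content

# Three concurrent requests, mirroring the shell one-liner above.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
    for reply in pool.map(run_once, range(3)):
        print(reply)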

EricLBuehler commented 6 months ago

I think this is the problem area of the code. Its purpose is to make sure that all prompt sequences have the same length, but it evidently fails:

https://github.com/EricLBuehler/mistral.rs/blob/a1330f48b9eba6d42ca02580b49020f61c3dda68/mistralrs-core/src/scheduler.rs#L132-L167
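For illustration, the invariant that block is meant to enforce can be sketched as bucketing waiting sequences by prompt length, so only equal-length prompts land in the same prefill batch. A Python sketch of the idea, not the actual scheduler code:

from collections import defaultdict

def bucket_by_prompt_len(waiting: list[list[int]]) -> list[list[list[int]]]:
    # Group waiting token sequences by length; each bucket can then be
    # scheduled as one batch whose KV caches share a sequence length.
    buckets: defaultdict[int, list[list[int]]] = defaultdict(list)
    for tokens in waiting:
        buckets[len(tokens)].append(tokens)
    return list(buckets.values())

# Mixed-length prompts end up in separate batches, never in the same cat().
batches = bucket_by_prompt_len([[1, 2, 3], [4, 5, 6], [7, 8]])
assert all(len({len(seq) for seq in batch}) == 1 for batch in batches)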

EricLBuehler commented 6 months ago

#129 may fix this.

EricLBuehler commented 6 months ago

@lucasavila00, I was also able to reproduce the error even after #129.

EricLBuehler commented 6 months ago

This appears to be connected to adding prefill sequences: the first sequence is added normally, and then the rest are added as prefill sequences. It is probably due to the same off-by-one error that causes #126.
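A hypothetical illustration (invented bookkeeping, not taken from the code) of how such an off-by-one would produce the 14-vs-13 mismatch in the panic above:

prompt_len = 14  # matches first_shape in the panic

# First sequence: the full prompt is prefilled, caching every position.
first_cache_len = prompt_len       # 14

# Later sequences added as prefill: if the bookkeeping drops one position,
# their caches come up one short.
rest_cache_len = prompt_len - 1    # 13  <- the off-by-one

# Concatenating the caches then fails: 14 != 13 on the sequence dimension.
assert first_cache_len != rest_cache_len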

lucasavila00 commented 6 months ago

I had cloned #129 and it was not working, even for a single request. I don't remember the exact commit.

It looks like https://github.com/EricLBuehler/mistral.rs/issues/126 is indeed an off-by-one.

EricLBuehler commented 6 months ago

@lucasavila00, this should be fixed now.