confident-ai / deepeval

The LLM Evaluation Framework
https://docs.confident-ai.com/
Apache License 2.0
2.35k stars 166 forks

Support batching for custom models #702

Open alexkreidler opened 2 months ago

alexkreidler commented 2 months ago

Is your feature request related to a problem? Please describe. I'm trying to evaluate a local LLM with ExLlamaV2 using deepeval's support for the MMLU benchmark. Unfortunately, the current custom model class has no way to pass a batch of questions to the model runner, which means the hardware is not being used efficiently.

Describe the solution you'd like Let generate() also accept a list of strings and return a list of strings, and add an option in benchmark.evaluate to set the batch size.

Here's my code

from deepeval.benchmarks.mmlu.mmlu import MMLU
from pandas import DataFrame

from exllamav2 import *
from exllamav2.generator import *

print("Loading model...")

config = ExLlamaV2Config("/mnt/data/textgenmodels/LoneStriker_OpenHermes-2-Mistral-7B-4.0bpw-h6-exl2/")
config.max_seq_len = 4096

from deepeval.models.base_model import DeepEvalBaseLLM

class Mistral7B(DeepEvalBaseLLM):
    def __init__(
        self
    ):
        self.model = ExLlamaV2(config)
        # Lazy cache so the weights can be auto-split across available GPUs
        cache = ExLlamaV2Cache(self.model, lazy = True)
        self.model.load_autosplit(cache)

        tokenizer = ExLlamaV2Tokenizer(config)
        self.generator = ExLlamaV2StreamingGenerator(self.model, cache, tokenizer)
        self.generator.set_stop_conditions([tokenizer.eos_token_id])

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        # Generate up to 256 tokens with default sampler settings
        out = self.generator.generate_simple(prompt, ExLlamaV2Sampler.Settings(), 256, seed = 1234)
        print(f"""Input: {prompt}
Output: {out}
""")
        return out

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return "Mistral 7B EXL2"

benchmark = MMLU()

results = benchmark.evaluate(model=Mistral7B())
print(results)
print(benchmark.predictions)
pred: DataFrame = benchmark.predictions
pred.to_csv("./predictions.csv")
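
To make the request concrete, here is a rough sketch of what a batched variant of the class above could look like. The batch_generate name is only a placeholder, and it assumes exllamav2's generate_simple can accept a list of prompts when the cache is created with a large enough batch_size; if it can't, the sketch falls back to a plain loop.

# Hypothetical sketch of the requested batched interface (names are placeholders)
class Mistral7BBatched(Mistral7B):
    def batch_generate(self, prompts: list) -> list:
        settings = ExLlamaV2Sampler.Settings()
        try:
            # Assumed: generate_simple accepts a list of prompts and returns a
            # list of completions when the cache batch_size is large enough.
            return self.generator.generate_simple(prompts, settings, 256, seed = 1234)
        except Exception:
            # Fallback: generate sequentially, one prompt at a time.
            return [self.generator.generate_simple(p, settings, 256, seed = 1234) for p in prompts]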
penguine-ip commented 2 months ago

Hey @alexkreidler, thanks for the suggestion. Can you show how you would do this using your Mistral example? An example of how to take in the batch of strings and how to return it would help us implement this interface. Also feel free to implement it yourself if that's faster.

Falk358 commented 2 months ago

Hi @penguine-ip ,

just chiming in that I would be interested in this feature as well. I'm working with the official Mistral repository (https://github.com/mistralai/mistral-src) and using a pruned version of the original Mistral model. In their main module, they implement a generate() method which can take a list of prompts as input. Therefore, an interface like this would be very useful:

class CustomMistral(DeepEvalBaseLLM):
    def __init__(self, model):
        self.model = model
    ...
    ...
    ...
    def generate(self, prompts: list) -> list:
        ...
        results = self.model.generate(prompts)
        ... # format results correctly
        return results

Being able to generate responses with batched requests in each forward pass significantly reduces compute time (depending on the batch size). In my own reference implementation of the MMLU eval, a batch size of 8 reduced compute time by a factor of 7 compared to a batch size of 1.

penguine-ip commented 2 months ago

@kritinv I think there could be a generate_batch() just for the benchmarks?

@Falk358 Are there limits to the batch size for your mistral example?

Falk358 commented 2 months ago

@penguine-ip As far as I know, Mistral's generate method doesn't impose any batch size limit on the user; the underlying PyTorch code will throw a CUDA out-of-memory error if GPU memory is full. In my concrete case, I'm running an RTX 3090, which means the maximum batch size I can use is 8 (8 LLM requests per forward pass).

I think a separate generate_batch() sounds like a very good solution. It would definitely be compatible with my use case, provided I can somehow control the batch size passed to it via a parameter.
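
For illustration, chunking on the benchmark side could look roughly like this (run_batched and batch_generate are hypothetical names, not deepeval's actual API):

# Hypothetical helper showing caller-controlled batching; not deepeval's API
def run_batched(model, prompts, batch_size=8):
    predictions = []
    for i in range(0, len(prompts), batch_size):
        chunk = prompts[i : i + batch_size]
        predictions.extend(model.batch_generate(chunk))  # proposed batched entry point
    return predictions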

penguine-ip commented 1 month ago

@Falk358 and @alexkreidler, this took a lot longer than I thought, but it is out: https://docs.confident-ai.com/docs/benchmarks-introduction#create-a-custom-llm

Can you please check if it is working (latest release v0.21.43) and whether the example in the docs is correct? Thanks!
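
Based on this thread, usage looks roughly like the sketch below; treat the batch_generate signature and the batch_size argument as assumptions and check the linked docs for the exact interface.

# Rough usage sketch; see the linked docs for the authoritative interface
class CustomMistral(DeepEvalBaseLLM):
    ...
    def batch_generate(self, prompts: list) -> list:
        # Return one completion per prompt, in the same order
        ...

benchmark = MMLU()
# batch_size assumed to control how many prompts are passed per batch_generate call
results = benchmark.evaluate(model=custom_llm, batch_size=8)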

Falk358 commented 1 month ago

Hi @penguine-ip,

thanks for the swift implementation! Unfortunately, there seems to be a problem with the evaluation for my generate() function when using release v0.21.43. I was using v0.21.36 previously. This is my generate() method in v0.21.36:

    def generate(self, prompt: str) -> str:
        model = self.load_model()
        final_prompt = f"[INST]{prompt}[/INST]"

        result, _ = generate(
            prompts=[final_prompt],
            model=model,
            tokenizer=self.tokenizer,
            max_tokens=self.max_tokens,
            temperature=self.temperature,
        )
        # mistral-src's generate() returns the full "[INST]prompt[/INST]answer" string;
        # keep only the answer part
        answer = result[0].rsplit(sep="[/INST]", maxsplit=1)[1]
        answer = answer.strip()
        if len(answer) == 0:
            answer = " "  # return non-empty string to avoid crashes during eval
        return answer

this code calls the main.generate() function from https://github.com/mistralai/mistral-src, which returns a list of strings (called result in the code above). Each entry in this list has the following format: "[INST]prompt_passed_to_model[/INST]answer_of_model". Therefore, I split the entry after the closing "[/INST]" token and take the rest as the model answer. The answer is also strip()ped to avoid leading whitespace being evaluated further down the pipeline. My model generates further explanation alongside the answer, which I keep. In v0.21.36 this did not break evaluation on the "high_school_european_history" subset of MMLU (I get an accuracy of 0.6 for my experiment); in v0.21.43 it breaks and I get an accuracy of 0. All of this broke without any other changes to the class, and I had not yet started writing batch_generate().
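
As a concrete illustration of the split described above (the sample string is made up):

# Made-up example of the "[INST]prompt[/INST]answer" layout described above
result = ["[INST]What is 2 + 2? A. 3 B. 4[/INST] B. 4 is correct because ..."]
answer = result[0].rsplit(sep="[/INST]", maxsplit=1)[1].strip()
# answer == "B. 4 is correct because ..."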

Is my code flawed in some way or is this a bug?

Thanks so much for your help!

Max

penguine-ip commented 1 month ago

Hey @Falk358, hard to tell immediately just from looking. Do you have a forked version? I can show you where to add a single print statement to check whether this is the expected behavior, let me know!

Falk358 commented 1 month ago

Hi @penguine-ip,

https://github.com/Falk358/mistral-src is the forked repo. You can find my implementation in the file mistral_wrapper_lm_eval.py.

It would be helpful to have a more exact specification of what kind of answer format (beyond it being a string) the generate method should return; I could not find anything in deepeval's docs so far. For example, the maximum length of the string, or whether it should follow a certain formatting. I believe that the way deepeval evaluates MMLU changed in the new release: the old version was probably able to extract "A", "B", "C" or "D" from the return value of generate(), while this doesn't seem to be the case anymore.
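
Purely as an illustration of what such extraction could look like (this is not deepeval's actual parsing logic), the option letter could be pulled out of the free-form answer before returning it:

import re

def extract_option_letter(answer: str) -> str:
    # Illustrative workaround: return only the leading A/B/C/D if one is present
    match = re.match(r"\s*([ABCD])\b", answer)
    return match.group(1) if match else answer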

I checked the contents of answer in v0.21.43 and it looked something like this: "A. Text from prompt after option A\n\n Some more content", where the second sentence was cut short due to my max_tokens limit. This is expected behaviour (it had the same layout in v0.21.36, where it still evaluated correctly). Hope this helps!

Kind Regards, Max