explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

long contexts won't be split automatically #597

Closed stydxm closed 5 months ago

stydxm commented 8 months ago

Describe the bug When evaluating with a long context, an error like this was raised:

openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 4097 tokens. However, your messages resulted in 6402 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}

Ragas version: 0.1.0
Python version: 3.10

Code to Reproduce

from datasets import Dataset
from langchain_openai.chat_models import ChatOpenAI
from ragas import evaluate
from ragas.metrics import faithfulness

gpt = ChatOpenAI(model_name="gpt-3.5-turbo")
dataset = {"question": [], "answer": [], "contexts": [], "ground_truths": []}  # hide custom dataset here, the contexts should be long enough to reproduce this
dataset = Dataset.from_dict(dataset)
result = evaluate(dataset, metrics=[faithfulness], llm=gpt)

Error trace Not included; I don't think it would help.

Expected behavior I think the evaluate function should split contexts that are longer than the LLM can process, and, for custom models, there should also be a parameter for specifying the context window. If contexts longer than the context window are not the expected usage, then perhaps the error could be raised by this package before the data is sent to the LLM.
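
A minimal sketch of the pre-flight check suggested above, assuming tiktoken is available for OpenAI tokenization; the 4097-token limit and the helper name rows_exceeding_context are illustrative, not part of ragas:

import tiktoken

MAX_CONTEXT_TOKENS = 4097  # gpt-3.5-turbo's context window

def rows_exceeding_context(dataset, model_name="gpt-3.5-turbo"):
    """Return indices of rows whose concatenated contexts exceed the model's limit."""
    enc = tiktoken.encoding_for_model(model_name)
    too_long = []
    for i, contexts in enumerate(dataset["contexts"]):
        n_tokens = len(enc.encode("\n".join(contexts)))
        if n_tokens > MAX_CONTEXT_TOKENS:
            too_long.append(i)
    return too_long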

shahules786 commented 8 months ago

If contexts longer than the context window are not the expected usage, then perhaps the error could be raised by this package before the data is sent to the LLM.

I think this makes the most sense; if we split the context, it might have some unintended effect on the scores. In this situation, a warning should be raised and the score for that particular row should be NaN.

@stydxm What do you think? Let me know if you'd like to work on this.
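
A rough sketch of this proposed behaviour (not actual ragas code): if a row's prompt exceeds the model's context window, emit a warning and record NaN instead of aborting the run. The whitespace-based token count and the score_or_nan helper are purely illustrative.

import math
import warnings

MAX_TOKENS = 4097  # e.g. gpt-3.5-turbo

def score_or_nan(prompt, score_fn):
    """Score one row, or return NaN with a warning if the prompt is too long."""
    if len(prompt.split()) > MAX_TOKENS:  # crude proxy for a real tokenizer
        warnings.warn("Prompt exceeds the model's context window; scoring row as NaN.")
        return math.nan
    return score_fn(prompt)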

stydxm commented 8 months ago

Hi @shahules786 ,
I think this is indeed a problem, but honestly I have no idea how to deal with it properly. Maybe it should be discussed further.
Splitting overly long contexts automatically may affect the scores, but on the other hand, refusing to evaluate long contexts may restrict how this package can be used. So we need more people's opinions to make a choice.
There is also the possibility that contexts is not meant to hold long texts and I simply misunderstood it. In my dataset, I put the whole passage into contexts, the questions into question, their answers into ground_truths, and the responses from my RAG program into answer. I split the passages manually in my RAG program, so it doesn't raise such errors there. If my understanding is wrong, I would appreciate it if you could tell me what the correct usage is.
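
An illustrative example of the column layout described here (not the actual Natural Questions data): each row's contexts field holds the retrieved chunks as a list of strings, so a long passage can be pre-split before evaluation.

from datasets import Dataset

data = {
    "question": ["Who wrote the article?"],
    "answer": ["The article was written by Jane Doe."],
    "contexts": [["Chunk 1 of the Wikipedia page ...", "Chunk 2 ..."]],
    "ground_truths": [["Jane Doe wrote the article."]],
}
dataset = Dataset.from_dict(data)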

shahules786 commented 8 months ago

@stydxm IMO, this error mainly pops up for users of gpt-3.5-turbo, which has a 4k context length. 99% of users solve this by using the 16k version. Can you try that out?
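
The suggested workaround, reusing the dataset from the reproduction snippet above and assuming access to the 16k-context variant of the model:

from langchain_openai.chat_models import ChatOpenAI
from ragas import evaluate
from ragas.metrics import faithfulness

gpt = ChatOpenAI(model_name="gpt-3.5-turbo-16k")  # 16k context window
result = evaluate(dataset, metrics=[faithfulness], llm=gpt)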

stydxm commented 8 months ago

@stydxm IMO, this error mainly pops up for users of gpt-3.5-turbo, which has a 4k context length. 99% of users solve this by using the 16k version. Can you try that out?

I tested it and found that most of the data could be evaluated, but a very small number of rows still couldn't fit in the context window.
I use Google's Natural Questions as my dataset, in which each item is a Wikipedia page, so it's normal for some contexts to be very long.
Given these results, I think raising an error, or setting the score to NaN while raising a warning at the same time, is the better choice.
In any case, in the current version this error interrupts the evaluation process, which I don't think is appropriate.

joy13975 commented 8 months ago

I also occasionally get context length errors due to some outlier documents that are simply too long. It'd be nice if there were an option to control what happens when the context is too long, before the prompt is sent to the LLM.
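
One possible pre-processing step along these lines (a hypothetical workaround, not an existing ragas option), assuming the Hugging Face Dataset from the reproduction snippet: truncate each context to a character budget before evaluation so outlier documents cannot overflow the prompt.

MAX_CHARS = 12_000  # rough character proxy for the token budget; adjust per model

def truncate_contexts(row):
    row["contexts"] = [c[:MAX_CHARS] for c in row["contexts"]]
    return row

dataset = dataset.map(truncate_contexts)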

anthonyivn2 commented 7 months ago

@shahules786 is the behaviour you are describing above implemented by setting raise_exceptions=False in evaluate()?
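
If the parameter meant here is raise_exceptions (present in later ragas releases), usage would look like the sketch below; whether it also turns failed rows into NaN scores is exactly the question being asked.

result = evaluate(
    dataset,
    metrics=[faithfulness],
    llm=gpt,
    raise_exceptions=False,  # don't abort the whole run on per-row errors
)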