explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

Feature suggestion - Handling LLM quotas when evaluating #554

Closed 0ENZO closed 3 months ago

0ENZO commented 7 months ago

It would be nice to handle LLM quotas when evaluating a large dataset; in my case I cannot increase the default limit of 60 requests per minute for the VertexAI LLM.

Tracking LLM calls for the current minute inside .evaluate() might be overkill. Offering the option to set a time.sleep() between samples might do the trick.
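
To make the idea concrete, here is a rough sketch of that workaround: calling evaluate() one row at a time with a pause in between. The helper name and delay value are hypothetical, not ragas API; it assumes a HuggingFace `datasets.Dataset` in the format ragas expects.

```python
import time

from datasets import Dataset
from ragas import evaluate

def evaluate_with_sleep(dataset: Dataset, metrics, delay_seconds: float = 5.0):
    """Hypothetical workaround: score one row at a time, pausing between
    calls so the run stays under a requests-per-minute quota."""
    results = []
    for i in range(len(dataset)):
        row = dataset.select([i])  # single-row slice of the eval dataset
        results.append(evaluate(row, metrics=metrics))
        time.sleep(delay_seconds)  # throttle before the next sample
    return results
```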

I don't know what you guys think. Am I the only one to encounter such a problem?

jjmachan commented 7 months ago

That is a great idea @0ENZO! We use tenacity under the hood, and https://github.com/explodinggradients/ragas/blob/main/src/ragas/run_config.py is what we have to configure things like this.
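
For reference, the retry pattern ragas wires up through that RunConfig looks roughly like this minimal standalone tenacity sketch; the function and values here are illustrative, not ragas internals:

```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(10),                  # give up after 10 attempts
    wait=wait_exponential(multiplier=1, max=60),  # back off, capped at 60 s
)
def call_llm(prompt: str) -> str:
    ...  # placeholder for an LLM call that may hit a rate limit
```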

I'll add a sleep option to it, and that should help you.

jjmachan commented 7 months ago

We should add something for stats too, I guess, so you can see num_tokens, cost, performance figures, etc.

What do you think about those? Have you felt the need for that? If you could only choose one, which would it be?

0ENZO commented 7 months ago

Sounds good, thanks!

Regarding performance figures, num_tokens, etc., I haven't had any such need yet.

jjmachan commented 7 months ago

Hey @0ENZO, after thinking about it a bit more, it seems like a more complicated solution to implement because of how we have things set up.

The core problem here is contention for resources; we could have fixed it in two ways:

  1. Collect all the LLM and embedding calls ragas makes and implement something like a leaky bucket so that the number of requests per minute stays constant (see the sketch after this list).
  2. Exponential backoff, as explained here. We went with this.
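
For context, option 1 would look roughly like the snippet below. This is just a sketch of the leaky-bucket idea, not anything ragas implements:

```python
import threading
import time

class LeakyBucket:
    """Illustrative only: paces callers to a fixed requests-per-minute rate."""

    def __init__(self, requests_per_minute: int):
        self.interval = 60.0 / requests_per_minute  # seconds between requests
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def acquire(self) -> None:
        """Block until the next request slot opens up."""
        with self.lock:
            now = time.monotonic()
            wait = self.next_slot - now
            self.next_slot = max(now, self.next_slot) + self.interval
        if wait > 0:
            time.sleep(wait)
```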

So the solution today is configuring the exponential backoff for 60 requests per minute. Right now I don't have a good formula for that, but that is something we could find, right?

So the fix for your problem today is configuring the RunConfig with the correct max_retries and max_wait (and maybe a few more fields, I'll look into that). What do you think?
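
For example, something along these lines; the exact values are guesses for a 60 requests/minute quota, not a derived formula, so check run_config.py for the full field list and defaults:

```python
from ragas import evaluate
from ragas.run_config import RunConfig

result = evaluate(
    dataset,          # your eval dataset
    metrics=metrics,  # your chosen metrics
    run_config=RunConfig(
        timeout=120,     # tolerate slow responses instead of timing out
        max_retries=15,  # retry rate-limited calls more times
        max_wait=90,     # let the backoff wait longer between retries
        max_workers=2,   # fewer concurrent calls, fewer 429s
    ),
)
```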

jjmachan commented 7 months ago

Also, I'm running some experiments so that I can get you unblocked without much hassle.

klangst-ETR commented 7 months ago

Do you have a suggestion that I could implement now? I am exceeding my Azure GPT-4 rate limit of 80k tokens per minute when evaluating 48 questions/answers across all metrics. Is there a way to rate-limit the evaluation? Perhaps I should drop some metrics?

bdeck8317 commented 5 months ago

@jjmachan, any suggestions on how we should set the run config? I am also facing this issue with ragas 0.1.7

xiaochaohit commented 2 months ago

> We should add something for stats too, I guess, so you can see num_tokens, cost, performance figures, etc.
>
> What do you think about those? Have you felt the need for that? If you could only choose one, which would it be?

May I ask if there is any plan for this part?

jjmachan commented 1 month ago

This will be fixed with #1156. For documentation on run_config and on how to figure out cost, check out Understand Cost and Usage of Operations | Ragas.
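
Per those docs, the usage roughly looks like this; double-check the exact names against the linked page for your ragas version:

```python
from ragas import evaluate
from ragas.cost import get_token_usage_for_openai

result = evaluate(
    dataset,
    metrics=metrics,
    token_usage_parser=get_token_usage_for_openai,  # record tokens per call
)

print(result.total_tokens())  # input/output token counts for the whole run
print(result.total_cost(cost_per_input_token=5e-6, cost_per_output_token=15e-6))
```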

Hope this helps @xiaochaohit @bdeck8317 @klangst-ETR!