Open jonas-nothnagel opened 1 week ago
Hi @jonas-nothnagel 👋, I'm here to assist you with any bugs, questions, or contributions you have regarding the repository.
To resolve the 429 "Too Many Requests" error when using the HuggingFaceEndpointAPI with the Llama3-70B model, you can implement rate limiting and retry logic in your code. Here is an example of how you can do this in Python:
```python
import time
import random
import logging

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry  # modern import path; requests.packages.urllib3 is a deprecated alias

# Function to make a request to the HuggingFaceEndpointAPI
def make_request(url, headers, data, max_retries=10, max_wait_time=300):
    session = requests.Session()
    retry = Retry(
        total=max_retries,  # total number of retries
        backoff_factor=1,  # base wait time between retries
        status_forcelist=[429, 500, 502, 503, 504],  # retry on these status codes
        allowed_methods=["HEAD", "GET", "OPTIONS", "POST"],  # retry on these methods (method_whitelist was removed in urllib3 2.0)
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)

    for attempt in range(max_retries):
        try:
            logging.info(f"Attempt {attempt + 1} of {max_retries}")
            response = session.post(url, headers=headers, json=data)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.HTTPError as e:
            if e.response is not None and e.response.status_code == 429:
                # Exponential backoff with jitter, capped at max_wait_time
                wait_time = min((2 ** attempt) + random.uniform(0, 1), max_wait_time)
                logging.warning(f"Rate limit hit. Waiting {wait_time:.2f} seconds.")
                time.sleep(wait_time)
            else:
                logging.error(f"Unexpected error: {e}")
                raise
    raise Exception("Max retries reached")
```
```python
# Example usage
url = "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B"
headers = {"Authorization": "Bearer YOUR_API_KEY"}
data = {"inputs": "Your input data"}

response = make_request(url, headers, data)
if response:
    print("Request successful")
    print(response)
else:
    print("Request failed")
```
This code sets up a retry mechanism that automatically retries the request when it receives a 429 status code, with exponential backoff between attempts. Adjust the `total` and `backoff_factor` parameters as needed to fit your use case [1].
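To see concretely what those settings do, the delay schedule produced by the formula in the retry loop above can be computed in isolation (a standalone sketch; the function name and defaults here are illustrative, not part of the original snippet):

```python
import random

def backoff_delays(max_retries=10, max_wait_time=300):
    """Return the capped, jittered exponential backoff delay for each attempt."""
    return [
        min((2 ** attempt) + random.uniform(0, 1), max_wait_time)
        for attempt in range(max_retries)
    ]

# Delays roughly double each attempt (~1s, ~2s, ~4s, ...) until hitting the cap
print([round(d, 1) for d in backoff_delays(max_retries=6)])
```

With the default cap of 300 seconds, the schedule stops growing after attempt 8 or so, which keeps a long retry run from sleeping for hours.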
Additionally, you can use the `RunConfig` class in the `ragas` library to manage retries and rate limiting more effectively. Here is an example of how to configure it:
```python
from ragas.run_config import RunConfig, add_async_retry, add_retry

# Configure run_config for docstore
run_config = RunConfig(max_retries=15, max_wait=90)
generator.set_run_config(run_config)

# Use the function with retry logic
testset = make_request_with_backoff()
```
This configuration will help manage rate limits and retries within the `ragas` library [2][3].
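Note that `make_request_with_backoff` is not defined in the snippet above. One possible shape for such a helper is a generic backoff wrapper around the generation call (a hypothetical sketch; the function name, its parameters, and the `generate_fn` callable are assumptions, not `ragas` API):

```python
import logging
import random
import time

def make_request_with_backoff(generate_fn, max_retries=15, max_wait=90):
    """Call generate_fn, retrying with capped exponential backoff on failure."""
    for attempt in range(max_retries):
        try:
            return generate_fn()
        except Exception as e:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the last error
            wait = min((2 ** attempt) + random.uniform(0, 1), max_wait)
            logging.warning(f"Attempt {attempt + 1} failed ({e}); retrying in {wait:.1f}s")
            time.sleep(wait)

# Hypothetical usage, assuming a ragas TestsetGenerator named `generator`
# and a list of LangChain `documents`:
# testset = make_request_with_backoff(
#     lambda: generator.generate_with_langchain_docs(documents, test_size=10)
# )
```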
To continue talking to Dosu, mention @dosu.
[x] I have checked the documentation and related resources and couldn't resolve my bug.
I want to create synthetic test data. Using the OpenAI or Anthropic API is very expensive, so I want to use the HuggingFaceEndpointAPI and run it with Llama3-70B. I am running it on a GPU compute cluster and would assume that my compute is sufficient to load and run the model. However, I always run into rate limit error 429, immediately when the script starts.
Ragas version: newest release (5th July)
Python version: 3.10
Please find below the code. I triple-checked it and it should be correct. Assume we have the LangChain document artifacts for this, of course; I excluded the data loading for better readability.
Code to Reproduce
Error trace
I expect the script to run and to use Llama3 to generate the test data.
Additional context: I am running this on 6 Nvidia A10s using a PyTorch image and Python 3.10. Specifications from the job script are below: