explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

429 Request Error with Langchain Huggingface Endpoint #1070

Open jonas-nothnagel opened 1 week ago

jonas-nothnagel commented 1 week ago

[x] I have checked the documentation and related resources and couldn't resolve my bug.

I want to create synthetic test data. Using the OpenAI or Anthropic API is very expensive, so I want to use the HuggingFaceEndpoint API and run it with Llama3-70B. I am running it on a GPU compute cluster and would assume that my compute is sufficient to load and run the model. However, I always run into rate limit error 429, immediately when the script starts.

Ragas version: newest release (5th July)
Python version: 3.10
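
A quick way to check whether the 429 comes from the hosted endpoint itself rather than from ragas is to call the serverless Inference API directly. A minimal sketch (assumes HUGGINGFACEHUB_API_TOKEN is set in the environment):

import os
import requests

# Call the serverless Inference API directly; if this already returns 429,
# the limit is on the endpoint/token, not in ragas.
resp = requests.post(
    "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B",
    headers={"Authorization": f"Bearer {os.environ['HUGGINGFACEHUB_API_TOKEN']}"},
    json={"inputs": "Hello", "parameters": {"max_new_tokens": 8}},
)
print(resp.status_code, resp.text[:200])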

Please find below the code. I triple-checked it and it should be correct. Assume we have a LangChain document artifact for this, of course. I excluded the data loading for better visibility.

Code to Reproduce

import pandas as pd
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context, conditional
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_text_splitters import TokenTextSplitter

from langchain_huggingface import HuggingFaceEndpoint, HuggingFaceEmbeddings

import os
import re
import torch

import time
import random
import logging

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(f'running on device: {device}')

if __name__ == '__main__':

    def make_request_with_backoff(max_retries=10, max_wait_time=300):
        for attempt in range(max_retries):
            try:
                logging.info(f"Attempt {attempt + 1} of {max_retries}")
                testset = generator.generate_with_langchain_docs(docs, 50, distributions, raise_exceptions=False)
                logging.info(f"Successfully generated {len(testset)} items")
                return testset
            except Exception as e:
                if "429" in str(e):
                    wait_time = min((2 ** attempt) + random.uniform(0, 1), max_wait_time)
                    logging.warning(f"Rate limit hit. Waiting {wait_time:.2f} seconds.")
                    time.sleep(wait_time)
                else:
                    logging.error(f"Unexpected error: {str(e)}")
                    raise e
        raise Exception("Max retries reached")

    #load models
    llm = HuggingFaceEndpoint(
        repo_id="meta-llama/Meta-Llama-3-70B",
        task="text-generation",
        max_new_tokens=512,
        repetition_penalty=1.03,
    )

    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

    # use the same endpoint for both generation and critique
    generator_llm = llm
    critic_llm = llm

    generator = TestsetGenerator.from_langchain(
        generator_llm,
        critic_llm,
        embeddings
    )

    # Change the resulting question type distribution
    distributions = {  # weights must sum to 1
        simple: 0.1,
        reasoning: 0.35,
        multi_context: 0.2,
        conditional: 0.35
    }

    # Configure logging
    logging.basicConfig(level=logging.INFO)

    # Use the function
    testset = make_request_with_backoff()

    # store data (to_pandas returns a DataFrame; the testset itself has no to_parquet)
    df = testset.to_pandas()
    os.makedirs('synthetic_data', exist_ok=True)
    df.to_parquet('synthetic_data/ragas_llama3_qa.parquet')

Error trace

File "/usr/local/lib/python3.10/dist-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B')
**Expected behavior**

I expect the script to run and use Llama3 to generate the test data.

**Additional context**

I am running this on 6 NVIDIA A10s using a PyTorch image and Python 3.10. Specifications from the job script are below:

#!/bin/bash

set -xe

srun \
  --gpus=6 \
  --mem=144GB \
  --container-image=/data/enroot/nvcr.io_nvidia_pytorch_23.06-py3.sqsh \
  --container-workdir=`pwd` \
  --container-mounts=/my_path...\
  ./job.sh
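
Note that the traceback shows the request going to api-inference.huggingface.co, i.e. the hosted serverless API rather than the cluster GPUs. For comparison, a minimal sketch of pointing HuggingFaceEndpoint at a self-hosted text-generation-inference server instead (the server address is a placeholder, not a tested value):

from langchain_huggingface import HuggingFaceEndpoint

# Target a text-generation-inference (TGI) server running on the cluster
# instead of the hosted serverless API; the URL below is a placeholder.
llm = HuggingFaceEndpoint(
    endpoint_url="http://localhost:8080",  # assumed local TGI address
    task="text-generation",
    max_new_tokens=512,
    repetition_penalty=1.03,
)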
dosubot[bot] commented 1 week ago

Hi @jonas-nothnagel 👋, I'm here to assist you with any bugs, questions, or contributions you have regarding the repository.

To resolve the 429 "Too Many Requests" error when using the HuggingFaceEndpoint API with the Llama3-70B model, you can implement rate limiting and retry logic in your code. Here is an example of how you can do this in Python:

import time
import random
import logging
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry  # requests.packages.urllib3 is deprecated

# Make a request to the Hugging Face Inference API with retry and backoff
def make_request(url, headers, data, max_retries=10, max_wait_time=300):
    session = requests.Session()
    # The adapter transparently retries transient server errors; 429 is
    # handled by the explicit backoff loop below, so it is not listed here.
    retry = Retry(
        total=max_retries,  # total number of adapter-level retries
        backoff_factor=1,  # wait backoff_factor * 2**(retry - 1) seconds between adapter retries
        status_forcelist=[500, 502, 503, 504],  # retry on these server errors
        allowed_methods=["HEAD", "GET", "OPTIONS", "POST"]  # renamed from method_whitelist in urllib3 1.26
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)

    for attempt in range(max_retries):
        try:
            logging.info(f"Attempt {attempt + 1} of {max_retries}")
            response = session.post(url, headers=headers, json=data)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.HTTPError as e:
            if e.response is not None and e.response.status_code == 429:
                # Exponential backoff with jitter, capped at max_wait_time
                wait_time = min((2 ** attempt) + random.uniform(0, 1), max_wait_time)
                logging.warning(f"Rate limit hit. Waiting {wait_time:.2f} seconds.")
                time.sleep(wait_time)
            else:
                logging.error(f"Unexpected error: {str(e)}")
                raise
    raise Exception("Max retries reached")

# Example usage
url = "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B"
headers = {"Authorization": "Bearer YOUR_API_KEY"}
data = {"inputs": "Your input data"}

response = make_request(url, headers, data)

if response:
    print("Request successful")
    print(response)
else:
    print("Request failed")

This code retries the request with exponential backoff and jitter whenever it receives a 429 status code, while the mounted adapter transparently retries transient server errors. With backoff_factor=1, the adapter waits backoff_factor * 2**(retry - 1) seconds between its own retries. Adjust the total and backoff_factor parameters as needed to fit your use case [1].

Additionally, you can use the RunConfig class in the ragas library to manage retries and rate limiting more effectively. Here is an example of how to configure it:

from ragas.run_config import RunConfig

# Configure retries and backoff for the generator and its docstore
run_config = RunConfig(max_retries=15, max_wait=90)
generator.set_run_config(run_config)

# Use the function with retry logic
testset = make_request_with_backoff()

This configuration will help manage rate limits and retries within the ragas library [2][3].
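
If the endpoint also enforces a concurrency limit, lowering max_workers in the same RunConfig reduces how many requests ragas issues in parallel. A sketch (max_workers is a RunConfig field in recent ragas releases; the values are illustrative, not tuned):

from ragas.run_config import RunConfig

# Fewer parallel workers means fewer simultaneous requests against the
# rate-limited endpoint; values here are illustrative.
run_config = RunConfig(
    max_retries=15,  # retry each failed call up to 15 times
    max_wait=90,     # cap the backoff wait at 90 seconds
    max_workers=2,   # throttle concurrency
)
generator.set_run_config(run_config)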

To continue talking to Dosu, mention @dosu.