Open Tobi696 opened 1 week ago
Hello @Tobi696, Thanks for reporting this issue. Could you give more details about the dataframe size that you're trying to use as KB? For example, the number of rows, the max number of characters in a single row and the average number of characters per row.
Thanks for looking into it! I'm not that experienced in Python and pandas; I hope this code does what we need:
df = pd.DataFrame([d.page_content for d in documents], columns=["text"])
print(f'Number of rows: {df.shape[0]}')
print(f'Max number of characters in a single row: {df.text.str.len().max()}')
print(f'Average number of characters in a single row: {df.text.str.len().mean()}')
Number of rows: 168
Max number of characters in a single row: 286419
Average number of characters in a single row: 21418.47619047619
So these high numbers are the problem?
@Tobi696 yes, it seems so, because the last line of the error log was:
openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 43874 tokens (43874 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}
And since the default OpenAIEmbedding has a batch_size of 40, it will send 40 rows of your dataframe in a single embedding request.
You can try to reduce the number of characters in each row or reduce the batch_size, for example by executing the following code:
from openai import OpenAI
from giskard.llm.embeddings.openai import OpenAIEmbedding
client = OpenAI(...)
# create a custom embedding model to reduce the batch_size
embedding_model = OpenAIEmbedding(client=client, model="text-embedding-ada-002", batch_size=4)
# set the created embedding model as the default one
giskard.llm.embeddings.set_default_embedding(embedding_model)
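As a rough sanity check (a sketch, not Giskard's internal logic; the ~4 characters per token figure is only a common heuristic for English text), you can estimate how many tokens one batched embedding request would contain from the dataframe stats reported above:

```python
# Rough estimate of tokens per embedding request.
# Assumptions: ~4 characters per token (heuristic), and that a batch
# combines `batch_size` rows into a single request.

def estimate_request_tokens(chars_per_row: float, batch_size: int,
                            chars_per_token: float = 4.0) -> int:
    """Estimate the token count of one batched embedding request."""
    return int(chars_per_row / chars_per_token * batch_size)

# Numbers from the dataframe stats reported above:
avg_chars = 21418.48
max_chars = 286419
print(estimate_request_tokens(avg_chars, batch_size=40))  # far above 8192
print(estimate_request_tokens(avg_chars, batch_size=4))   # still above 8192
print(estimate_request_tokens(max_chars, batch_size=1))   # even one row can exceed the limit
```

This suggests that lowering batch_size alone may not be enough: the longest row by itself is estimated well above the 8192-token limit.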
I get the same error with a reduced batch size, unfortunately...
from langchain_community.document_loaders import DirectoryLoader
from giskard.llm.client.openai import OpenAIClient
from giskard.llm.embeddings.openai import OpenAIEmbedding
from giskard.rag import generate_testset
from giskard.rag import KnowledgeBase
import pandas as pd
import giskard
from openai import OpenAI
openai_client = OpenAIClient(model="gpt-4o-mini")
giskard.llm.set_llm_api("openai")
giskard.llm.set_default_client(openai_client)
client = OpenAI()
embedding_model = OpenAIEmbedding(client=client, model="text-embedding-ada-002", batch_size=4)
giskard.llm.embeddings.set_default_embedding(embedding_model)
# documents = load your documents
loader_txt = DirectoryLoader("./website", glob="**/*.txt", show_progress=True)
loader_pdf = DirectoryLoader("./pdf", glob="**/*.pdf", show_progress=True)
documents_txt = loader_txt.load()
documents_pdf = loader_pdf.load()
documents = documents_txt + documents_pdf
df = pd.DataFrame([d.page_content for d in documents], columns=["text"])
print(f'Number of rows: {df.shape[0]}')
print(f'Max number of characters in a single row: {df.text.str.len().max()}')
print(f'Average number of characters in a single row: {df.text.str.len().mean()}')
knowledge_base = KnowledgeBase(df, embedding_model=embedding_model)
testset = generate_testset(
    knowledge_base,
    num_questions=5,
    agent_description="Ein Chatbot, der Fragen zur Website beantwortet",  # "A chatbot that answers questions about the website"
    language='de',
)
testset.save("test_questions.jsonl")
@Tobi696 I managed to reproduce the same error on my side, I'll investigate and get back to you as soon as I have a solution
Hello @Tobi696, indeed it seems that the number of tokens in a single row has exceeded the model limit. You can check the number of tokens for each row by executing the following code:
import tiktoken
MODEL_NAME = "text-embedding-ada-002"
def num_tokens_from_string(string: str, model_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.encoding_for_model(model_name)
    return len(encoding.encode(string))

df["num_tokens"] = df["text"].apply(lambda x: num_tokens_from_string(x, MODEL_NAME))
print("Max num_tokens", df["num_tokens"].max())
If this number is higher than 8192, it won't work.
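Since even a single row can exceed the 8192-token limit, the long rows need to be split into smaller chunks before building the KnowledgeBase. Here is a minimal character-based sketch (real pipelines often use a dedicated splitter such as LangChain's RecursiveCharacterTextSplitter; the max_chars=8000 default is an assumed conservative value, chosen because at roughly 4 characters per token it stays well under the model limit):

```python
import pandas as pd

def split_text(text: str, max_chars: int = 8000, overlap: int = 200) -> list[str]:
    """Split a long string into overlapping chunks of at most max_chars characters."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

def split_dataframe(df: pd.DataFrame, column: str = "text") -> pd.DataFrame:
    """Explode rows whose text is too long into multiple shorter rows."""
    rows = [chunk for text in df[column] for chunk in split_text(text)]
    return pd.DataFrame(rows, columns=[column])

# Usage (hypothetical, following the reproduction script above):
# df_chunked = split_dataframe(df)
# knowledge_base = KnowledgeBase(df_chunked, embedding_model=embedding_model)
```

After splitting, it may be worth re-running the tiktoken check above on the chunked dataframe to confirm every row is under the limit.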
Issue Type
Bug
Source
source
Giskard Library Version
2.15.2
OS Platform and Distribution
macos
Python version
3.11