Jothimalar1997 opened this issue 2 days ago
Hey there, @Jothimalar1997! I'm here to help you with bugs, questions, and more. Let's tackle this issue together!
The `BadRequestError: Unsupported data type` error usually occurs when the documents are not in the expected format. For the `generate_with_llamaindex_docs` function, ensure that your documents are loaded correctly using a loader from the `llama_index` library, such as `SemanticScholarReader`.
Here's a quick guide to ensure your documents are in the correct format:
**Load Documents Properly:** Use a loader like `SemanticScholarReader` to load your documents. Here's an example:
```python
from llama_index.core import download_loader

def get_documents():
    SemanticScholarReader = download_loader("SemanticScholarReader")
    loader = SemanticScholarReader()
    query_space = "large language models"
    documents = loader.load_data(query=query_space, limit=10)
    return documents

documents = get_documents()
```
**Pass Documents to the Function:** Ensure that the documents are passed correctly to the `generate_with_llamaindex_docs` function:
```python
testset = generator.generate_with_llamaindex_docs(
    documents=documents,
    test_size=50,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
```
**Check Document Transformation:** Ensure that the `Document.from_llamaindex_document` method is correctly transforming your documents. If there are specific requirements for this transformation, make sure they are met.
If you've followed these steps and the error persists, double-check the document structure and ensure all dependencies are correctly installed and compatible with your Ragas version [1].
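One way to double-check the document structure before calling the generator is a small duck-typed sanity check. This is a sketch, not a Ragas API: `validate_documents` is a hypothetical helper that assumes each entry exposes its content via a `.text` attribute, as llama_index `Document` objects do.

```python
def validate_documents(documents):
    """Fail fast with a clear message instead of an opaque BadRequestError downstream.

    Hypothetical helper: assumes llama_index-style documents that expose
    their content via a `.text` attribute.
    """
    for i, doc in enumerate(documents):
        if not hasattr(doc, "text"):
            raise TypeError(
                f"documents[{i}] is {type(doc).__name__}; expected a llama_index Document"
            )
        if not doc.text or not doc.text.strip():
            raise ValueError(f"documents[{i}] has empty text")
    return True
```

Running this right after loading tells you immediately whether a plain string, dict, or empty page slipped into the list.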
To continue talking to Dosu, mention @dosu.
@dosu How can I load my custom PDF files using `SemanticScholarReader`, since it is meant for querying academic papers?
I need to load my own PDF files and generate a test set (question + answer pairs) from those files using the Ragas generator.
When I follow the steps above, I get the error mentioned earlier.
To load your custom PDF files using `SemanticScholarReader`, you can follow these steps:
**Load Your Custom PDF Files:** If `SemanticScholarReader` is not suitable for non-academic PDFs, you might need to use a different loader that supports custom PDFs. However, if you still want to try using `SemanticScholarReader`, ensure your query is set to match your documents.
**Generate a Test Set:** Once your documents are loaded, you can generate a test set using the generator. Here's a general approach:
```python
from llama_index.core import download_loader
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Load documents (these are llama_index Document objects)
SemanticScholarReader = download_loader("SemanticScholarReader")
loader = SemanticScholarReader()
documents = loader.load_data(query="your_query", full_text=True, limit=10)

# Initialize the generator
generator_llm = ChatOpenAI(model="gpt-4o-mini")
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings,
)

# Define question type distribution
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1,
}

# Generate the test set; use generate_with_llamaindex_docs here,
# since the loader returned llama_index documents
testset = generator.generate_with_llamaindex_docs(
    documents,
    test_size=10,
    distributions=distributions,
)
testset_df = testset.to_pandas()
```
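The evolution weights in `distributions` are expected to sum to 1.0. A quick pre-flight check (a sketch; `check_distributions` is a hypothetical helper, not part of Ragas) can surface a typo in the weights before any API calls are made:

```python
def check_distributions(distributions, tol=1e-9):
    """Raise early if the evolution weights do not sum to 1.0.

    Hypothetical helper; `distributions` maps evolution types to floats.
    """
    total = sum(distributions.values())
    if abs(total - 1.0) > tol:
        raise ValueError(f"distribution weights sum to {total}, expected 1.0")
    return True
```

Calling this on the dict above before invoking the generator costs nothing and gives a much clearer error message than a failure deep inside the generation loop.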
If `SemanticScholarReader` doesn't work for your custom PDFs, consider using a different document loader that supports your file type. Ensure your documents are in the correct format before passing them to the generator [1].
@dosu Can you provide examples with document loaders other than SimpleDirectoryReader?
I couldn't find specific examples in the Ragas documentation for using document loaders other than `SimpleDirectoryReader`. However, you can consider using libraries like `PyPDF2` or `pdfplumber` to load PDF files and then convert them into a format compatible with Ragas.
Here's a basic example using `PyPDF2`:
```python
import PyPDF2
from langchain_core.documents import Document

def load_pdf(file_path):
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ''
        for page in reader.pages:
            # extract_text() can return None for empty pages
            text += page.extract_text() or ''
    return text

# Example usage
pdf_text = load_pdf('your_file.pdf')
documents = [Document(page_content=pdf_text, metadata={'source': 'your_file.pdf'})]

# Now you can pass `documents` to `generate_with_langchain_docs`
```
This approach extracts text from a PDF and structures it in a way that can be used with Ragas' document processing functions. Adjust the metadata as needed for your use case.
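Concatenating every page into one string loses page boundaries. A common alternative (a sketch; `split_pages` is a hypothetical helper, not part of Ragas or PyPDF2) keeps one entry per page, so the generator's splitter receives smaller chunks with useful metadata:

```python
def split_pages(page_texts, source):
    """Group per-page text into (text, metadata) pairs, skipping blank pages.

    Hypothetical helper: `page_texts` is a list of strings, e.g.
    [page.extract_text() or '' for page in reader.pages]. Wrap each
    pair in whatever document class your generator expects.
    """
    pairs = []
    for page_num, text in enumerate(page_texts, start=1):
        text = (text or '').strip()
        if not text:  # skip blank pages so the generator is not fed empty chunks
            continue
        pairs.append((text, {'source': source, 'page': page_num}))
    return pairs
```

Per-page metadata also makes it easier to trace a generated question back to the page it came from.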
I have checked the https://github.com/explodinggradients/ragas/blob/main/docs/howtos/integrations/llamaindex.ipynb and related resources and couldn't resolve my bug.
When I try to generate the testset from the text file using generator, it gives the following error:
BadRequestError: Unsupported data type
Ragas version: 0.1.21
Python version: 3.12.2
Code to Reproduce

```python
testset = generator.generate_with_llamaindex_docs(
    documents,
    test_size=50,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
```
Error trace

```
BadRequestError                           Traceback (most recent call last)
Cell In[17], line 2
      1 # generate testset
----> 2 testset = generator.generate_with_llamaindex_docs(
      3     documents,
      4     test_size=50,
      5     distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
      6 )

File c:\Users\jothimalar.p.lv\Documents\LLM POCS\TestSet generation\testsetenv\Lib\site-packages\ragas\testset\generator.py:179, in TestsetGenerator.generate_with_llamaindex_docs(self, documents, test_size, distributions, with_debugging_logs, is_async, raise_exceptions, run_config)
    177 distributions = distributions or {}
    178 # chunk documents and add to docstore
--> 179 self.docstore.add_documents(
    180     [Document.from_llamaindex_document(doc) for doc in documents]
    181 )
    183 return self.generate(
    184     test_size=test_size,
    185     distributions=distributions,
   (...)
    189     raise_exceptions=raise_exceptions,
    190 )

File c:\Users\jothimalar.p.lv\Documents\LLM POCS\TestSet generation\testsetenv\Lib\site-packages\ragas\testset\docstore.py:214, in InMemoryDocumentStore.add_documents(self, docs, show_progress)
    209 # split documents with self.splitter into smaller nodes
   (...)
   1641 retries_taken=retries_taken,
   1642 )
```
Expected behavior
It should generate the test set: question and answer pairs.