Closed joe-barhouch closed 1 year ago
Hi @joe-barhouch, thanks for raising this issue. Can you share the Ragas and Python versions you're using?
Ragas v0.0.19
and Python 3.8.11
Thank you @joe-barhouch. Can you also share the code used to load documents?
def get_pdf_docs(pdf_docs):
    """Get text from PDF documents"""
    docs = []
    pdf_reader = PdfReader(pdf_docs)
    for page in pdf_reader.pages:
        docs.append(
            Document(
                page_content=page.extract_text(),
                metadata={
                    "page": page.page_number + 1,
                },
            )
        )
    return docs

def get_docs(documents):
    doc_list = []
    for doc in documents:
        docs = get_pdf_docs(doc)
        doc_list.append(docs)
    return doc_list

def get_document_chunks(documents):
    """Split documents into chunks"""
    documents = [item for sublist in documents for item in sublist]
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=3000,
        chunk_overlap=100,
    )
    chunks = text_splitter.split_documents(documents)
    return chunks
This is how I create the chunks, using PdfReader and LangChain's recursive text splitter.
Hey @joe-barhouch, the .generate function does not accept nodes/chunks. You should pass a list of documents instead of chunks.
Why don't we accept chunks? Primarily to avoid chunk-size bias when creating questions, and also to enable synthesizing high-quality multi-document questions.
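To illustrate the difference between passing chunks and passing whole documents, here is a minimal sketch that merges per-page text into one document per source file. `Doc` is a hypothetical stand-in for LangChain's `Document` (so the snippet runs without extra dependencies), and `merge_pages` is an illustrative helper, not part of Ragas or LangChain.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document (text plus metadata)."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def merge_pages(pages: List[Doc], source: str) -> Doc:
    """Join per-page documents into a single document for one source file."""
    # Skip pages with no extracted text so they don't add empty separators.
    text = "\n\n".join(p.page_content for p in pages if p.page_content)
    return Doc(page_content=text, metadata={"source": source, "pages": len(pages)})


pages = [
    Doc("Intro text.", {"page": 1}),
    Doc("Details text.", {"page": 2}),
]
whole = merge_pages(pages, source="mypdf.pdf")
print(whole.page_content)
print(whole.metadata)
```

Note that `page_content` holds the extracted text itself, not the file name; the file name only appears in the metadata.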
But the chunks are also List[Document]. How should I do it differently? Have the entire PDF in one Document?
Still get the same error.
I have doc = Document(page_content="mypdf.pdf")
Then:
testsetgenerator = TestsetGenerator.from_default()
test_size = 1
testset = testsetgenerator.generate([doc], test_size=test_size)
A test_size of 5 didn't work; I was prompted to reduce the test_size. So I changed it to 1, and I got the same error:
type object is not subscriptable
Hey, sorry for the confusion. I found the issue: Python 3.8 does not support subscripting the built-in list type. It can be fixed by changing this to t.List here.
Would you like to contribute? Or else I can raise a PR.
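For context, subscripting built-in collection types such as `list[int]` at runtime only became possible in Python 3.9 (PEP 585), while `typing.List[int]` can be subscripted on older versions too. A small sketch that checks this on whatever interpreter runs it; the helper name `builtin_list_is_subscriptable` is made up for illustration.

```python
import sys
import typing as t


def builtin_list_is_subscriptable() -> bool:
    """Return True if list[int] can be evaluated at runtime on this interpreter."""
    try:
        list[int]  # raises TypeError on Python 3.8 and earlier
    except TypeError:
        return False
    return True


# typing.List can be subscripted regardless of the 3.8 / 3.9 boundary.
alias = t.List[int]

print(sys.version_info[:2], builtin_list_is_subscriptable())
```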
Changing doc_nodes_map: t.Dict[str, t.List[BaseNode]] = defaultdict(list[BaseNode])
to
doc_nodes_map: t.Dict[str, t.List[BaseNode]] = defaultdict(t.List[BaseNode])
Results in the error:
256 for node in documenet_nodes:
257 if node.ref_doc_id:
--> 258 doc_nodes_map[node.ref_doc_id].append(node)
260 return doc_nodes_map
File ~/.pyenv/versions/3.8.11/lib/python3.8/typing.py:727, in _GenericAlias.__call__(self, *args, **kwargs)
725 def __call__(self, *args, **kwargs):
726 if not self._inst:
--> 727 raise TypeError(f"Type {self._name} cannot be instantiated; "
728 f"use {self._name.lower()}() instead")
729 result = self.__origin__(*args, **kwargs)
730 try:
TypeError: Type List cannot be instantiated; use list() instead
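The traceback shows that `defaultdict` calls its argument to create each missing value, and a `typing` alias such as `t.List[...]` cannot be called. A minimal sketch of one way around this, assuming only the annotation needs to carry the element type: pass the plain `list` as the factory. `Node` is a hypothetical stand-in for `BaseNode`, and this is not necessarily the change made in the actual fix.

```python
import typing as t
from collections import defaultdict


class Node:
    """Hypothetical stand-in for BaseNode, with only the field used here."""

    def __init__(self, ref_doc_id: t.Optional[str]) -> None:
        self.ref_doc_id = ref_doc_id


def group_by_doc(nodes: t.List[Node]) -> t.Dict[str, t.List[Node]]:
    # The annotation carries the element type; the factory is the plain callable `list`.
    doc_nodes_map: t.Dict[str, t.List[Node]] = defaultdict(list)
    for node in nodes:
        if node.ref_doc_id:
            doc_nodes_map[node.ref_doc_id].append(node)
    return doc_nodes_map


nodes = [Node("a"), Node("a"), Node("b"), Node(None)]
grouped = group_by_doc(nodes)
print({doc_id: len(items) for doc_id, items in grouped.items()})
```

Because the factory is the built-in `list`, this form does not depend on the 3.8 versus 3.9 difference in subscripting.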
Let me try and raise a fix @joe-barhouch
Thanks for the help :)
No problem. FYI it works in 3.10.
Hey @joe-barhouch , please update ragas from source once #263 is merged :) I would recommend you feed the whole document into the Ragas test generator to make the best use of it. Read more about our approach here
I'm trying to use Ragas to evaluate my LangChain app, which I'm still building at the moment. I have a TestsetGenerator.from_default(), to which I'm feeding a list of Document chunks I created using LangChain's recursive text splitter. When I use from_default() or create my own distribution, I get an error.
I haven't played around much, but I expected that if I copy-pasted the tests from the documentation, I would get an answer back.