TestsetGenerator not working

joe-barhouch commented 1 year ago

I'm trying to use Ragas to evaluate my Langchain app, which I'm still building at the moment. I have a TestsetGenerator.from_default(), to which I'm feeding a list of Document chunks that I created using Langchain's Recursive Text Splitter. Whether I use from_default() or create my own distribution, I get this error:

test_size = 10
testset = testsetgenerator.generate(chunks[:10], test_size=test_size)

->

    252 def _generate_doc_nodes_map(
    253     self, documenet_nodes: t.List[BaseNode]
    254 ) -> t.Dict[str, t.List[BaseNode]]:
--> 255     doc_nodes_map: t.Dict[str, t.List[BaseNode]] = defaultdict(list[BaseNode])
    256     for node in documenet_nodes:
    257         if node.ref_doc_id:

TypeError: 'type' object is not subscriptable

I haven't played around much, but I expected that copy-pasting the tests from the documentation would give me an answer back.

shahules786 commented 1 year ago

Hi @joe-barhouch, thanks for raising this issue. Can you share the Ragas and Python versions you're using?

joe-barhouch commented 1 year ago

Ragas v0.0.19 and Python 3.8.11

shahules786 commented 1 year ago

Thank you @joe-barhouch. Can you also share the code used to load the documents?

joe-barhouch commented 1 year ago

def get_pdf_docs(pdf_docs):
    """Get text from PDF documents"""
    docs = []

    pdf_reader = PdfReader(pdf_docs)
    for page in pdf_reader.pages:
        docs.append(
            Document(
                page_content=page.extract_text(),
                metadata={
                    "page": page.page_number + 1,
                },
            )
        )

    return docs

def get_docs(documents):
    doc_list = []
    for doc in documents:
        docs = get_pdf_docs(doc)
        doc_list.append(docs)
    return doc_list

def get_document_chunks(documents):
    """Split documents into chunks"""
    documents = [item for sublist in documents for item in sublist]
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=3000,
        chunk_overlap=100,
    )

    chunks = text_splitter.split_documents(documents)
    return chunks

This is how I create the chunks, using PdfReader and Langchain's recursive text splitter.

shahules786 commented 1 year ago

Hey @joe-barhouch, the .generate function does not accept nodes/chunks. You should pass a list of documents instead of chunks.

Why don't we accept chunks? Primarily to avoid chunk-size bias when creating questions, and also to enable synthesizing high-quality multi-document questions.

joe-barhouch commented 1 year ago

But the chunks are also List[Document]. How should I do it differently? Have the entire PDF in one Document?

joe-barhouch commented 1 year ago

I still get the same error. I have doc = Document(page_content="mypdf.pdf"), then:

testsetgenerator = TestsetGenerator.from_default()
test_size = 1   
testset = testsetgenerator.generate([doc], test_size=test_size)

A test_size of 5 didn't work; I was prompted to reduce the test_size. So I changed it to 1, and I got the same error: 'type' object is not subscriptable
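
A side note on the snippet above: Document(page_content="mypdf.pdf") stores the literal string "mypdf.pdf" as the document text, not the contents of the PDF, which may be part of why a smaller test_size was requested. A minimal sketch of building one Document per PDF from page-level Documents follows. The Document class here is a hypothetical stand-in for Langchain's Document so the example runs on its own, and merge_pages is an illustrative helper, not part of Ragas or Langchain.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for Langchain's Document (text plus metadata)."""

    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def merge_pages(pages: List[Document], source: str) -> Document:
    """Join page-level Documents into a single Document for the whole file."""
    # Skip pages whose text extraction returned an empty string.
    text = "\n\n".join(p.page_content for p in pages if p.page_content)
    return Document(page_content=text, metadata={"source": source, "pages": len(pages)})


# Two page-level Documents, shaped like the output of get_pdf_docs() above.
pages = [
    Document("First page text.", {"page": 1}),
    Document("Second page text.", {"page": 2}),
]
whole = merge_pages(pages, source="mypdf.pdf")
```

With the real Langchain Document, the list returned by get_pdf_docs() could be merged the same way and the result passed to generate() as a one-element list.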

shahules786 commented 1 year ago

Hey, sorry for the confusion. I've found the issue: Python 3.8 does not support subscripting the built-in list type (list[BaseNode]). It can be fixed by changing this to t.List here

Would you like to contribute? Or else I can raise a PR.

joe-barhouch commented 1 year ago

Changing

    doc_nodes_map: t.Dict[str, t.List[BaseNode]] = defaultdict(list[BaseNode])

to

    doc_nodes_map: t.Dict[str, t.List[BaseNode]] = defaultdict(t.List[BaseNode])

results in the error:

    256     for node in documenet_nodes:
    257         if node.ref_doc_id:
--> 258             doc_nodes_map[node.ref_doc_id].append(node)
    260     return doc_nodes_map

File ~gen-ai/sandbox/ragas/~/.pyenv/versions/3.8.11/lib/python3.8/typing.py:727, in _GenericAlias.__call__(self, *args, **kwargs)
    725 def __call__(self, *args, **kwargs):
    726     if not self._inst:
--> 727         raise TypeError(f"Type {self._name} cannot be instantiated; "
    728                         f"use {self._name.lower()}() instead")
    729     result = self.__origin__(*args, **kwargs)
    730     try:

TypeError: Type List cannot be instantiated; use list() instead
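
Both errors trace back to the argument given to defaultdict. It is a factory that gets called with no arguments whenever a key is missing, so it has to be something callable that returns a new list. On Python 3.8, list[BaseNode] cannot even be written, and t.List[BaseNode] can be written but not called. A minimal sketch of a version that should run on 3.8 keeps the type information in the annotation and passes plain list as the factory. BaseNode here is a hypothetical stand-in for the real node class, and this is not necessarily the change made in the eventual fix.

```python
import typing as t
from collections import defaultdict


class BaseNode:
    """Hypothetical stand-in for the node class seen in the traceback."""

    def __init__(self, ref_doc_id: t.Optional[str], text: str) -> None:
        self.ref_doc_id = ref_doc_id
        self.text = text


def generate_doc_nodes_map(
    document_nodes: t.List[BaseNode],
) -> t.Dict[str, t.List[BaseNode]]:
    # The annotation carries the element type; the factory is the plain
    # built-in `list`, which is callable on every Python 3 version.
    doc_nodes_map: t.Dict[str, t.List[BaseNode]] = defaultdict(list)
    for node in document_nodes:
        if node.ref_doc_id:
            doc_nodes_map[node.ref_doc_id].append(node)
    return doc_nodes_map


nodes = [
    BaseNode("doc-a", "first chunk"),
    BaseNode("doc-a", "second chunk"),
    BaseNode("doc-b", "other document"),
    BaseNode(None, "orphan node, skipped"),
]
doc_nodes_map = generate_doc_nodes_map(nodes)
```

From Python 3.9 onward, list[BaseNode] is a valid expression and calling it returns an empty list, which is consistent with the original code working on 3.10.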

shahules786 commented 1 year ago

Let me try and raise a fix @joe-barhouch

joe-barhouch commented 1 year ago

Thanks for the help :)

shahules786 commented 1 year ago

No problem. FYI it works in 3.10.

shahules786 commented 1 year ago

Hey @joe-barhouch, please update ragas from source once #263 is merged :) I would recommend feeding the whole document into the Ragas test generator to make the best use of it. Read more about our approach here

explodinggradients / ragas

TestsetGenerator not working #258