explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

Testset generation ValueError: invalid literal for int() with base 10: #966

Open choshiho opened 5 months ago

choshiho commented 5 months ago

[ ] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug

Calling testset = TestsetGenerator.generate_with_langchain_docs() encounters an issue and the program terminates unexpectedly:
if int(i) - 1 < len(current_nodes.nodes)
       ^^^^^^
ValueError: invalid literal for int() with base 10:

Ragas version: 0.1.7
Python version: 3.11.7
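
For context, the failing comprehension expects the model to return numeric context indices; when the model answers with free text instead, int() raises. An illustrative example:

# the comprehension expects numeric context indices such as "2"
print(int("2") - 1)  # -> 1, a valid node index

# ...but the model sometimes answers with free text, which int() cannot parse
try:
    int("A: Adam bought 2 boxes of chocolate candy and 5 boxes of caramel candy.")
except ValueError as e:
    print(e)  # invalid literal for int() with base 10: 'A: Adam bought ...'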

Code to Reproduce

First, I deployed my Qwen1.5-7B-Chat-GPTQ-Int8 model with vLLM using the following command:

CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server --served-model-name Qwen1.5-7B-Chat-GPTQ-Int8 --model /home/zhifeng.zhao/.cache/modelscope/hub/qwen/Qwen1___5-7B-Chat-GPTQ-Int8 --max-model-len 18576

Then I ran the following code in a Jupyter notebook:

from langchain_openai import ChatOpenAI
from langchain_community.embeddings import SentenceTransformerEmbeddings

chat = ChatOpenAI(
    # streaming=True,
    verbose=True,
    openai_api_key='EMPTY',
    openai_api_base='http://localhost:8000/v1',
    model_name="Qwen1.5-7B-Chat-GPTQ-Int8",
    temperature=0.0,
    max_tokens=2048, # Maximum number of tokens to generate.
    openai_proxy='',
)
embedding_function = SentenceTransformerEmbeddings(model_name="/home/zhifeng.zhao/.cache/modelscope/hub/AI-ModelScope/bge-small-en-v1___5")

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

langchain_llm = LangchainLLMWrapper(chat)
langchain_embeddings = LangchainEmbeddingsWrapper(embedding_function)

from ragas.testset.generator import TestsetGenerator

# generator with custom llm and embeddings
# (from_langchain wraps raw LangChain models in LangchainLLMWrapper /
# LangchainEmbeddingsWrapper internally, so pass the unwrapped objects)
generator = TestsetGenerator.from_langchain(
    generator_llm=chat,
    critic_llm=chat,
    embeddings=embedding_function,
)

# default extractor
from ragas.testset.extractor import KeyphraseExtractor
from langchain.text_splitter import TokenTextSplitter
# default DocumentStore
from ragas.testset.docstore import InMemoryDocumentStore

# init the DocumentStore with your own llm and embeddings
splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=100)
keyphrase_extractor = KeyphraseExtractor(llm=langchain_llm)
docstore = InMemoryDocumentStore(
    splitter=splitter,
    embeddings=langchain_embeddings,
    extractor=keyphrase_extractor,
)
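
Note that the custom docstore above is only used if it is handed to the generator; in ragas 0.1.x, TestsetGenerator.from_langchain accepts an optional docstore argument, so a sketch of wiring it in would be:

# pass the custom docstore so the custom splitter/extractor are actually
# used during generation (sketch; keyword per the ragas 0.1.x customisation guide)
generator = TestsetGenerator.from_langchain(
    generator_llm=chat,
    critic_llm=chat,
    embeddings=embedding_function,
    docstore=docstore,
)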


from ragas.testset.prompts import (
    context_scoring_prompt,
    evolution_elimination_prompt,
    filter_question_prompt,
)
from langchain_community.document_loaders import DirectoryLoader
from ragas.testset.evolutions import simple, reasoning, multi_context

# remove demonstrations from examples
for prompt in [
    context_scoring_prompt,
    evolution_elimination_prompt,
    filter_question_prompt,
]:
    prompt.examples = []

from ragas.testset.filters import QuestionFilter, EvolutionFilter, NodeFilter

qa_filter = QuestionFilter(langchain_llm, filter_question_prompt)
node_filter = NodeFilter(langchain_llm, context_scoring_prompt=context_scoring_prompt)
evolution_filter = EvolutionFilter(langchain_llm, evolution_elimination_prompt)

distributions = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}

# customise the filters
from ragas.testset.evolutions import ComplexEvolution

for evolution in distributions:
    if evolution.question_filter is None:
        evolution.question_filter = qa_filter
    if evolution.node_filter is None:
        evolution.node_filter = node_filter

    if isinstance(evolution, ComplexEvolution):
        if evolution.evolution_filter is None:
            evolution.evolution_filter = evolution_filter

loader = DirectoryLoader("/home/zhifeng.zhao/prompt-engineering-guide-papers", glob="*.pdf")
documents = loader.load()

for document in documents:
    document.metadata["filename"] = document.metadata["source"]

documents = [doc for doc in documents if len(doc.page_content.split()) > 5000]

# generator = TestsetGenerator.with_openai(chunk_size=512)
testset = generator.generate_with_langchain_docs(
    documents[:10],
    test_size=10,
    raise_exceptions=False,
    with_debugging_logs=False,
    distributions=distributions,
)

Error trace

Runner in Executor raised an exception
Traceback (most recent call last):
  File "/home/zhifeng.zhao/anaconda3/lib/python3.11/site-packages/ragas/executor.py", line 79, in _aresults
    r = await future
        ^^^^^^^^^^^^
  File "/home/zhifeng.zhao/anaconda3/lib/python3.11/asyncio/tasks.py", line 615, in _wait_for_one
    return f.result()  # May raise f.exception().
           ^^^^^^^^^^
  File "/home/zhifeng.zhao/anaconda3/lib/python3.11/site-packages/ragas/executor.py", line 38, in sema_coro
    return await coro
           ^^^^^^^^^^
  File "/home/zhifeng.zhao/anaconda3/lib/python3.11/site-packages/ragas/executor.py", line 112, in wrapped_callable_async
    return counter, await callable(*args, **kwargs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhifeng.zhao/anaconda3/lib/python3.11/site-packages/ragas/testset/evolutions.py", line 144, in evolve
    return await self.generate_datarow(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhifeng.zhao/anaconda3/lib/python3.11/site-packages/ragas/testset/evolutions.py", line 210, in generate_datarow
    selected_nodes = [
                     ^
  File "/home/zhifeng.zhao/anaconda3/lib/python3.11/site-packages/ragas/testset/evolutions.py", line 213, in <listcomp>
    if int(i) - 1 < len(current_nodes.nodes)
       ^^^^^^
ValueError: invalid literal for int() with base 10: 'A: Adam bought 2 boxes of chocolate candy and 5 boxes of caramel candy. If each box has 4 pieces inside it, how much candy did he have total?'
The same traceback (Runner in Executor raised an exception, ending at the list comprehension in ragas/testset/evolutions.py, line 213) repeats for each failed sample; only the unparseable model output differs:

ValueError: invalid literal for int() with base 10: '1. In the context of the model PaLM-540B, self-consistency aids in error repair by ensuring that reasoning paths generated by the model remain coherent and consistent with the ground truth. This is d
ValueError: invalid literal for int() with base 10: 'A: Let’s think step by step. Adam bought 2 boxes of chocolate candy and 5 boxes of caramel candy. Each box of candy has 4 pieces inside it. So, Adam bought 10 pieces of candy. Therefore, the answer (
ValueError: invalid literal for int() with base 10: '2. Adam bought 2 boxes of chocolate candy and 5 boxes of caramel candy. If each box has 4 pieces inside it, how much candy did he have total? (GT : 28)'
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.

Expected behavior

TestsetGenerator.generate_with_langchain_docs() returns a TestDataset object with 10 elements.
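
A successful run can be sanity-checked like this (TestDataset exposes to_pandas() in ragas 0.1.x):

# a run with test_size=10 should yield 10 rows
df = testset.to_pandas()
print(len(df))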

Additional context

I have already edited /home/zhifeng.zhao/anaconda3/lib/python3.11/site-packages/ragas/testset/evolutions.py as suggested in issue #900:

selected_nodes = [
    current_nodes.nodes[int(i) - 1]
    for i in relevant_context_indices
    if int(i) - 1 < len(current_nodes.nodes)
]
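
A slightly more defensive variant of that patch is sketched below; it is a hypothetical workaround, not the shipped fix. It skips entries the model returned as free text instead of letting int() raise:

def _safe_index(value):
    """Parse a model-returned context index; None if it is not an integer."""
    try:
        return int(str(value).strip())
    except (TypeError, ValueError):
        return None

# hypothetical replacement for the list comprehension in generate_datarow:
# drop free-text "indices" and out-of-range values instead of crashing
selected_nodes = [
    current_nodes.nodes[idx - 1]
    for idx in map(_safe_index, relevant_context_indices)
    if idx is not None and 0 < idx <= len(current_nodes.nodes)
]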

jjmachan commented 5 months ago

This seems like a model-specific issue with how we parse its outputs. It is something we are aware of and will be fixing in the coming weeks, but sadly it's not an easy fix.

The easier fix is to use a more capable model. I'm curious why you aren't using models like GPT-4 or Claude for your use case?
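
A minimal sketch of that suggestion: keep the locally served Qwen model as the generator and use a hosted, more steerable model only as the critic. The gpt-4 model name and key handling are illustrative, and embedding_function is the one defined in the report above:

from langchain_openai import ChatOpenAI
from ragas.testset.generator import TestsetGenerator

# local vLLM-served model as the generator (as deployed in the report)
generator_llm = ChatOpenAI(
    model_name="Qwen1.5-7B-Chat-GPTQ-Int8",
    openai_api_base="http://localhost:8000/v1",
    openai_api_key="EMPTY",
    temperature=0.0,
)

# hosted model as the critic; assumes OPENAI_API_KEY is set in the environment
critic_llm = ChatOpenAI(model_name="gpt-4", temperature=0.0)

generator = TestsetGenerator.from_langchain(
    generator_llm=generator_llm,
    critic_llm=critic_llm,
    embeddings=embedding_function,
)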

choshiho commented 4 months ago

This seems like a model-specific issue with how we parse its outputs. It is something we are aware of and will be fixing in the coming weeks, but sadly it's not an easy fix.

The easier fix is to use a more capable model. I'm curious why you aren't using models like GPT-4 or Claude for your use case?

Thank you for your reply! I want to use open-source LLMs because we need to support private-deployment scenarios. Does ragas have a list of supported open-source LLMs to choose from as a critic model, or can we pick one from the open-source LLM landscape and use it for test-set generation?

jjmachan commented 4 months ago

The recommendation is to try out something as powerful as GPT-4, because at that scale models are much more steerable with prompts.

Something else you can try is our custom critic model: https://docs.ragas.io/en/stable/howtos/customisations/ragas_custom_model.html

If you want help using it and setting it up, let me know; happy to help @choshiho

choshiho commented 4 months ago

The recommendation is to try out something as powerful as GPT-4, because at that scale models are much more steerable with prompts.

Something else you can try is our custom critic model: https://docs.ragas.io/en/stable/howtos/customisations/ragas_custom_model.html

If you want help using it and setting it up, let me know; happy to help @choshiho

Thank you! It seems the official ragas critic model from https://docs.ragas.io/en/stable/howtos/customisations/ragas_custom_model.html can't handle Chinese. Is there an open-source Chinese critic model you could recommend?

jjmachan commented 4 months ago

Unfortunately, there is no open-source Chinese critic model at present. The best option would be to use a proprietary model that follows Chinese well (GPT-4, Claude, etc.); using Azure OpenAI might also help.

Alternatively, we might be able to help you fine-tune a model, but that would be custom work done for you, so we would have to charge for it.

what do you think?