explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

When I want to generate data in Chinese, this error message appears. What should I do? #1488

Open Z-oo883 opened 1 day ago

Z-oo883 commented 1 day ago

ragas 0.1.21, Python 3.9

Code:

import nest_asyncio
nest_asyncio.apply()

from ragas.testset import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("xx.pdf")
documents = loader.load_and_split()
print(documents)
for document in documents:
    document.metadata['filename'] = document.metadata['source']

# generator with openai models

generator_llm = ChatOpenAI(
    model="Qwen2",
    temperature=0.3,
    openai_api_key="xxx",
    openai_api_base='xxx',
    stop=['<|im_end|>']
)
critic_llm = ChatOpenAI(
    model="Qwen2",
    temperature=0.3,
    openai_api_key="xxx",
    openai_api_base='xxx',
    stop=['<|im_end|>']
)

embedding_model_name = "\embedding\bge-large-zh-v1.5"
embedding_model_kwargs = {'device': 'cpu'}
embedding_encode_kwargs = {'batch_size': 32, 'normalize_embeddings': True}

embed_model = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    model_kwargs=embedding_model_kwargs,
    encode_kwargs=embedding_encode_kwargs
)

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embed_model
)

language = "chinese"
generator.adapt(language, evolutions=[simple, reasoning, multi_context], cache_dir="a_path")
generator.save(evolutions=[simple, reasoning, multi_context], cache_dir="a_path")

# generate testset

testset = generator.generate_with_langchain_docs(
    documents,
    test_size=1,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    with_debugging_logs=True
)
df = testset.to_pandas()
print(testset)
df.to_csv("test_set.csv", index=False, encoding='utf-8_sig')

Error:

Traceback (most recent call last):
  File "D:\RA_LLM\pythonProject\generate_test_data.py", line 55, in <module>
    generator.adapt(language, evolutions=[simple, reasoning, multi_context], cache_dir="a_path")
  File "D:\anaconda\envs\ragas\lib\site-packages\ragas\testset\generator.py", line 340, in adapt
    self.docstore.extractor.adapt(language, cache_dir=cache_dir)
  File "D:\anaconda\envs\ragas\lib\site-packages\ragas\testset\extractor.py", line 61, in adapt
    self.extractor_prompt = self.extractor_prompt.adapt(
  File "D:\anaconda\envs\ragas\lib\site-packages\ragas\llms\prompt.py", line 185, in adapt
    self_cp = self._load(language, self.name, cache_dir)
  File "D:\anaconda\envs\ragas\lib\site-packages\ragas\llms\prompt.py", line 286, in _load
    return cls(**json.load(open(path)))
  File "D:\anaconda\envs\ragas\lib\site-packages\pydantic\v1\main.py", line 341, in __init__
    raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for Prompt
__root__
  output in example 1 is not in valid json format: Expecting value: line 1 column 1 (char 0) (type=value_error)

dosubot[bot] commented 1 day ago

To resolve the "1 validation error for Prompt" when generating data in Chinese, ensure that the output in your examples is in a valid JSON format. The error indicates that the output field in one of your examples is not properly formatted as JSON.
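For reference, "Expecting value: line 1 column 1 (char 0)" is the message Python's json module produces when it is asked to parse something that is not JSON at all, such as an empty string. A minimal illustration of the same error:

import json

try:
    json.loads("")  # an empty or non-JSON string raises JSONDecodeError
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 1 column 1 (char 0)

This is what the Prompt validator reports when an example's output cannot be parsed as JSON.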

Here's what you can do:

  1. Check the Examples: Ensure that each example in your Prompt class has a valid JSON structure for the output. If the output is a string, it should be properly quoted.

  2. Validate JSON: Use a JSON validator to check the format of your examples (a sketch of this check follows the example below).

Here's an example of how your Prompt object should be structured:

from ragas.llms.prompt import Prompt

prompt = Prompt(
    name="chinese_translation",
    instruction="Translate the following text to Chinese.",
    output_format_instruction="",
    examples=[
        {
            "translate_to": "chinese",
            "input": "What is the capital of China?",
            "output": "中国的首都是哪里?"  # Ensure this is a valid JSON string
        }
    ],
    input_keys=["translate_to", "input"],
    output_key="output",
    output_type="str",
    language="chinese"
)

Make sure all fields are correctly filled and the output is a valid JSON string [1].
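For step 2 above, here is a minimal sketch of how the prompt files that adapt() caches could be checked. It assumes the translated prompts are saved as JSON files somewhere under the cache_dir you passed (the exact layout and file names can differ between ragas versions) and reuses the examples, output_key, and output_type fields shown in the Prompt example above:

import glob
import json
import os

cache_dir = "a_path"  # the same cache_dir passed to generator.adapt()

# Parse every cached prompt file and, for prompts whose output_type is
# "json", check that each example's output string is itself valid JSON.
for path in glob.glob(os.path.join(cache_dir, "**", "*.json"), recursive=True):
    with open(path, encoding="utf-8") as f:
        prompt = json.load(f)
    if prompt.get("output_type") != "json":
        continue
    output_key = prompt.get("output_key", "output")
    for i, example in enumerate(prompt.get("examples", []), start=1):
        output = example.get(output_key)
        if isinstance(output, str):
            try:
                json.loads(output)
            except json.JSONDecodeError as e:
                print(f"{path}: example {i} output is not valid JSON: {e}")

Any file this flags is the one the validator is rejecting; deleting it and re-running adapt() should usually make ragas regenerate that translation.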

To continue talking to Dosu, mention @dosu.