explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

How to make the language of the results exactly match the given contexts, which are split across several files in different languages #1435

Open francescofan opened 3 days ago

francescofan commented 3 days ago

[ ] I checked the documentation and related resources and couldn't find an answer to my question.

Your Question As the Ragas documentation describes in "Automatic language adaptation" (https://docs.ragas.io/en/v0.1.1/howtos/applications/use_prompt_adaptation.html#language-adaptation-for-metrics), we can adapt the language to the document using the adapt function. But as I understand it, I have to set the language in advance, which means I have to know the language of the context I submit. My actual situation is that I have papers and documents in different languages, such as English, Japanese, Chinese, and French. I would like to get a set of questions and ground-truth results written in exactly the same language as the context I submit. Is that possible? If so, how can I achieve it?

Code Examples This community speaks code. Share your code snippets to help us understand your question better.

Additional context Anything else you want to share with us?

dosubot[bot] commented 3 days ago

To adapt the language of questions and ground-truth results to match the language of input documents in multiple languages using the Ragas framework, you can use the adapt function. However, the Ragas framework does not have a built-in feature to automatically detect the language of input documents. You will need to manually specify the language for each document when using the adapt function.

Here's a basic example of how you can use the adapt function:

from ragas.metrics import (
    faithfulness,
    answer_correctness,
)
from langchain.chat_models import ChatOpenAI
from ragas import adapt

# Specify the language model to use for adaptation
openai_model = ChatOpenAI(model_name="gpt-4")

# Adapt the metrics to the target language
# Replace "target_language" with the actual language code for each document
adapt(metrics=[faithfulness, answer_correctness], language="target_language", llm=openai_model)

In your case, since the documents are in different languages, you would need to run the adapt function separately for each document, specifying the correct language code (e.g., "en" for English, "ja" for Japanese, "zh" for Chinese, "fr" for French) for each one.
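As a rough sketch (this is not a built-in Ragas workflow), you could keep your own mapping from language code to documents and re-run adapt once per language, reusing the names from the snippet above. The docs_by_language dict and file names here are hypothetical placeholders:

docs_by_language = {
    "en": ["paper_en.txt"],
    "ja": ["paper_ja.txt"],
    "zh": ["paper_zh.txt"],
    "fr": ["paper_fr.txt"],
}

for lang, doc_paths in docs_by_language.items():
    # Re-adapt the metric prompts to this language before processing
    # the documents written in it.
    adapt(metrics=[faithfulness, answer_correctness], language=lang, llm=openai_model)
    # ... generate questions / evaluate the documents in doc_paths here ...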

If you want to automate the language detection process, you might consider integrating an external language detection library to determine the language of each document before calling the adapt function. Unfortunately, Ragas does not provide utilities for integrating such libraries directly, so you would need to handle this part separately.
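For instance, a minimal sketch using the third-party langdetect package (not part of Ragas; install it separately, and note it reports Chinese as "zh-cn"/"zh-tw" rather than "zh"), again reusing the adapt call from the snippet above; the document path is a placeholder:

# pip install langdetect
from langdetect import detect

with open("paper.txt", encoding="utf-8") as f:  # hypothetical document path
    text = f.read()

lang = detect(text)  # e.g. "en", "ja", "fr", "zh-cn"
adapt(metrics=[faithfulness, answer_correctness], language=lang, llm=openai_model)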

To continue talking to Dosu, mention @dosu.

francescofan commented 3 days ago

@dosu can you list all the language parameter values could be set?

dosubot[bot] commented 3 days ago

The language parameter values you can set typically correspond to standard language codes. Some common ones you might use are en (English), ja (Japanese), zh (Chinese), and fr (French).

These codes follow the ISO 639-1 standard. If you need a language not listed here, you can usually find its code in the ISO 639-1 list.

To continue talking to Dosu, mention @dosu.

francescofan commented 3 days ago

@dosu when I ran the code after changing the language, the following error occurred:

Traceback (most recent call last):
  File "d:\Notes\2.Python\next_console\next_console_qa_evaluate\ragas\ragas_test.py", line 112, in <module>
    generator.adapt(language, evolutions=[simple, reasoning, conditional, multi_context])
  File "D:\anaconda3\envs\next_console_qa_evaluate\Lib\site-packages\ragas\testset\generator.py", line 340, in adapt
    self.docstore.extractor.adapt(language, cache_dir=cache_dir)
  File "D:\anaconda3\envs\next_console_qa_evaluate\Lib\site-packages\ragas\testset\extractor.py", line 61, in adapt
    self.extractor_prompt = self.extractor_prompt.adapt(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\next_console_qa_evaluate\Lib\site-packages\ragas\llms\prompt.py", line 185, in adapt
    self_cp = self._load(language, self.name, cache_dir)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\next_console_qa_evaluate\Lib\site-packages\ragas\llms\prompt.py", line 286, in _load
    return cls(**json.load(open(path)))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\next_console_qa_evaluate\Lib\site-packages\pydantic\v1\main.py", line 341, in __init__
    raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for Prompt
__root__
  output in example 1 is not in valid json format: Expecting value: line 1 column 1 (char 0) (type=value_error)

code

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

language = "zh"
generator.adapt(language, evolutions=[simple, reasoning, conditional, multi_context])
generator.save(evolutions=[simple, reasoning, multi_context, conditional])

dosubot[bot] commented 3 days ago

To resolve the "ValidationError: output in example 1 is not in valid JSON format" error, ensure that the output field in your examples is a valid JSON string. If the output_type is set to "json", the output must be a properly formatted JSON string. Here's how you can fix it:

  1. Make sure the output field in your examples is a valid JSON string. Use json.dumps() to convert a dictionary or list to a JSON string if necessary.

  2. If the output is already a JSON object (like a dictionary or list), ensure it is correctly formatted without syntax errors.

Here's an example of how to correctly format the output:

import json

examples = [
    {
        "translate_to": "chinese",
        "input": {
            "statements": [
                "Albert Einstein was born in Germany.",
                "Albert Einstein was best known for his theory of relativity.",
            ]
        },
        "output": json.dumps({
            "statements": [
                "阿尔伯特·爱因斯坦出生在德国。",
                "阿尔伯特·爱因斯坦因其相对论而闻名。",
            ]
        }),  # Ensure this is a valid JSON string
    },
    # Add more examples as needed
]

Ensure that the output field is correctly formatted as a JSON string if output_type is "json" [1][2].

To continue talking to Dosu, mention @dosu.

francescofan commented 3 days ago

@dosu based on the message above, where is the output field?

dosubot[bot] commented 3 days ago

The output field is typically part of the examples you provide when using the adapt function. It represents the expected output for a given input, formatted as a JSON string if the output_type is set to "json". In your code, it seems like the output field might be part of the data being processed by the TestsetGenerator.

To locate it, check the data or configuration files you are using with the TestsetGenerator. Look for any section where you define examples or expected outputs, and ensure that these outputs are correctly formatted as JSON strings. If you're using a configuration file or script to define these examples, verify that each output is properly structured and valid JSON.
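If you would rather find the offending example programmatically, one option is to scan the prompt cache for entries whose output does not parse as JSON. Below is a minimal sketch, assuming the adapted prompts are cached as JSON files under ~/.cache/ragas (the location is an assumption; adjust it if you passed an explicit cache_dir) and that each file stores an examples list with an output field, as the validation error suggests:

import json
import pathlib

# Assumed default cache location for adapted prompts.
cache_dir = pathlib.Path.home() / ".cache" / "ragas"

for path in cache_dir.rglob("*.json"):
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except ValueError:
        print(f"{path}: file itself is not valid JSON")
        continue
    # The structure below is inferred from the validation error,
    # not from documented Ragas internals.
    if not isinstance(data, dict) or data.get("output_type") != "json":
        continue
    for i, example in enumerate(data.get("examples", []), start=1):
        output = example.get("output") if isinstance(example, dict) else None
        try:
            json.loads(output)
        except (TypeError, ValueError):
            print(f"{path}: output in example {i} is not valid JSON")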

To continue talking to Dosu, mention @dosu.

francescofan commented 3 days ago

@dosu I haven't defined any examples manually. Where could the examples be located?

dosubot[bot] commented 3 days ago

If you haven't manually defined any examples, the TestsetGenerator in Ragas might be using default examples. The generator determines default examples using the default_query_distribution function when no query distribution is manually defined. This is set in the generate method where query_distribution defaults to default_query_distribution(self.llm) if it is None [1].

To locate these default examples, you would need to check the implementation of the default_query_distribution function or any related configuration within the Ragas framework that might be providing these defaults.

To continue talking to Dosu, mention @dosu.

francescofan commented 3 days ago

@dosu could you tell me the exact place or filename where I should check?

dosubot[bot] commented 3 days ago

You should check the default_query_distribution function, which is implemented in the src/ragas/testset/synthesizers/__init__.py file of the Ragas codebase [1]. This function is likely responsible for providing the default examples or query distribution used by the TestsetGenerator.

To continue talking to Dosu, mention @dosu.

francescofan commented 3 days ago

@dosu In the 'Lib\site-packages\ragas' directory of my Anaconda environment, there is no file or directory named synthesizers.

dosubot[bot] commented 3 days ago

In the installed package directory, the structure might differ from the source repository. The functionality related to default examples or query distribution could be integrated into other parts of the package.

To troubleshoot further, consider checking the following:

  1. generator.py or extractor.py: These files might contain logic related to generating or adapting examples. Look for any functions or methods that deal with default settings or examples.

  2. Configuration Files: Check if there are any configuration files or settings within the package that specify default behaviors or examples.

  3. Documentation or Comments: Look for any documentation or comments within the code that might indicate where default examples are set or how they are generated.

If these steps don't help, you might need to explore the package's source code more broadly to understand how it handles default examples or query distributions. Alternatively, consider reaching out to the maintainers or community for more specific guidance on the package's structure.
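As a quick first step (standard Python, nothing Ragas-specific), you can print the installed version and the package path, then browse that directory to see how the installed layout actually differs from the source repository:

from importlib.metadata import version

import ragas

print(version("ragas"))  # installed Ragas version
print(ragas.__file__)    # location of the installed package on disk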

To continue talking to Dosu, mention @dosu.