explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

How can I adapt the language of the results to exactly match the given contexts, which are split across several files in different languages? #1435

Open francescofan opened 1 month ago

francescofan commented 1 month ago

[ ] I checked the documentation and related resources and couldn't find an answer to my question.

Your Question As the Ragas documentation says in "Automatic language adaptation" (https://docs.ragas.io/en/v0.1.1/howtos/applications/use_prompt_adaptation.html#language-adaptation-for-metrics), we can adapt the language to the document using the adapt function. But it seems the documentation requires me to set the language in advance, meaning I have to know the language of the context I submit. In my actual situation, I have papers and documents in different languages such as English, Japanese, Chinese, and French. I would like to get a set of questions and ground-truth results written in exactly the same language as the context I submit. Is that possible? If so, how can I achieve it?

Code Examples This community speaks code. Share your code snippets to help us understand your question better.

Additional context Anything else you want to share with us?

dosubot[bot] commented 1 month ago

To adapt the language of questions and ground-truth results to match the language of input documents in multiple languages using the Ragas framework, you can use the adapt function. However, the Ragas framework does not have a built-in feature to automatically detect the language of input documents. You will need to manually specify the language for each document when using the adapt function.

Here's a basic example of how you can use the adapt function:

from ragas.metrics import (
    faithfulness,
    answer_correctness,
)
from langchain.chat_models import ChatOpenAI
from ragas import adapt

# Specify the language model to use for adaptation
openai_model = ChatOpenAI(model_name="gpt-4")

# Adapt the metrics to the target language
# Replace "target_language" with the actual language code for each document
adapt(metrics=[faithfulness, answer_correctness], language="target_language", llm=openai_model)

In your case, since the documents are in different languages, you would need to run the adapt function separately for each document, specifying the correct language code (e.g., "en" for English, "ja" for Japanese, "zh" for Chinese, "fr" for French) for each one.

If you want to automate the language detection process, you might consider integrating an external language detection library to determine the language of each document before calling the adapt function. Unfortunately, Ragas does not provide utilities for integrating such libraries directly, so you would need to handle this part separately.
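
As an illustration of that approach, here is a minimal sketch that assumes the third-party langdetect package is installed (it is not part of Ragas) and reuses the adapt call shown above; the sample documents and file names are hypothetical:

from langdetect import detect  # third-party language detection library (assumption: installed separately)
from langchain.chat_models import ChatOpenAI
from ragas import adapt
from ragas.metrics import faithfulness, answer_correctness

openai_model = ChatOpenAI(model_name="gpt-4")

# Hypothetical documents in different languages
documents = {
    "paper_en.txt": "Albert Einstein was born in Germany.",
    "paper_zh.txt": "阿尔伯特·爱因斯坦出生在德国。",
}

# Detect the language code of each document, then adapt the metrics
# once per detected language before evaluating documents in that language.
for name, text in documents.items():
    language = detect(text)  # e.g. "en", "zh-cn", "ja", "fr"
    adapt(metrics=[faithfulness, answer_correctness], language=language, llm=openai_model)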

To continue talking to Dosu, mention @dosu.

francescofan commented 1 month ago

@dosu can you list all the values that the language parameter can be set to?

dosubot[bot] commented 1 month ago

The language parameter values you can set typically correspond to standard language codes. Some common ones you might use are "en" for English, "ja" for Japanese, "zh" for Chinese, and "fr" for French.

These codes follow the ISO 639-1 standard. If you need a language not listed here, you can usually find its code in the ISO 639-1 list.

To continue talking to Dosu, mention @dosu.

francescofan commented 1 month ago

@dosu when I ran the code after modifying the language, the following error occurred:

Traceback (most recent call last):
  File "d:\Notes\2.Python\next_console\next_console_qa_evaluate\ragas\ragas_test.py", line 112, in <module>
    generator.adapt(language, evolutions=[simple, reasoning, conditional, multi_context])
  File "D:\anaconda3\envs\next_console_qa_evaluate\Lib\site-packages\ragas\testset\generator.py", line 340, in adapt
    self.docstore.extractor.adapt(language, cache_dir=cache_dir)
  File "D:\anaconda3\envs\next_console_qa_evaluate\Lib\site-packages\ragas\testset\extractor.py", line 61, in adapt
    self.extractor_prompt = self.extractor_prompt.adapt(
  File "D:\anaconda3\envs\next_console_qa_evaluate\Lib\site-packages\ragas\llms\prompt.py", line 185, in adapt
    self_cp = self._load(language, self.name, cache_dir)
  File "D:\anaconda3\envs\next_console_qa_evaluate\Lib\site-packages\ragas\llms\prompt.py", line 286, in _load
    return cls(**json.load(open(path)))
  File "D:\anaconda3\envs\next_console_qa_evaluate\Lib\site-packages\pydantic\v1\main.py", line 341, in __init__
    raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for Prompt
__root__
  output in example 1 is not in valid json format: Expecting value: line 1 column 1 (char 0) (type=value_error)

code

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

language = "zh"
generator.adapt(language, evolutions=[simple, reasoning, conditional, multi_context])
generator.save(evolutions=[simple, reasoning, multi_context, conditional])

dosubot[bot] commented 1 month ago

To resolve the "ValidationError: output in example 1 is not in valid JSON format" error, ensure that the output field in your examples is a valid JSON string. If the output_type is set to "json", the output must be a properly formatted JSON string. Here's how you can fix it:

  1. Make sure the output field in your examples is a valid JSON string. Use json.dumps() to convert a dictionary or list to a JSON string if necessary.

  2. If the output is already a JSON object (like a dictionary or list), ensure it is correctly formatted without syntax errors.

Here's an example of how to correctly format the output:

import json

examples = [
    {
        "translate_to": "chinese",
        "input": {
            "statements": [
                "Albert Einstein was born in Germany.",
                "Albert Einstein was best known for his theory of relativity.",
            ]
        },
        "output": json.dumps({
            "statements": [
                "阿尔伯特·爱因斯坦出生在德国。",
                "阿尔伯特·爱因斯坦因其相对论而闻名。",
            ]
        }),  # Ensure this is a valid JSON string
    },
    # Add more examples as needed
]

Ensure that the output field is correctly formatted as a JSON string if output_type is "json" [1][2].

To continue talking to Dosu, mention @dosu.

francescofan commented 1 month ago

@dosu based on the message above, where is the output field?

dosubot[bot] commented 1 month ago

The output field is typically part of the examples you provide when using the adapt function. It represents the expected output for a given input, formatted as a JSON string if the output_type is set to "json". In your code, it seems like the output field might be part of the data being processed by the TestsetGenerator.

To locate it, check the data or configuration files you are using with the TestsetGenerator. Look for any section where you define examples or expected outputs, and ensure that these outputs are correctly formatted as JSON strings. If you're using a configuration file or script to define these examples, verify that each output is properly structured and valid JSON.

To continue talking to Dosu, mention @dosu.

francescofan commented 1 month ago

@dosu I haven't defined any example manually. Where could the examples be located?

dosubot[bot] commented 1 month ago

If you haven't manually defined any examples, the TestsetGenerator in Ragas might be using default examples. The generator determines default examples using the default_query_distribution function when no query distribution is manually defined. This is set in the generate method where query_distribution defaults to default_query_distribution(self.llm) if it is None [1].

To locate these default examples, you would need to check the implementation of the default_query_distribution function or any related configuration within the Ragas framework that might be providing these defaults.

To continue talking to Dosu, mention @dosu.

francescofan commented 1 month ago

@dosu could you tell me the exact place or filename where I should check?

dosubot[bot] commented 1 month ago

You should check the default_query_distribution function, which is implemented in the src/ragas/testset/synthesizers/__init__.py file of the Ragas codebase [1]. This function is likely responsible for providing the default examples or query distribution used by the TestsetGenerator.

To continue talking to Dosu, mention @dosu.

francescofan commented 1 month ago

@dosu in the 'Lib\site-packages\ragas' Anaconda packages directory, there is no file or directory named synthesizers.

dosubot[bot] commented 1 month ago

In the installed package directory, the structure might differ from the source repository. The functionality related to default examples or query distribution could be integrated into other parts of the package.

To troubleshoot further, consider checking the following:

  1. generator.py or extractor.py: These files might contain logic related to generating or adapting examples. Look for any functions or methods that deal with default settings or examples.

  2. Configuration Files: Check if there are any configuration files or settings within the package that specify default behaviors or examples.

  3. Documentation or Comments: Look for any documentation or comments within the code that might indicate where default examples are set or how they are generated.

If these steps don't help, you might need to explore the package's source code more broadly to understand how it handles default examples or query distributions. Alternatively, consider reaching out to the maintainers or community for more specific guidance on the package's structure.

To continue talking to Dosu, mention @dosu.

francescofan commented 1 month ago

@dosu I solved this problem, but a new problem occurred. In the call generator.adapt(language, evolutions=[simple, reasoning, conditional, multi_context], cache_dir='xxxx'), if I don't set the value of cache_dir, the program may not find the correct directory, because I am using Windows and the default cache_dir is '/.cache\ragas', which is invalid on Windows. But a new question came to me. After I run the following code:

language = "zh"
generator.adapt(language, evolutions=[simple, reasoning, conditional, multi_context],cache_dir='.\\ragas\\cache\\')
generator.save(evolutions=[simple, reasoning, multi_context,conditional])

A directory 'zh' was created in '.\ragas\cache\' and there are a lot of prompts stored as JSON files in the 'zh' directory. According to the documentation, there is no need to adapt the language again, and I can run the 'generator.generate_with_langchain_docs' function many times. Is that so? How can I figure out which language adaptation I am using?

dosubot[bot] commented 1 month ago

Yes, once you've adapted the language and stored the prompts as JSON files in the cache directory, you don't need to adapt the language again. You can repeatedly run the generator.generate_with_langchain_docs function using the cached prompts.

To determine which language adaptation is being used, check the language parameter you set during the adapt function. In your case, it's "zh" for Chinese. The adapt function ensures that the correct language-specific JSON files are loaded from the cache directory when generating with generate_with_langchain_docs [1][2].
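
If you want to confirm which adaptations are cached on disk, a simple check (a sketch only, assuming the cache layout described above, i.e. one sub-directory of prompt JSON files per language code) is to list the cache directory:

import os

cache_dir = ".\\ragas\\cache\\"  # the directory passed to generator.adapt/save

# Each adapted language gets its own sub-directory of prompt JSON files,
# so the directory names tell you which adaptations are available.
for entry in os.listdir(cache_dir):
    if os.path.isdir(os.path.join(cache_dir, entry)):
        print("cached language adaptation:", entry)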

To continue talking to Dosu, mention @dosu.

francescofan commented 1 month ago

@dosu Do I need to set the cache_dir value as a GLOBAL parameter? How does generator.generate_with_langchain_docs find which prompts it should use?

dosubot[bot] commented 1 month ago

You don't need to set the cache_dir as a global parameter. It can be passed as an optional argument to the adapt and save methods. If not provided, it defaults to the result of get_cache_dir() [1].

The generator.generate_with_langchain_docs function finds which prompts to use by utilizing the Prompt class. Each prompt includes input_keys and output_key, which help format the prompts with the provided examples. The Prompt class has methods like to_string and format to ensure the correct prompts are used during generation [1][2].
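
As an illustration (not the exact loading code), you can open one of the cached prompt files yourself to see the adapted instruction and examples; the file name keyphrase_extraction.json is taken from the prompt name mentioned later in this thread, and the JSON field names are assumptions about how the Prompt is serialized:

import json
import os

# Hypothetical path: substitute your own cache_dir, language, and prompt name.
path = os.path.join(".", "ragas", "cache", "zh", "keyphrase_extraction.json")

with open(path, encoding="utf-8") as f:
    prompt_data = json.load(f)

# Field names below are assumptions about the serialized Prompt schema.
print(prompt_data.get("instruction"))
print(prompt_data.get("examples", [])[:1])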

To continue talking to Dosu, mention @dosu.

francescofan commented 1 month ago

@dosu I mean, how can generator.generate_with_langchain_docs always use the prompts from a specific cache_dir? For example, I have a '.\ragas\cache\zh' directory in which a lot of prompt JSON files were created. How can I make generate_with_langchain_docs always use these prompt JSON files?

dosubot[bot] commented 1 month ago

To ensure generator.generate_with_langchain_docs always uses the specific cache directory with your prompts, you should specify the cache_dir when calling the function. If the function doesn't have a direct parameter for cache_dir, you might need to set it globally or ensure it's set correctly in the environment or configuration used by the generator.

Here's a general approach:

  1. Set the Cache Directory: When you adapt and save, specify the cache_dir as you've done:

    generator.adapt(language, evolutions=[simple, reasoning, conditional, multi_context], cache_dir='.\\ragas\\cache\\')

  2. Use the Cache Directory: Ensure that any function or method that relies on cached prompts is aware of the cache_dir. If generate_with_langchain_docs doesn't take a cache_dir parameter, check if there's a configuration or environment variable that can be set.

  3. Check Documentation: Verify if there's a specific way to set the cache directory for generate_with_langchain_docs in the documentation or source code.

By ensuring the cache directory is consistently set, the generator should use the prompts from your specified location.
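
Putting these steps together, a minimal sketch could look like the following; it assumes generator and documents are already created as in the earlier snippets, and that the evolutions import path below matches your installed version:

from ragas.testset.evolutions import simple, reasoning, conditional, multi_context

CACHE_DIR = ".\\ragas\\cache\\"  # keep one constant so every run points at the same cache

# Re-adapting loads the language-specific prompts (from the cache if present)
# into the generator, so subsequent generation calls use them.
generator.adapt("zh", evolutions=[simple, reasoning, conditional, multi_context],
                cache_dir=CACHE_DIR)

testset = generator.generate_with_langchain_docs(documents, test_size=10)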

To continue talking to Dosu, mention @dosu.

francescofan commented 1 month ago

@dosu I finished your step 1, and there is already a directory '.\ragas\cache\zh' here. As for step 2, generate_with_langchain_docs doesn't take a cache_dir parameter, so where can I find the settings, and would such a setting be effective for 'generate_with_langchain_docs'? As for step 3, I can't find a way to use a specific cache_dir when running 'generate_with_langchain_docs'. So do I need to run generator.adapt(language, evolutions=[simple, reasoning, conditional, multi_context], cache_dir='.\ragas\cache\') every time I need to run generate_with_langchain_docs?

And finally another question occurred. If I run the adapt function a second time with the same cache_dir, I get the error 'output in example 1 is not in valid json format: Expecting value: line 1 column 1 (char 0) (type=value_error)'. I printed the three parameters passed to the function as follows:

cache_dir: .\ragas\cache
language: zh
name: keyphrase_extraction

And in the cache_dir, there is already a directory named 'zh' which was created last time. Why does this happen and how can I solve it?

dosubot[bot] commented 1 month ago

To ensure generate_with_langchain_docs uses a specific cache directory, you need to set the cache directory globally using the RAGAS_CACHE_HOME environment variable. This will direct all functions to use the specified cache directory:

import os

os.environ['RAGAS_CACHE_HOME'] = '.\\ragas\\cache\\'

This way, you don't need to specify the cache_dir each time, and it will use the cached prompts from your specified directory.

Regarding the JSON format error when running the adapt function a second time with the same cache directory, it might be due to corrupted or improperly formatted JSON files. Ensure that the JSON files in the cache directory are valid and correctly formatted. If the adapted prompt is already present, it attempts to load it, so any corruption could cause errors. You might need to manually check and fix the JSON files in the cache directory [1][2].
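
To find a corrupted file, a small helper like the following (plain JSON parsing, nothing Ragas-specific; the cache path is an assumption based on your setup) can report which cached prompt files fail to parse:

import glob
import json
import os

cache_dir = os.path.join(".", "ragas", "cache", "zh")

# Try to parse every cached prompt file and report the ones that are broken.
for path in glob.glob(os.path.join(cache_dir, "*.json")):
    try:
        with open(path, encoding="utf-8") as f:
            json.load(f)
    except json.JSONDecodeError as exc:
        print(f"invalid JSON in {path}: {exc}")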

To continue talking to Dosu, mention @dosu.

francescofan commented 1 month ago

@dosu hey, the JSON files were created by generator.save(evolutions=[simple, reasoning, multi_context, conditional]), so how could they be invalid?

dosubot[bot] commented 1 month ago

If the JSON files created by generator.save are invalid, it might be due to issues during the save process. Here are a few things to check:

  1. Ensure Valid Data: Before saving, ensure that all data being serialized to JSON is valid. This includes checking for any non-serializable objects or incorrect data structures.

  2. Check for Errors During Save: Look for any errors or warnings in the logs when running generator.save. These might indicate issues with the data being saved.

  3. Inspect JSON Files: Manually inspect the JSON files to ensure they are complete and correctly formatted. Look for any unexpected characters or incomplete data.

  4. Validation: Use a JSON validator to check the files for syntax errors.

If the issue persists, consider adding logging or debugging statements in the save process to capture more details about what might be going wrong.
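
For the last point, one way to capture more detail is to turn on debug logging before saving; this is a sketch only, the logger name "ragas" is an assumption about how the package names its loggers, and generator and the evolutions are the objects from your earlier snippets:

import logging

# Emit debug output while the prompts are saved, so serialization
# problems show up in the console.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("ragas").setLevel(logging.DEBUG)

generator.save(evolutions=[simple, reasoning, multi_context, conditional])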

To continue talking to Dosu, mention @dosu.