Open francescofan opened 1 month ago
To adapt the language of questions and ground-truth results to match the language of input documents in multiple languages using the Ragas framework, you can use the adapt function. However, the Ragas framework does not have a built-in feature to automatically detect the language of input documents; you will need to specify the language yourself for each document when using the adapt function.
Here's a basic example of how you can use the adapt function:
from ragas.metrics import (
faithfulness,
answer_correctness,
)
from langchain.chat_models import ChatOpenAI
from ragas import adapt
# Specify the language model to use for adaptation
openai_model = ChatOpenAI(model_name="gpt-4")
# Adapt the metrics to the target language
# Replace "target_language" with the actual language code for each document
adapt(metrics=[faithfulness, answer_correctness], language="target_language", llm=openai_model)
In your case, since the documents are in different languages, you would need to run the adapt function separately for each document, specifying the correct language code (e.g., "en" for English, "ja" for Japanese, "zh" for Chinese, "fr" for French) for each one.
If you want to automate the language detection process, you might consider integrating an external language detection library to determine the language of each document before calling the adapt function. Unfortunately, Ragas does not provide utilities for integrating such libraries directly, so you would need to handle this part separately.
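If you want to automate detection, a minimal sketch of that idea is below. It assumes the third-party langdetect package (not part of Ragas), a `documents` list of LangChain documents you have already loaded, and the openai_model defined above; the grouping helper is purely illustrative.
from collections import defaultdict
from langdetect import detect  # pip install langdetect; any detection library would do
from ragas import adapt
from ragas.metrics import faithfulness, answer_correctness
# Group the already-loaded documents by detected language code
docs_by_language = defaultdict(list)
for doc in documents:  # hypothetical: your own list of LangChain documents
    docs_by_language[detect(doc.page_content)].append(doc)
# langdetect returns codes such as 'en', 'ja', 'fr', and 'zh-cn' for Chinese,
# so map them to the codes you want to pass to adapt if needed
for language_code, docs in docs_by_language.items():
    adapt(metrics=[faithfulness, answer_correctness], language=language_code, llm=openai_model)
    # ... evaluate `docs` here with the metrics adapted to language_code ...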
@dosu can you list all the values the language parameter can be set to?
The language parameter values you can set typically correspond to standard language codes. Common ones include "en" (English), "zh" (Chinese), "ja" (Japanese), "fr" (French), "de" (German), "es" (Spanish), and "ko" (Korean).
These codes follow the ISO 639-1 standard. If you need a language not listed here, you can usually find its code in the ISO 639-1 list.
@dosu
when I ran the code after modifying the language, the following error occurred:
error:
Traceback (most recent call last):
File "d:\Notes\2.Python\next_console\next_console_qa_evaluate\ragas\ragas_test.py", line 112, in
code:
generator = TestsetGenerator.from_langchain(
generator_llm,
critic_llm,
embeddings
)
language = "zh"
generator.adapt(language, evolutions=[simple, reasoning, conditional, multi_context])
generator.save(evolutions=[simple, reasoning, multi_context,conditional])
To resolve the "ValidationError: output in example 1 is not in valid JSON format" error, ensure that the output field in your examples is a valid JSON string. If the output_type is set to "json", the output must be a properly formatted JSON string. Here's how you can fix it:
Make sure the output field in your examples is a valid JSON string. Use json.dumps() to convert a dictionary or list to a JSON string if necessary.
If the output is already a JSON object (like a dictionary or list), ensure it is correctly formatted without syntax errors.
Here's an example of how to correctly format the output:
import json
examples = [
{
"translate_to": "chinese",
"input": {
"statements": [
"Albert Einstein was born in Germany.",
"Albert Einstein was best known for his theory of relativity.",
]
},
"output": json.dumps({
"statements": [
"阿尔伯特·爱因斯坦出生在德国。",
"阿尔伯特·爱因斯坦因其相对论而闻名。",
]
}), # Ensure this is a valid JSON string
},
# Add more examples as needed
]
Ensure that the output field is correctly formatted as a JSON string if output_type is "json" [1][2].
@dosu based on the message above, where is the output field?
The output field is typically part of the examples you provide when using the adapt function. It represents the expected output for a given input, formatted as a JSON string if the output_type is set to "json". In your code, it seems like the output field might be part of the data being processed by the TestsetGenerator.
To locate it, check the data or configuration files you are using with the TestsetGenerator. Look for any section where you define examples or expected outputs, and ensure that these outputs are correctly formatted as JSON strings. If you're using a configuration file or script to define these examples, verify that each output is properly structured and valid JSON.
@dosu I haven't defined any examples manually. Where could the examples be located?
If you haven't manually defined any examples, the TestsetGenerator in Ragas might be using default examples. The generator determines default examples using the default_query_distribution function when no query distribution is manually defined. This is set in the generate method, where query_distribution defaults to default_query_distribution(self.llm) if it is None [1].
To locate these default examples, you would need to check the implementation of the default_query_distribution function or any related configuration within the Ragas framework that might be providing these defaults.
@dosu could you tell me the exact place or filename where I should check?
You should check the default_query_distribution function, which is implemented in the src/ragas/testset/synthesizers/__init__.py file of the Ragas codebase [1]. This function is likely responsible for providing the default examples or query distribution used by the TestsetGenerator.
@dosu in the 'Lib\site-packages\ragas' Anaconda package directory, no file or directory is named synthesizers.
In the installed package directory, the structure might differ from the source repository. The functionality related to default examples or query distribution could be integrated into other parts of the package.
To troubleshoot further, consider checking the following:
generator.py or extractor.py: These files might contain logic related to generating or adapting examples. Look for any functions or methods that deal with default settings or examples.
Configuration Files: Check if there are any configuration files or settings within the package that specify default behaviors or examples.
Documentation or Comments: Look for any documentation or comments within the code that might indicate where default examples are set or how they are generated.
If these steps don't help, you might need to explore the package's source code more broadly to understand how it handles default examples or query distributions. Alternatively, consider reaching out to the maintainers or community for more specific guidance on the package's structure.
@dosu I solved this problem, but a new problem occurred. In the function generator.adapt(language, evolutions=[simple, reasoning, conditional, multi_context], cache_dir='xxxx'), if I don't set the value of cache_dir, the program may not find the correct directory, because I am using Windows and the default cache_dir is '/.cache\ragas', which is invalid on Windows. But a new question came to me. After I run the following code:
language = "zh"
generator.adapt(language, evolutions=[simple, reasoning, conditional, multi_context],cache_dir='.\\ragas\\cache\\')
generator.save(evolutions=[simple, reasoning, multi_context,conditional])
A directory 'zh' was created in '.\ragas\cache\' and there are a lot of prompts stored as JSON files in the 'zh' directory. According to the documentation, there is no need to adapt the language again, and I can run the 'generator.generate_with_langchain_docs' function many times. Is that so? How can I figure out which language adaptation I am using?
Yes, once you've adapted the language and stored the prompts as JSON files in the cache directory, you don't need to adapt the language again. You can repeatedly run the generator.generate_with_langchain_docs function using the cached prompts.
To determine which language adaptation is being used, check the language parameter you set during the adapt function. In your case, it's "zh" for Chinese. The adapt function ensures that the correct language-specific JSON files are loaded from the cache directory when generating with generate_with_langchain_docs [1][2].
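As a rough illustration of that workflow (assuming the v0.1-style TestsetGenerator API used earlier in this thread, and a `documents` list of LangChain documents you have already loaded):
from ragas.testset.evolutions import simple, reasoning, multi_context, conditional
# One-time adaptation; the translated prompts are written to .\ragas\cache\zh
generator.adapt("zh", evolutions=[simple, reasoning, conditional, multi_context], cache_dir='.\\ragas\\cache\\')
generator.save(evolutions=[simple, reasoning, multi_context, conditional], cache_dir='.\\ragas\\cache\\')
# Afterwards the adapted prompts are reused, so generation can be run as often as needed
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=10,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
# To double-check which adaptation is on disk, just inspect the cache folder
import os
print(os.listdir('.\\ragas\\cache\\'))  # e.g. ['zh']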
@dosu Do I need to set the cache_dir value as a GLOBAL parameter? How can generator.generate_with_langchain_docs find which prompts it should use?
You don't need to set the cache_dir as a global parameter. It can be passed as an optional argument to the adapt and save methods. If not provided, it defaults to the result of get_cache_dir() [1].
The generator.generate_with_langchain_docs function finds which prompts to use through the Prompt class. Each prompt includes input_keys and an output_key, which help format the prompts with the provided examples. The Prompt class has methods like to_string and format to ensure the correct prompts are used during generation [1][2].
@dosu I mean, how can generator.generate_with_langchain_docs always use the prompts from a specific cache_dir? For example, I have a '.\ragas\cache\zh' directory in which a lot of prompt JSON files were created. How can I always run generate_with_langchain_docs with these prompt JSON files?
To ensure generator.generate_with_langchain_docs always uses the specific cache directory with your prompts, you should specify the cache_dir when calling the function. If the function doesn't have a direct parameter for cache_dir, you might need to set it globally or ensure it's set correctly in the environment or configuration used by the generator.
Here's a general approach:
Set the Cache Directory: When you adapt and save, specify the cache_dir as you've done:
generator.adapt(language, evolutions=[simple, reasoning, conditional, multi_context], cache_dir='.\\ragas\\cache\\')
Use the Cache Directory: Ensure that any function or method that relies on cached prompts is aware of the cache_dir. If generate_with_langchain_docs doesn't take a cache_dir parameter, check whether there's a configuration or environment variable that can be set.
Check Documentation: Verify whether there's a specific way to set the cache directory for generate_with_langchain_docs in the documentation or source code.
By ensuring the cache directory is consistently set, the generator should use the prompts from your specified location.
@dosu I finished your step 1, and there is already a directory '.\ragas\cache\zh' here. As to step 2, generate_with_langchain_docs doesn't take a cache_dir parameter, so where can I find the setting, and is that setting effective for the function 'generate_with_langchain_docs'? As to step 3, I haven't found a way to use a specific cache_dir when running 'generate_with_langchain_docs'. So do I need to run generator.adapt(language, evolutions=[simple, reasoning, conditional, multi_context], cache_dir='.\ragas\cache\') every time before I run generate_with_langchain_docs? And finally, another question occurred. If I run the adapt function a second time with the same cache_dir, an error occurs: 'output in example 1 is not in valid json format: Expecting value: line 1 column 1 (char 0) (type=value_error)'. I printed the three parameters passed to the function as follows: cache_dir: .\ragas\cache, language: zh, name: keyphrase_extraction. And in the cache_dir, there is already a directory named 'zh' which was created last time. Why, and how can I solve this?
To ensure generate_with_langchain_docs uses a specific cache directory, you need to set the cache directory globally using the RAGAS_CACHE_HOME environment variable. This will direct all functions to use the specified cache directory:
import os
os.environ['RAGAS_CACHE_HOME'] = '.\\ragas\\cache\\'
This way, you don't need to specify the cache_dir each time, and it will use the cached prompts from your specified directory.
Regarding the JSON format error when running the adapt function a second time with the same cache directory, it might be due to corrupted or improperly formatted JSON files. Ensure that the JSON files in the cache directory are valid and correctly formatted. If the adapted prompt is already present, adapt attempts to load it, so any corruption could cause errors. You might need to manually check and fix the JSON files in the cache directory [1][2].
@dosu hey, the JSON files were created by generator.save(evolutions=[simple, reasoning, multi_context,conditional]), so how could they be invalid, guy?
If the JSON files created by generator.save are invalid, it might be due to issues during the save process. Here are a few things to check:
Ensure Valid Data: Before saving, ensure that all data being serialized to JSON is valid. This includes checking for any non-serializable objects or incorrect data structures.
Check for Errors During Save: Look for any errors or warnings in the logs when running generator.save. These might indicate issues with the data being saved.
Inspect JSON Files: Manually inspect the JSON files to ensure they are complete and correctly formatted. Look for any unexpected characters or incomplete data.
Validation: Use a JSON validator to check the files for syntax errors.
If the issue persists, consider adding logging or debugging statements in the save process to capture more details about what might be going wrong.
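For the validation step above, a minimal check over the cached prompt files could look like this; it is plain standard-library Python, and the cache path is the one used earlier in this thread, so adjust it to your setup.
import json
from pathlib import Path
cache_dir = Path(r'.\ragas\cache\zh')  # directory created by generator.save in this thread
for prompt_file in sorted(cache_dir.glob('*.json')):
    try:
        json.loads(prompt_file.read_text(encoding='utf-8'))
    except json.JSONDecodeError as exc:
        # "Expecting value: line 1 column 1 (char 0)" usually means an empty or truncated file
        print(f'INVALID {prompt_file.name}: {exc}')
    else:
        print(f'ok      {prompt_file.name}')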
Your Question: As the Ragas documentation says in "Automatic language adaptation" (https://docs.ragas.io/en/v0.1.1/howtos/applications/use_prompt_adaptation.html#language-adaptation-for-metrics), we can adapt the language of the metrics to the document using the adapt function. But I suppose the documentation shows that I have to set the language in advance, because I have to know the language of the context I submit. The actual situation is that I have papers or documents in different languages such as English, Japanese, Chinese, and French. I would like to get a set of questions and ground-truth results written in exactly the same language as the context I submit. Is that possible? If so, how can I achieve it?