explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

Automatic language adaptation TestSet generation error: "Adapted output keys do not match with the original output keys" #774

Open baswenneker opened 3 months ago

baswenneker commented 3 months ago

Describe the bug: I can't get automatic language adaptation working for testset generation. I retried this about 10 times.

Ragas version: 0.1.4 Python version: 3.11

Code to Reproduce:

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context, conditional
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# generator with Azure OpenAI models
# (azure_llm() and azure_embeddings() are user-defined helpers;
# see the sketch under "Additional context" below)
generator_llm = azure_llm()
critic_llm = azure_llm()
embeddings = azure_embeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

# adapt to language
language = "Dutch"
cache_dir = ".cache"

generator.adapt(language, evolutions=[simple], cache_dir=cache_dir)
generator.save(evolutions=[simple, reasoning, multi_context, conditional], cache_dir=cache_dir)

This is the output up to the point where it errors out:

{'keyphrases': ['Zwart gat', 'Regio van ruimtetijd', 'Sterke zwaartekracht', 'Licht en elektromagnetische golven', 'Theorie van algemene relativiteit']}
{'keyphrases': ['Chinese Muur', 'Oude vestingwerken', 'Noord-China']}
{'answer': 'Menselijke activiteiten dragen voornamelijk bij aan klimaatverandering door de uitstoot van broeikasgassen bij het verbranden van fossiele brandstoffen. Deze uitstoot verhoogt de concentratie van broeikasgassen in de atmosfeer, wat meer warmte vasthoudt en leidt tot opwarming van de aarde en veranderde weerspatronen.', 'verdict': '1'}
{'answer': 'Kunstmatige intelligentie is ontworpen om menselijke cognitieve functies na te bootsen, met belangrijke capaciteiten zoals leren, redeneren, waarnemen en reageren op de omgeving op een manier die vergelijkbaar is met mensen. Deze capaciteiten maken AI cruciaal in verschillende velden, inclusief gezondheidszorg en autonoom rijden.', 'verdict': '1'}
{'answer': 'Het antwoord op de gegeven vraag is niet aanwezig in de context', 'verdict': '-1'}
{'relevant_contexts': [1, 2]}
[[1, 2], {'relevant_contexts': [1, 2]}]
{'score': 6.0}
[{'statements': ['अल्बर्ट आइंस्टीन का जन्म जर्मनी में हुआ था।', 'अल्बर्ट आइंस्टीन अपने सापेक्षता के सिद्धांत के लिए सबसे अधिक प्रसिद्ध थे।']}, {'feedback': "De vraag is te vaag en breed, het vraagt om een 'ontdekking over de ruimte' zonder een specifiek aspect, tijdskader of context van interesse te specificeren. Dit kan verwijzen naar een breed scala aan onderwerpen, van de ontdekking van nieuwe hemellichamen tot vooruitgang in de technologie van ruimtereizen. Om de duidelijkheid en beantwoordbaarheid te verbeteren, zou de vraag het type ontdekking (bijv. astronomisch, technologisch), het tijdskader (bijv. recent, historisch) of de context (bijv. binnen een specifieke onderzoeksstudie of ruimtemissie) kunnen specificeren.", 'verdict': '0'}]

Error trace

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[4], line 20
     17 language = "Dutch"
     18 cache_dir = ".cache"
---> 20 generator.adapt(language, evolutions=[simple], cache_dir=cache_dir)
     21 generator.save(evolutions=[simple, reasoning, multi_context, conditional], cache_dir=cache_dir)

File ~/Development/HeadingFWD/evaluation-playground/.venv/lib/python3.11/site-packages/ragas/testset/generator.py:311, in TestsetGenerator.adapt(self, language, evolutions, cache_dir)
    309 self.init_evolution(evolution)
    310 evolution.init()
--> 311 evolution.adapt(language, cache_dir=cache_dir)

File ~/Development/HeadingFWD/evaluation-playground/.venv/lib/python3.11/site-packages/ragas/testset/evolutions.py:324, in SimpleEvolution.adapt(self, language, cache_dir)
    323 def adapt(self, language: str, cache_dir: t.Optional[str] = None) -> None:
--> 324     super().adapt(language, cache_dir)
    325     self.seed_question_prompt = self.seed_question_prompt.adapt(
    326         language, self.generator_llm, cache_dir
    327     )

File ~/Development/HeadingFWD/evaluation-playground/.venv/lib/python3.11/site-packages/ragas/testset/evolutions.py:261, in Evolution.adapt(self, language, cache_dir)
    255 self.rewrite_invalid_question_prompt = (
    256     self.rewrite_invalid_question_prompt.adapt(
    257         language, self.generator_llm, cache_dir
    258     )
    259 )
    260 self.node_filter.adapt(language, cache_dir)
--> 261 self.question_filter.adapt(language, cache_dir)

File ~/Development/HeadingFWD/evaluation-playground/.venv/lib/python3.11/site-packages/ragas/testset/filters.py:97, in QuestionFilter.adapt(self, language, cache_dir)
     93 def adapt(self, language: str, cache_dir: t.Optional[str] = None) -> None:
     94     """
     95     Adapt the filter to a different language.
     96     """
---> 97     self.filter_question_prompt = self.filter_question_prompt.adapt(
     98         language, self.llm, cache_dir
     99     )

File ~/Development/HeadingFWD/evaluation-playground/.venv/lib/python3.11/site-packages/ragas/llms/prompt.py:236, in Prompt.adapt(self, language, llm, cache_dir)
    230             assert (
    231                 set(output.keys()) == output_keys[i]
    232             ), f"Adapted output keys {set(output.keys())=} do not match with the original output keys: {output_keys[i]=}"
    233         elif isinstance(output, list) and all(
    234             isinstance(item, dict) for item in output
    235         ):
--> 236             assert all(
    237                 set(item.keys()) in output_keys[i] for item in output
    238             ), "Adapted output keys do not match with the original output keys"
    240     self.examples[i] = example_dict
    242 self.language = language

AssertionError: Adapted output keys do not match with the original output keys

Expected behavior: No error!

Additional context: Using LangChain with an Azure OpenAI endpoint.
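
For reference, azure_llm() and azure_embeddings() in the snippet above are user-defined helpers, not part of ragas. A minimal sketch of what they might look like with langchain_openai; the deployment names and API version are placeholders, and the endpoint and key come from the standard AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY environment variables:

from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings

def azure_llm():
    # placeholder deployment name; endpoint and key are read from the
    # AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY environment variables
    return AzureChatOpenAI(
        azure_deployment="gpt-4",
        openai_api_version="2023-07-01-preview",
    )

def azure_embeddings():
    # placeholder deployment name for the embedding model
    return AzureOpenAIEmbeddings(
        azure_deployment="text-embedding-ada-002",
        openai_api_version="2023-07-01-preview",
    )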

shahules786 commented 3 months ago

Hey @baswenneker, thanks for reporting the issue. I would recommend trying it again with GPT-4.

Meanwhile I will work on a fix for it.

baswenneker commented 3 months ago

@shahules786 I'm using GPT-4 already. Tried like 20 times without any luck. A manual on how to rewrite the prompts by hand would be nice!
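
Until there is an official guide, one way to rewrite the prompts by hand: generator.adapt(...) caches each adapted prompt as a JSON file under the cache_dir you pass (in 0.1.x this appears to be a per-language subfolder, one file per prompt; verify the layout against your install). You can edit those files directly, then check that every example still carries the expected output key, which is exactly what the failing assert enforces; on the next run, adapt should pick up the cached files instead of re-translating. A rough sketch under those assumptions:

import json
from pathlib import Path

def check_prompt_cache(cache_dir: str, language: str) -> None:
    # Each cached prompt is a serialized ragas Prompt with an
    # "output_key" and a list of "examples" (dicts of key -> value).
    for path in Path(cache_dir, language).glob("*.json"):
        prompt = json.loads(path.read_text())
        output_key = prompt.get("output_key")
        for i, example in enumerate(prompt.get("examples", [])):
            if output_key not in example:
                print(f"{path.name}: example {i} is missing key '{output_key}'")

check_prompt_cache(".cache", "Dutch")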

shahules786 commented 3 months ago

Hey @baswenneker, let me try this out. We are currently changing some structures related to prompts, so I can test your case as well. Thank you.

baswenneker commented 3 months ago

Cool, let me know if I can help @shahules786!

baswenneker commented 3 months ago

@shahules786 I added an extra set of examples to the translation prompts and it worked. I made a pull request for this:

https://github.com/explodinggradients/ragas/pull/826

adrienB134 commented 2 months ago

Had the same issue for French, so I made a pull request adding examples for French: #857

louky123 commented 2 months ago

I had the same issue, indeed for Dutch. It occurs because {'relevant_contexts': [1, 2]} cannot be translated in the adapt function, so example[-1] in prompt.py ends up in a mangled form with extra text appended to it. json_loader._safe_load(example[-1], llm) then returns an empty dict {}, which does not match output_keys[i], which is 'relevant_contexts'. I fixed it by replacing:

example_dict[self.output_key] = (
    json_loader._safe_load(example[-1], llm)
    if self.output_type.lower() == "json"
    else example[-1]
)

with:

if self.output_type.lower() == "json":
    example_dict[self.output_key] = json_loader._safe_load(example[-1], llm)
    if example_dict[self.output_key] == {}:
        # fall back to extracting the dictionary part via string slicing
        dict_str = example[-1].split('(')[0].strip()
        example_dict[self.output_key] = ast.literal_eval(dict_str)
else:
    example_dict[self.output_key] = example[-1]

(This also needs an import ast at the top of prompt.py.) It strips example[-1] down to the dictionary literal and parses that. I know it's not the neatest solution, and I will try to improve it. Hope it helps.
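
A note on why this fallback works: the extra text the LLM appends here presumably starts with an opening parenthesis, so split('(')[0] trims it off, and ast.literal_eval is used rather than json.loads because the remaining literal has Python-style single quotes. It assumes the mangling always takes that parenthesized form.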

mattevsz commented 1 month ago

Same issue here! Fellow Dutchie ;) I did the following: converted the metric prompts to Dutch with GPT-4o (attached as ragas_metrics_nl.txt), and voila!
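
For the metric prompts there is also a built-in route in 0.1.x: the docs describe a ragas.adaptation.adapt helper for adapting metrics to a target language. A sketch; the metric selection and LLM are illustrative, and the same key-mismatch caveat from this issue may apply:

from ragas.adaptation import adapt
from ragas.metrics import faithfulness, answer_relevancy
from langchain_openai import AzureChatOpenAI

# illustrative model/deployment; any LangChain chat model should work
llm = AzureChatOpenAI(
    azure_deployment="gpt-4",
    openai_api_version="2023-07-01-preview",
)

# rewrites each metric's few-shot prompt examples into Dutch and caches them,
# mirroring what TestsetGenerator.adapt does for the testset prompts
adapt(metrics=[faithfulness, answer_relevancy], language="dutch", llm=llm)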