explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

Evaluating results in languages other than English #402

Open · gitrid opened this issue 6 months ago

gitrid commented 6 months ago

Ragas version: 0.0.22
Python version: 3.11.7

Can ragas correctly evaluate results in languages other than English? It seems that fully correct answers receive quite low scores. For example:

Query: Ile kosztuje opłata miesięczna za prowadzenie Konta Santander dla osoby w wieku 35 lat?
Answer: Opłata miesięczna za prowadzenie Konta Santander dla osoby w wieku 35 lat wynosi 0 zł, pod warunkiem spełnienia warunku zwolnienia z opłaty, którym jest płatność kartą lub BLIKIEM na łączną kwotę co najmniej 300 zł miesięcznie. Jeśli warunek nie jest spełniony, opłata wynosi 6 zł miesięcznie.
Expected: Opłata miesięczna wyniesie 0 zł, jeżeli osoba dokona płatności kartą lub BLIKIEM na łączną kwotę co najmniej 300 zł. Jeżeli nie, opłata wynosi 6 zł.

Results:
  ANSWER CORRECTNESS: 0.48
  Faithfulness: 0.67
  Answer Relevancy: 0.91
  Context Precision: 0.00
  Context Recall: 1.00
  Context Relevancy: 0.05

Translation:

Query: How much is the monthly fee for a Santander Account for a person aged 35?
Answer: The monthly fee for a Santander Account for a person aged 35 is PLN 0, provided that the condition for exemption from the fee is met, which is payment with a card or BLIK for a total amount of at least PLN 300 per month. If the condition is not met, the fee is PLN 6 per month.
Expected: The monthly fee will be PLN 0 if the person makes payments with a card or BLIK for a total amount of at least PLN 300. If not, the fee is PLN 6.

As the answer is exactly correct, I would expect answer correctness to be close to 1. However, it is only 0.48, whereas many other queries score as high as 0.8 with wrong answers. Can this be optimized? Is this an issue with non-English languages?
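
For reference, here is a minimal sketch of how such an evaluation is typically wired up with these metrics (the dataset values are abridged placeholders, and the exact setup is illustrative rather than the reporter's actual code):

# sketch of a ragas 0.0.22-era evaluation run; dataset values abridged
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_relevancy,
    context_precision,
    context_recall,
    context_relevancy,
    faithfulness,
)

ds = Dataset.from_dict({
    "question": ["Ile kosztuje opłata miesięczna za prowadzenie Konta Santander dla osoby w wieku 35 lat?"],
    "answer": ["Opłata miesięczna za prowadzenie Konta Santander dla osoby w wieku 35 lat wynosi 0 zł, ..."],
    "contexts": [["<retrieved passage with the Santander fee conditions>"]],
    "ground_truths": [["Opłata miesięczna wyniesie 0 zł, jeżeli ..."]],
})

scores = evaluate(
    ds,
    metrics=[answer_correctness, faithfulness, answer_relevancy,
             context_precision, context_recall, context_relevancy],
)
print(scores)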

shahules786 commented 6 months ago

Hi @gitrid, we are working on something for this in #407. Your results should improve once it is merged and released :)

mariamaslam commented 6 months ago

@shahules786 please can you look at the issue I posted? It would be highly appreciated.

weissenbacherpwc commented 6 months ago

@shahules786 is this merged already, and how can you choose the specific language?

shahules786 commented 6 months ago

Hi @gitrid Yes, it's merged. We are just preparing the docs for it, but you can try it out.

prerequisite

  • install ragas from source

## step 1: import any metrics you like
from ragas.metrics import faithfulness

## step 2: adapt to any language
faithfulness.adapt(language="hindi")

## step 3: save the adapted prompt for later reuse
faithfulness.save()

## step 4: load your dataset and evaluate as usual
from datasets import load_dataset
from ragas import evaluate

your_dataset = load_dataset("your_dataset")
ragas_score = evaluate(your_dataset, metrics=[faithfulness])

Make sure to use your best LLM while doing language adaptation. You can also view the saved prompts under .cache/ragas.

Your feedback will be very valuable to us :)
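
For example, the cached prompts can be inspected with a few lines of Python (a sketch only: it assumes the cache sits under ~/.cache/ragas and stores one JSON file per adapted prompt, as the dump later in this thread suggests):

# sketch: inspect the adapted prompts cached by ragas; the exact directory
# layout under .cache/ragas may differ by version
import json
from pathlib import Path

cache_dir = Path.home() / ".cache" / "ragas"
for prompt_file in sorted(cache_dir.rglob("*.json")):
    prompt = json.loads(prompt_file.read_text())
    print(prompt_file.name, "->", prompt.get("language", "?"))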

shahules786 commented 6 months ago

Hi @gitrid, were you able to use it? We have just merged the feature and docs for it: https://docs.ragas.io/en/latest/howtos/applications/use_prompt_adaptation.html Let us know your thoughts!

weissenbacherpwc commented 6 months ago

Generally it works, @shahules786, thanks!

However, bringing your own LLM only works with ragas==0.0.22. After upgrading ragas from source, from ragas.llms import LangchainLLM no longer works. So for my use case it is not really helping, as I want to evaluate with my own LLM in a specific language.

Or how can I load LangchainLLM alternatively?

weissenbacherpwc commented 6 months ago

> Hi @gitrid Yes, it's merged. We are just preparing the docs for it, but you can try it out. [...]

Edit: It does not work on my side, even with the default OpenAI model. I loaded my OpenAI key and executed the code you provided (with ragas installed from source, ragas-0.0.23.dev25+gbad3a8e). When adapting the language with faithfulness.adapt(language="hindi"), I get this error: AssertionError: LLM is not set.

So for me it works neither with the OpenAI model nor with my own model.
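
A plausible workaround, sketched on the assumption that the assertion simply means no LLM is attached to the metric before adapt() is called (ChatOpenAI/gpt-4 is an illustrative choice, and the wrapper import is the one discussed further down this thread):

# sketch: attach an evaluator LLM to the metric before adapting, which is
# presumably what the assertion is complaining about; model choice is illustrative
from langchain.chat_models import ChatOpenAI
from ragas.llms.base import LangchainLLMWrapper
from ragas.metrics import faithfulness

faithfulness.llm = LangchainLLMWrapper(ChatOpenAI(model_name="gpt-4"))
faithfulness.adapt(language="hindi")
faithfulness.save()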

shahules786 commented 6 months ago

Hi @gitrid, that's strange. Can you share the full error trace?

mariamaslam commented 6 months ago

Hi @shahules786, hope you are doing great. I was able to resolve the validation issue I was getting on my end. Now, for better reporting, I want to make my results more presentable to management. I followed this: https://docs.ragas.io/en/latest/howtos/customisations/azure-openai.html

Please can you guide me, if possible?

weissenbacherpwc commented 6 months ago

If you mean me, @shahules786, here is the full error trace:

[screenshot: full error trace]

shahules786 commented 6 months ago

Hi @weissenbacherpwc, apologies for the late reply. This was caused by a recent merge; I have merged a fix and updated the docs for adaptation. Please follow the updated docs here: https://docs.ragas.io/en/latest/howtos/applications/use_prompt_adaptation.html

shahules786 commented 6 months ago

Hi @mariamaslam, how can I help? Feel free to join our Discord server, where our community can help you with queries: https://discord.gg/5djav8GGNZ

weissenbacherpwc commented 6 months ago

Nice @shahules786, now it works with a ChatOpenAI model (gpt-4). So when evaluating, it uses gpt-4 as well, right? Is there a way to use my own LLM (e.g. Mixtral) for evaluation with a prompt in the specific language? When trying to switch to my own LLM, this error persists: ImportError: cannot import name 'LangchainLLM' from 'ragas.llms' (with the installation from source).

shahules786 commented 6 months ago

Hi @weissenbacherpwc, no. In the example I used gpt-4 to adapt, and it will only be used for adaptation. For evaluation you can use any model of your choice. There is a slight variation in using LLMs with langchain:

from ragas.llms.base import LangchainLLMWrapper

The rest of the steps should work as is.
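
Put together, the flow looks roughly like this (a sketch: ChatOpenAI is a stand-in for whatever LangChain LLM you wrap, e.g. a local Mixtral):

# sketch: wrap any LangChain LLM for ragas and evaluate with it; ChatOpenAI
# is a stand-in, a local model behind a LangChain interface slots in the same way
from datasets import load_dataset
from langchain.chat_models import ChatOpenAI
from ragas import evaluate
from ragas.llms.base import LangchainLLMWrapper
from ragas.metrics import faithfulness

faithfulness.llm = LangchainLLMWrapper(ChatOpenAI(model_name="gpt-4"))
your_dataset = load_dataset("your_dataset")
ragas_score = evaluate(your_dataset, metrics=[faithfulness])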

weissenbacherpwc commented 6 months ago

Got it! I will try it out and let you know if it works! Good to know that the import for langchain LLMs changed. Short edit: I think there is a typo and it should be from ragas.llms.base import LangchainLLMWrapper.

shahules786 commented 6 months ago

@gitrid @weissenbacherpwc Just updated the docs. Refer to the latest docs for all recent changes.

weissenbacherpwc commented 6 months ago

I tried it out; unfortunately, it doesn't work. E.g. when using faithfulness with my own LLM, it seems to compute the score, but the result is a nan object.

I set the faithfulness prompt to German (that worked). Afterwards I bring in my own LLM:

from langchain.llms import LlamaCpp
from ragas.llms.base import LangchainLLMWrapper

# load the local GGUF model via LangChain (LlamaCpp here), then wrap it for ragas
vllm = LangchainLLMWrapper(LlamaCpp(model_path="mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf"))
faithfulness.llm = vllm

result = faithfulness.score(mistral_mixtral_prompt["train"][0])

When inspecting result afterwards, I get nan. However, in the cell output I can see that statements are being extracted.

My dataset is structured as follows:

{'question': 'Was ist der Unterschied zwischen einem Blog und einem Forum?',
 'answer': 'Ein Blog beschreibt....',
 'contexts': ['für den Wissensaustausch eignen....'],
 'ground_truths': ['Ein Blog ist ein OnlineJournal...']}
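
For completeness, the same rows can be packed into the Dataset shape that evaluate() expects (a sketch reusing the abridged strings above):

# sketch: build a HF Dataset with the structure shown above, so the same
# rows can be fed to evaluate() as well as to metric.score()
from datasets import Dataset

ds = Dataset.from_dict({
    "question": ["Was ist der Unterschied zwischen einem Blog und einem Forum?"],
    "answer": ["Ein Blog beschreibt..."],
    "contexts": [["für den Wissensaustausch eignen..."]],
    "ground_truths": [["Ein Blog ist ein OnlineJournal..."]],
})
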
shahules786 commented 6 months ago

@weissenbacherpwc Apologies for the late reply; I was traveling. I can help you out in person and debug the issue, if you would like. Feel free to book a slot here: https://calendly.com/shahules/30min

MatthiasEg commented 3 months ago

@weissenbacherpwc @shahules786 Did you manage to fully evaluate in a language other than English, without using API LLMs? Could you provide a short snippet for this?

I currently run into AssertionError: Adapted output keys do not match with the original output keys during nli_statements_message.adapt() inside faithfulness.adapt().

baswenneker commented 1 month ago

The automatic language adaptation is a true mess. I tried with several models, including gpt-4o, but it fails every time. Here's an example of a GPT-4o-adapted score_context.json for Dutch. Note that the first example's context still contains the translate_to scaffolding, and in the second and third examples the integer scores have been replaced with translated criterion names, which breaks the declared JSON schema:

{
    "name": "score_context",
    "instruction": "\n    Given a context, perform the following task and output the answer in VALID JSON format: Assess the provided context and assign a numerical score of 1 (Low), 2 (Medium), or 3 (High) for each of the following criteria in your JSON response:\n\nclarity: Evaluate the precision and understandability of the information presented. High scores (3) are reserved for contexts that are both precise in their information and easy to understand. Low scores (1) are for contexts where the information is vague or hard to comprehend.\ndepth: Determine the level of detailed examination and the inclusion of innovative insights within the context. A high score indicates a comprehensive and insightful analysis, while a low score suggests a superficial treatment of the topic.\nstructure: Assess how well the content is organized and whether it flows logically. High scores are awarded to contexts that demonstrate coherent organization and logical progression, whereas low scores indicate a lack of structure or clarity in progression.\nrelevance: Judge the pertinence of the content to the main topic, awarding high scores to contexts tightly focused on the subject without unnecessary digressions, and low scores to those that are cluttered with irrelevant information.\nStructure your JSON output to reflect these criteria as keys with their corresponding scores as values\n    ",
    "output_format_instruction": "The output should be a well-formatted JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {\"properties\": {\"foo\": {\"title\": \"Foo\", \"description\": \"a list of strings\", \"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"foo\"]}\nthe object {\"foo\": [\"bar\", \"baz\"]} is a well-formatted instance of the schema. The object {\"properties\": {\"foo\": [\"bar\", \"baz\"]}} is not well-formatted.\n\nHere is the output JSON schema:\n```\n{\"type\": \"object\", \"properties\": {\"clarity\": {\"title\": \"Clarity\", \"type\": \"integer\"}, \"depth\": {\"title\": \"Depth\", \"type\": \"integer\"}, \"structure\": {\"title\": \"Structure\", \"type\": \"integer\"}, \"relevance\": {\"title\": \"Relevance\", \"type\": \"integer\"}}, \"required\": [\"clarity\", \"depth\", \"structure\", \"relevance\"]}\n```\n\nDo not return any preamble or explanations, return only a pure JSON string surrounded by triple backticks (```).",
    "examples": [
        {
            "context": "translate_to: Dutch\ninput: The Pythagorean theorem is a fundamental principle in geometry. It states that in a right-angled triangle, the square of the length of the hypotenuse (the side opposite the right angle) is equal to the sum of the squares of the lengths of the other two sides. This can be written as a^2 + b^2 = c^2 where c represents the length of the hypotenuse, and a and b represent the lengths of the other two sides.\noutput: De stelling van Pythagoras is een fundamenteel principe in de meetkunde. Het stelt dat in een rechthoekige driehoek het kwadraat van de lengte van de hypotenusa (de zijde tegenover de rechte hoek) gelijk is aan de som van de kwadraten van de lengtes van de andere twee zijden. Dit kan worden geschreven als a^2 + b^2 = c^2, waarbij c de lengte van de hypotenusa vertegenwoordigt en a en b de lengtes van de andere twee zijden vertegenwoordigen.",
            "output": {
                "clarity": 3,
                "depth": 1,
                "structure": 3,
                "relevance": 3
            }
        },
        {
            "context": "Albert Einstein (14 maart 1879 - 18 april 1955) was een in Duitsland geboren theoretisch natuurkundige die algemeen wordt beschouwd als een van de grootste en meest invloedrijke wetenschappers aller tijden.",
            "output": {
                "clarity": "duidelijkheid",
                "depth": "diepte",
                "structure": "structuur",
                "relevance": "relevantie"
            }
        },
        {
            "context": "translate_to: Dutch\ninput: I love chocolate. It's really tasty. Oh, and by the way, the earth orbits the sun, not the other way around. Also, my favorite color is blue.\noutput: Ik hou van chocolade. Het is echt lekker. Oh, en trouwens, de aarde draait om de zon, niet andersom. Ook is mijn favoriete kleur blauw.",
            "output": {
                "clarity": "duidelijkheid",
                "depth": "diepte",
                "structure": "structuur",
                "relevance": "relevantie"
            }
        }
    ],
    "input_keys": [
        "context"
    ],
    "output_key": "output",
    "output_type": "json",
    "language": "Dutch"
}

Susensio commented 1 month ago

> @weissenbacherpwc @shahules786 Did you manage to fully evaluate in a language other than English, without using API LLMs? Could you provide a short snippet for this?
>
> I currently run into AssertionError: Adapted output keys do not match with the original output keys during nli_statements_message.adapt() inside faithfulness.adapt().

I'm getting the same error. Did you solve it?