gitrid opened this issue 6 months ago
Hi @gitrid, we are working on something for this in #407. Your results should improve once it is merged and released :)
@shahules786 please can you look at the issue I posted? It would be highly appreciated.
@shahules786 is this merged already and how can you choose the specific language?
Hi @gitrid Yes, it's merged. We are still preparing the docs for it, but you can already try it out.
Prerequisite:

- install ragas from source

```python
# step 1: import any metric you like
from ragas.metrics import faithfulness

# step 2: adapt it to any language
faithfulness.adapt(language="hindi")

# step 3: save the adapted prompt for later reuse
faithfulness.save()

# step 4: load your dataset and evaluate as usual
your_dataset = load_dataset("your_dataset")
ragas_score = evaluate(your_dataset, metrics=[faithfulness])
```

Make sure to use your best LLM while doing the language adaptation. You can also view the saved prompt under `.cache/ragas`.
Your feedback will be very valuable to us :)
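Under the hood, adaptation asks the LLM to translate each few-shot example in a metric's prompt while keeping the structured keys intact, then checks that the keys survived. Here is a minimal self-contained sketch of that idea, with a stub translator standing in for a real LLM — none of these function names are the actual ragas API:

```python
import json

def adapt_example(example: dict, translate) -> dict:
    """Translate the free-text fields of a few-shot example while
    keeping the key structure identical to the original."""
    return {key: translate(value) for key, value in example.items()}

def validate_adapted(original: dict, adapted: dict) -> None:
    # sanity check: the adapted example must expose the same keys
    assert adapted.keys() == original.keys(), \
        "Adapted output keys do not match with the original output keys"

# stub "LLM" that pretends to translate to Hindi
fake_translate = lambda text: f"[hi] {text}"

original = {"statements": "The sky is blue.", "verdict": "yes"}
adapted = adapt_example(original, fake_translate)
validate_adapted(original, adapted)
print(json.dumps(adapted, ensure_ascii=False))
```

If the model rewrites the key names instead of only the values, the key check fails, which is why a strong model is recommended for adaptation.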
Hi @gitrid , were you able to use it? We have just merged the feature and docs for it https://docs.ragas.io/en/latest/howtos/applications/use_prompt_adaptation.html Let us know your thoughts
Generally it works @shahules786 thanks!
However, bringing your own LLM only works with `ragas==0.0.22`. After upgrading ragas from source, I can no longer do `from ragas.llms import LangchainLLM`. So for my use case it does not really help, as I want to evaluate with my own LLM in a specific language. How can I load `LangchainLLM` otherwise?
Edit: It does not work on my side, even with the default OpenAI model. I loaded my OpenAI key and executed the code you provided (with ragas installed from source, `ragas-0.0.23.dev25+gbad3a8e`). When adapting the language with `faithfulness.adapt(language="hindi")` I get this error: `AssertionError: LLM is not set`.
So for me it works neither with the OpenAI model nor with my own model.
Hi @gitrid , that's strange. Can you share the full error trace?
Hi @shahules786, hope you are doing great. I was able to resolve the validation issue I was getting on my end. Now I want to improve my reporting: I followed https://docs.ragas.io/en/latest/howtos/customisations/azure-openai.html and would like to make the results more presentable for management. Can you guide me, if possible?
if you mean me @shahules786 here is the full error trace:
Hi @weissenbacherpwc , apologies for the late reply. This was caused due to a recent merge, I have merged a fix and updated the docs for adaptation. Please follow the updated docs here https://docs.ragas.io/en/latest/howtos/applications/use_prompt_adaptation.html
Hi @mariamaslam how can I help? Feel free to join our discord server where our community can help you with queries. https://discord.gg/5djav8GGNZ
Nice @shahules786, now it works with a ChatOpenAI model (gpt-4). So when evaluating, it is using gpt-4 as well, right?
Is there a way to use my own LLM (e.g. Mixtral) for evaluation with a prompt in the specific language? When trying to switch to my own LLM, this error persists (with the installation from source): `ImportError: cannot import name 'LangchainLLM' from 'ragas.llms'`
Hi @weissenbacherpwc, no: in the example I used gpt-4 only to adapt, so it is used only for adaptation. For evaluation you can use any model of your choice. There is a slight variation in using LLMs with langchain:
`from ragas.llms.base import LangchainLLMWrapper`
The rest of the steps should work as is.
got it! will try out and let you know if it works! But good to know that the import for langchain llms changed.
Short edit: I think there is a typo and it should be `from ragas.llms.base import LangchainLLMWrapper`
@gitrid @weissenbacherpwc Just updated the docs. Refer to the latest docs for all recent changes
I tried it out; unfortunately it doesn't work. E.g. when using faithfulness with my own LLM, it seems to compute the score, but the result is a `nan` object.
I set the faithfulness prompt to German (that worked). Afterwards I bring my own LLM:

```python
from ragas.llms.base import LangchainLLMWrapper

# mixtral_llm: a LangChain LLM loaded from mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf
vllm = LangchainLLMWrapper(mixtral_llm)
faithfulness.llm = vllm
result = faithfulness.score(mistral_mixtrtal_prompt["train"][0])
```

When calling `result` afterwards, I get `nan`. However, in the cell output I can see that statements are extracted.
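For context on how a `nan` can appear even though statements were extracted: faithfulness is essentially the fraction of answer statements that the judge LLM marks as supported by the context. If the follow-up judging call produces output that cannot be parsed, there are no verdicts to average, and the score falls through to `nan`. A simplified illustration (not the actual ragas implementation):

```python
import math

def faithfulness_score(verdicts: list[bool]) -> float:
    """Simplified faithfulness: the fraction of extracted answer
    statements that the judge LLM marked as supported by the context."""
    if not verdicts:  # e.g. the judge's output could not be parsed
        return float("nan")
    return sum(verdicts) / len(verdicts)

print(faithfulness_score([True, True, False]))  # 0.666...
print(math.isnan(faithfulness_score([])))       # True
```

With a local model like Mixtral, a common failure mode is the judge not emitting the exact structured format the parser expects, which yields the empty-verdict case above.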
My dataset is structured as follows:

```python
{'question': 'Was ist der Unterschied zwischen einem Blog und einem Forum?',
 'answer': 'Ein Blog beschreibt ...',
 'contexts': ['für den Wissensaustausch eignen ...'],
 'ground_truths': ['Ein Blog ist ein OnlineJournal ...']}
```
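That row layout matches what the 0.0.x-era ragas metrics expect. Since a silent schema mismatch is another way to end up with `nan` scores, a quick sanity check before evaluating can help (this helper is illustrative, not part of ragas):

```python
def check_ragas_columns(row: dict) -> None:
    """Sanity-check one dataset row against the column layout used in
    the thread above: question/answer as strings, contexts and
    ground_truths as lists of strings."""
    expected = {
        "question": str,
        "answer": str,
        "contexts": list,
        "ground_truths": list,
    }
    for col, typ in expected.items():
        assert col in row, f"missing column: {col}"
        assert isinstance(row[col], typ), f"{col} should be a {typ.__name__}"

row = {
    "question": "Was ist der Unterschied zwischen einem Blog und einem Forum?",
    "answer": "Ein Blog beschreibt ...",
    "contexts": ["für den Wissensaustausch eignen ..."],
    "ground_truths": ["Ein Blog ist ein OnlineJournal ..."],
}
check_ragas_columns(row)  # passes silently
```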
@weissenbacherpwc Apologies for the late reply as I was traveling. I can help you out in person and debug the issue, would you like that? Feel free to book a slot here https://calendly.com/shahules/30min
@weissenbacherpwc @shahules786 Did you manage to fully evaluate in a language other than English without using API LLMs? Could you provide a short snippet for this?
I currently run into `AssertionError: Adapted output keys do not match with the original output keys` during `nli_statements_message.adapt()` inside `faithfulness.adapt()`.
The automatic language adaptation is a true mess. I tried several models, including gpt-4o, but it fails every time. Here's an example of a GPT-4o-adapted `score_context.json` for Dutch:
```json
{
    "name": "score_context",
    "instruction": "\n Given a context, perform the following task and output the answer in VALID JSON format: Assess the provided context and assign a numerical score of 1 (Low), 2 (Medium), or 3 (High) for each of the following criteria in your JSON response:\n\nclarity: Evaluate the precision and understandability of the information presented. High scores (3) are reserved for contexts that are both precise in their information and easy to understand. Low scores (1) are for contexts where the information is vague or hard to comprehend.\ndepth: Determine the level of detailed examination and the inclusion of innovative insights within the context. A high score indicates a comprehensive and insightful analysis, while a low score suggests a superficial treatment of the topic.\nstructure: Assess how well the content is organized and whether it flows logically. High scores are awarded to contexts that demonstrate coherent organization and logical progression, whereas low scores indicate a lack of structure or clarity in progression.\nrelevance: Judge the pertinence of the content to the main topic, awarding high scores to contexts tightly focused on the subject without unnecessary digressions, and low scores to those that are cluttered with irrelevant information.\nStructure your JSON output to reflect these criteria as keys with their corresponding scores as values\n ",
    "output_format_instruction": "The output should be a well-formatted JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {\"properties\": {\"foo\": {\"title\": \"Foo\", \"description\": \"a list of strings\", \"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"foo\"]}\nthe object {\"foo\": [\"bar\", \"baz\"]} is a well-formatted instance of the schema. The object {\"properties\": {\"foo\": [\"bar\", \"baz\"]}} is not well-formatted.\n\nHere is the output JSON schema:\n```\n{\"type\": \"object\", \"properties\": {\"clarity\": {\"title\": \"Clarity\", \"type\": \"integer\"}, \"depth\": {\"title\": \"Depth\", \"type\": \"integer\"}, \"structure\": {\"title\": \"Structure\", \"type\": \"integer\"}, \"relevance\": {\"title\": \"Relevance\", \"type\": \"integer\"}}, \"required\": [\"clarity\", \"depth\", \"structure\", \"relevance\"]}\n```\n\nDo not return any preamble or explanations, return only a pure JSON string surrounded by triple backticks (```).",
    "examples": [
        {
            "context": "translate_to: Dutch\ninput: The Pythagorean theorem is a fundamental principle in geometry. It states that in a right-angled triangle, the square of the length of the hypotenuse (the side opposite the right angle) is equal to the sum of the squares of the lengths of the other two sides. This can be written as a^2 + b^2 = c^2 where c represents the length of the hypotenuse, and a and b represent the lengths of the other two sides.\noutput: De stelling van Pythagoras is een fundamenteel principe in de meetkunde. Het stelt dat in een rechthoekige driehoek het kwadraat van de lengte van de hypotenusa (de zijde tegenover de rechte hoek) gelijk is aan de som van de kwadraten van de lengtes van de andere twee zijden. Dit kan worden geschreven als a^2 + b^2 = c^2, waarbij c de lengte van de hypotenusa vertegenwoordigt en a en b de lengtes van de andere twee zijden vertegenwoordigen.",
            "output": {
                "clarity": 3,
                "depth": 1,
                "structure": 3,
                "relevance": 3
            }
        },
        {
            "context": "Albert Einstein (14 maart 1879 - 18 april 1955) was een in Duitsland geboren theoretisch natuurkundige die algemeen wordt beschouwd als een van de grootste en meest invloedrijke wetenschappers aller tijden.",
            "output": {
                "clarity": "duidelijkheid",
                "depth": "diepte",
                "structure": "structuur",
                "relevance": "relevantie"
            }
        },
        {
            "context": "translate_to: Dutch\ninput: I love chocolate. It's really tasty. Oh, and by the way, the earth orbits the sun, not the other way around. Also, my favorite color is blue.\noutput: Ik hou van chocolade. Het is echt lekker. Oh, en trouwens, de aarde draait om de zon, niet andersom. Ook is mijn favoriete kleur blauw.",
            "output": {
                "clarity": "duidelijkheid",
                "depth": "diepte",
                "structure": "structuur",
                "relevance": "relevantie"
            }
        }
    ],
    "input_keys": [
        "context"
    ],
    "output_key": "output",
    "output_type": "json",
    "language": "Dutch"
}
```
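The failure mode is visible in the file above: the second and third examples have the Dutch *names* of the criteria ("duidelijkheid", "diepte", ...) as string values where the schema demands integer scores. A small checker like the following (pure Python, not part of ragas) flags such corrupted examples in an adapted prompt:

```python
def find_bad_examples(adapted_prompt: dict) -> list[int]:
    """Return indices of few-shot examples whose output values are not
    the integer scores the prompt's JSON schema demands."""
    bad = []
    for i, example in enumerate(adapted_prompt["examples"]):
        output = example[adapted_prompt["output_key"]]
        if not all(isinstance(v, int) for v in output.values()):
            bad.append(i)
    return bad

# abbreviated version of the adapted score_context.json above
adapted = {
    "output_key": "output",
    "examples": [
        {"output": {"clarity": 3, "depth": 1, "structure": 3, "relevance": 3}},
        {"output": {"clarity": "duidelijkheid", "depth": "diepte",
                    "structure": "structuur", "relevance": "relevantie"}},
    ],
}
print(find_bad_examples(adapted))  # [1]
```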
> I currently run into `AssertionError: Adapted output keys do not match with the original output keys` during `nli_statements_message.adapt()` inside `faithfulness.adapt()`.
I'm getting the same error. Did you solve it?
Ragas version: 0.0.22
Python version: 3.11.7
Can ragas correctly evaluate results in languages other than English? It seems that fully correct answers receive quite low scores. E.g.:
Translation:
As the answer is exactly correct, I would expect answer correctness to be close to 1. However, it is only 0.48, whereas many other queries receive scores as high as 0.8 with a wrong answer. Can this be optimized? Is this an issue with non-English languages?
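One plausible explanation: answer correctness in ragas combines a factual-overlap score (from LLM-based statement decomposition against the ground truth) with a semantic-similarity score, so an exactly correct answer can still land around 0.48 if the statement decomposition misfires in a non-English language. A sketch of that weighted combination, assuming the 0.75/0.25 default weighting documented for later ragas versions (check your version's docs for the exact formula):

```python
def answer_correctness(factual_f1: float, semantic_similarity: float,
                       weights: tuple[float, float] = (0.75, 0.25)) -> float:
    """Illustrative weighted combination of factual overlap and
    embedding similarity; not the exact ragas implementation."""
    w_factual, w_semantic = weights
    return w_factual * factual_f1 + w_semantic * semantic_similarity

# A correct answer whose statements were misclassified in a
# non-English language can still score low overall:
print(answer_correctness(0.4, 0.7))   # 0.475
print(answer_correctness(1.0, 0.95))  # 0.9875
```

This is why scores in non-English evaluations tend to improve once the judge prompts are properly adapted to the target language.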