Closed lalehsg closed 1 month ago
This should work out of the box, actually. In #631 I have hopefully explained the details, and you can use the Bedrock docs as an example.
let me know if you want to contribute this in - we can work together on this 🙂
Hi @jjmachan! Thanks so much for getting back to me. Based on the Bedrock example, I was initially hoping that by passing the Databricks model to the `evaluate` function I could make it work, but it doesn't at the moment. I think you also mentioned that it should work, and that Ragas takes the model and wraps it in `LangchainLLMWrapper`. This is what I have tried:
```python
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    AnswerSimilarity,
    AnswerCorrectness,
    context_recall,
    context_precision,
)

answer_similarity = AnswerSimilarity()
answer_correctness = AnswerCorrectness()

metrics = [
    faithfulness,
    answer_relevancy,
    answer_similarity,
    answer_correctness,
    context_recall,
    context_precision,
]
```
```python
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings

fast_embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-base-en")
```
I created `ds` from a dictionary with the four required fields `"question"`, `"ground_truth"`, `"answer"`, and `"contexts"`, and:

```python
from datasets import DatasetDict  # import added for completeness

fiqa_eval = DatasetDict({"baseline": ds})
```
```python
from langchain_community.chat_models import ChatDatabricks
from ragas import evaluate  # import added for completeness

chat_model = ChatDatabricks(endpoint="databricks-llama-2-70b-chat")
chat_model.invoke("How are you?")

result = evaluate(
    fiqa_eval["baseline"],
    metrics=metrics,
    embeddings=fast_embeddings,
    llm=chat_model,
)
```
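In case `evaluate` is not wrapping the model automatically, one thing worth trying (a sketch, assuming `ragas.llms.LangchainLLMWrapper` is importable as in recent Ragas versions, and reusing the `chat_model`, `metrics`, `fast_embeddings`, and `fiqa_eval` defined above) is wrapping the LangChain chat model explicitly before passing it in:

```python
# Sketch: wrap the LangChain chat model explicitly instead of relying on
# evaluate() to do it internally. Assumes ragas exposes LangchainLLMWrapper
# under ragas.llms, as in the Bedrock docs.
from ragas.llms import LangchainLLMWrapper

wrapped_llm = LangchainLLMWrapper(chat_model)

result = evaluate(
    fiqa_eval["baseline"],
    metrics=metrics,
    embeddings=fast_embeddings,
    llm=wrapped_llm,  # explicitly wrapped judge LLM
)
```

This at least rules out the automatic wrapping step as the source of the failure.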
I'm getting this error:
```
Exception in thread Thread-9:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30077124-4178-4736-9be6-0ce81669262c/lib/python3.10/site-packages/ragas/executor.py", line 75, in run
    results = self.loop.run_until_complete(self._aresults())
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30077124-4178-4736-9be6-0ce81669262c/lib/python3.10/site-packages/ragas/executor.py", line 63, in _aresults
    raise e
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30077124-4178-4736-9be6-0ce81669262c/lib/python3.10/site-packages/ragas/executor.py", line 58, in _aresults
    r = await future
  File "/usr/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30077124-4178-4736-9be6-0ce81669262c/lib/python3.10/site-packages/ragas/executor.py", line 91, in wrapped_callable_async
    return counter, await callable(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30077124-4178-4736-9be6-0ce81669262c/lib/python3.10/site-packages/ragas/metrics/base.py", line 91, in ascore
    raise e
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30077124-4178-4736-9be6-0ce81669262c/lib/python3.10/site-packages/ragas/metrics/base.py", line 87, in ascore
    score = await self._ascore(row=row, callbacks=group_cm, is_async=is_async)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30077124-4178-4736-9be6-0ce81669262c/lib/python3.10/site-packages/ragas/metrics/_faithfulness.py", line 190, in _ascore
    assert isinstance(statements, dict), "Invalid JSON response"
AssertionError: Invalid JSON response
```
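For context on the assertion above: the faithfulness metric asks the judge LLM to answer in JSON and then parses the reply, and chat models such as Llama 2 often wrap the JSON in extra prose, which makes a strict parse fail. A hypothetical stdlib-only sketch (this is not Ragas code) of a more tolerant parser that recovers the first JSON object from a noisy reply:

```python
import json

def extract_json_object(reply: str):
    """Recover a JSON object from an LLM reply that may contain
    surrounding prose (hypothetical helper, not ragas code)."""
    # Fast path: the whole reply is already valid JSON.
    try:
        parsed = json.loads(reply)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass
    # Fallback: scan for the first balanced {...} span and parse it.
    start = reply.find("{")
    while start != -1:
        depth = 0
        for i, ch in enumerate(reply[start:], start):
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        candidate = json.loads(reply[start:i + 1])
                        if isinstance(candidate, dict):
                            return candidate
                    except json.JSONDecodeError:
                        pass
                    break
        start = reply.find("{", start + 1)
    return None

noisy = 'Sure! Here is the JSON:\n{"statements": ["s1", "s2"]}\nHope that helps.'
print(extract_json_object(noisy))  # {'statements': ['s1', 's2']}
```

A quick way to confirm this theory in the thread's setup is to send the faithfulness prompt to the model directly and inspect whether the raw reply is bare JSON or JSON wrapped in prose.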
I would appreciate your help with troubleshooting this. I am more than happy to contribute, by the way, but I'll first need to make sure I have a working solution.
P.S.: The main difference between my code and the Bedrock example is the Config params, which I don't think are necessary since I am working in the Databricks environment and the model responds when I invoke it, as you can see in the code above.
Do you want to get on a call to help debug this? I suspect this could be either the context window size or the model failing to return valid JSON.
@shahules786 is there anything we can do to help unblock him fast?
Also @shahules786, do you think a smaller model could do a better job of ensuring JSON format? If it can, @lalehsg, is that something you could use inside your company/setup if we made it available?
Hi @jjmachan.
Regarding the context window size: Let me try to shorten my text then to see if it makes a difference.
Regarding your comment about a smaller model: did you mean a smaller LLM judge?
Also, I am available for a call, if that works better.
Thanks!
Update: I tried with much shorter text in the dictionary, and the same error occurred.
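To test the context-window hypothesis more systematically than by shortening text by hand, one option is to cap each context programmatically before building the dataset. A hypothetical stdlib-only sketch, using a rough word-count budget rather than real model tokenization:

```python
def truncate_contexts(contexts, max_words=200):
    """Crudely cap each context string at max_words words.
    A real implementation would count model tokens, not words
    (hypothetical helper for testing the context-window theory)."""
    truncated = []
    for ctx in contexts:
        words = ctx.split()
        if len(words) > max_words:
            ctx = " ".join(words[:max_words])
        truncated.append(ctx)
    return truncated

contexts = ["short context", " ".join(["word"] * 500)]
capped = truncate_contexts(contexts, max_words=100)
print([len(c.split()) for c in capped])  # [2, 100]
```

Sweeping `max_words` down from the original length would show whether the error disappears below some threshold, which would point at the context window rather than JSON formatting.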
Hi @jjmachan and @shahules786, I have been trying to troubleshoot this a bit. Of the three metrics that prompt the LLM judge (context_relevancy, answer_relevancy, and faithfulness), only faithfulness is causing the above issue; as soon as I comment it out, I get results. I have not yet compared the code in detail to find what causes the difference, but I wanted to let you know ASAP. I am going to unblock my team for now with the current set of metrics, but I will definitely need the faithfulness metric and would appreciate your help with it.
Thanks!
Describe the Feature Hi. I have to work with Databricks foundation models and was wondering whether it would be possible to add them to the list of supported LLMs. The Amazon Bedrock and Vertex AI examples are very similar to the Databricks one. I am not fluent enough in your code base to do this myself and raise a PR for review. I would appreciate your help.
Why is the feature important for you? I am working on a Databricks cluster with no internet connection, so configuring other LLM judges with Ragas wouldn't be possible for me.
Additional context Here's the code snippet for the Databricks model integrated with LangChain: