explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

Adding Databricks to the LLM customization list #610

Closed lalehsg closed 1 month ago

lalehsg commented 5 months ago

Describe the Feature
Hi. I have to work with Databricks foundation models and was wondering if it would be possible to add them to the list of supported LLMs. The Amazon Bedrock and Vertex AI examples are very similar to the Databricks one. I am not fluent enough in your code base to do this myself and raise a PR for review. Would appreciate your help.

Why is the feature important for you?
I am working on a Databricks cluster with no internet connection, so configuring other LLM judges with RAGAS wouldn't be possible for me.

Additional context
Here's the code snippet for the Databricks model integrated with LangChain:

from langchain_community.chat_models import ChatDatabricks

chat_model = ChatDatabricks(endpoint="databricks-llama-2-70b-chat")
chat_model.invoke("How are you?")
jjmachan commented 5 months ago

This should work out of the box actually. With #631 I have hopefully explained the details, and you can use the Bedrock docs as an example.

let me know if you want to contribute this in - we can work together on this 🙂
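
Roughly, the Bedrock-style setup adapted to Databricks would look like the sketch below (eval_dataset is a placeholder for whatever Dataset is being scored; the explicit LangchainLLMWrapper step is optional, since evaluate wraps plain LangChain models itself, and embedding-based metrics would also need an embeddings= argument):

from langchain_community.chat_models import ChatDatabricks

from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import faithfulness

# Databricks foundation-model endpoint served from inside the workspace
chat_model = ChatDatabricks(endpoint="databricks-llama-2-70b-chat")

# evaluate() can take the LangChain model directly and wrap it itself,
# but wrapping explicitly mirrors the Bedrock example
databricks_llm = LangchainLLMWrapper(chat_model)

result = evaluate(
    eval_dataset,  # placeholder: a datasets.Dataset with question / contexts / answer / ground_truth
    metrics=[faithfulness],
    llm=databricks_llm,
)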

lalehsg commented 5 months ago

Hi @jjmachan! Thanks so much for getting back to me. Based on the Bedrock example, I was initially hoping that passing the Databricks model to the evaluate function would make it work, but it doesn't at the moment. I think you also mentioned that it should work and that Ragas takes the model and wraps it in the LangchainLLMWrapper. This is what I have tried:


from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    AnswerSimilarity,
    AnswerCorrectness,
    context_recall,
    context_precision,
)

answer_similarity = AnswerSimilarity()
answer_correctness = AnswerCorrectness()

metrics = [
    faithfulness,
    answer_relevancy,
    answer_similarity,
    answer_correctness,
    context_recall,
    context_precision,
]

from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
fast_embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-base-en")

I created "ds" from a dictionary with the 4 required fields "question", "ground_truth", "answer", and "contexts" and:

from datasets import DatasetDict
from ragas import evaluate
from langchain_community.chat_models import ChatDatabricks

fiqa_eval = DatasetDict({"baseline": ds})

chat_model = ChatDatabricks(endpoint="databricks-llama-2-70b-chat")
chat_model.invoke("How are you?")  # sanity check: the endpoint responds

result = evaluate(fiqa_eval["baseline"], metrics=metrics, embeddings=fast_embeddings, llm=chat_model)
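
For reference, a minimal sketch of how such a ds can be built with Dataset.from_dict (placeholder values, not the actual data):

from datasets import Dataset

# Placeholder rows; the real evaluation data has the same four columns
ds = Dataset.from_dict({
    "question": ["What does the company do?"],
    "answer": ["It builds data pipelines."],
    "contexts": [["The company builds data pipelines for retail clients."]],
    "ground_truth": ["It builds data pipelines for retail clients."],
})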

I'm getting this error:

Exception in thread Thread-9:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30077124-4178-4736-9be6-0ce81669262c/lib/python3.10/site-packages/ragas/executor.py", line 75, in run
    results = self.loop.run_until_complete(self._aresults())
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30077124-4178-4736-9be6-0ce81669262c/lib/python3.10/site-packages/ragas/executor.py", line 63, in _aresults
    raise e
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30077124-4178-4736-9be6-0ce81669262c/lib/python3.10/site-packages/ragas/executor.py", line 58, in _aresults
    r = await future
  File "/usr/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30077124-4178-4736-9be6-0ce81669262c/lib/python3.10/site-packages/ragas/executor.py", line 91, in wrapped_callable_async
    return counter, await callable(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30077124-4178-4736-9be6-0ce81669262c/lib/python3.10/site-packages/ragas/metrics/base.py", line 91, in ascore
    raise e
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30077124-4178-4736-9be6-0ce81669262c/lib/python3.10/site-packages/ragas/metrics/base.py", line 87, in ascore
    score = await self._ascore(row=row, callbacks=group_cm, is_async=is_async)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-30077124-4178-4736-9be6-0ce81669262c/lib/python3.10/site-packages/ragas/metrics/_faithfulness.py", line 190, in _ascore
    assert isinstance(statements, dict), "Invalid JSON response"
AssertionError: Invalid JSON response

I would appreciate your help with troubleshooting this. I am more than happy to contribute, by the way, but I'll first need to make sure I have a working solution.

P.S.: The main difference between my code and the Bedrock example is the Config params, which I don't think are necessary since I am working in the Databricks environment and the model responds when I invoke it, as you can see in the code above.

jjmachan commented 5 months ago

do you want to get on a call to help debug this? I suspect this could be either

  1. context window overflow
  2. model is not capable of JSON outputs (a quick way to check this is sketched at the end of this comment)

@shahules786 is there anything we can do to help unblock him fast?

Also @shahules786, do you think a smaller model could do a better job of ensuring JSON format? If it can, @lalehsg, is that something you could use inside your company/setup if we made it available?
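
A quick way to check the second point outside of ragas would be something like this (same endpoint as in the snippets above; the prompt is only illustrative):

import json

from langchain_community.chat_models import ChatDatabricks

chat_model = ChatDatabricks(endpoint="databricks-llama-2-70b-chat")

# Ask for a strictly-JSON answer and see whether the raw text parses.
# The faithfulness prompt is more involved, but if the model can't manage
# even this, the "Invalid JSON response" assertion is expected.
response = chat_model.invoke(
    'Return only a JSON object of the form {"statements": ["..."]} listing '
    "the factual statements in: The sky is blue and grass is green."
)
try:
    print("parsed OK:", json.loads(response.content))
except json.JSONDecodeError as err:
    print("model did not return valid JSON:", err)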

lalehsg commented 5 months ago

Hi @jjmachan.

Regarding the context window size: let me try shortening my text to see if it makes a difference.

Regarding your comment about a smaller model: did you mean a smaller LLM judge?

Also, I am available for a call, if that works better.

Thanks!

lalehsg commented 5 months ago

Update: I tried with much shorter text in the dictionary and the same error happened.

lalehsg commented 5 months ago

Hi @jjmachan and @shahules786, I have been trying to troubleshoot this a bit. I'd like to let you know that of the 3 metrics (context_relevancy, answer_relevancy, and faithfulness) that have to prompt the LLM judge, only "faithfulness" is causing the above issue. As soon as I comment faithfulness out, I can get results (see the sketch below). I have not looked into the code in detail yet to compare and find what is causing the difference, but I wanted to let you know ASAP. I am going to unblock my team for now with the current set of metrics, but we will definitely need the faithfulness metric, and I would appreciate your help with that.
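
For reference, this is roughly the metric set and call I'm running in the meantime, reusing the dataset, embeddings, and chat model from my earlier comment:

from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_recall,
    context_precision,
    AnswerSimilarity,
    AnswerCorrectness,
)

answer_similarity = AnswerSimilarity()
answer_correctness = AnswerCorrectness()

# faithfulness is dropped until the "Invalid JSON response" issue is resolved
metrics = [
    answer_relevancy,
    answer_similarity,
    answer_correctness,
    context_recall,
    context_precision,
]

result = evaluate(
    fiqa_eval["baseline"],  # same dataset, embeddings, and judge model as in my earlier comment
    metrics=metrics,
    embeddings=fast_embeddings,
    llm=chat_model,
)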

Thanks!