piseabhijeet opened this issue 8 months ago
Hey @piseabhijeet, if you're using it with LLMTestCase, then it seems like the retrieval context is missing for some of the test cases in your evaluation dataset. Can you try with just one test case in your evaluation dataset and slowly work your way up to debug this issue?
Hi @penguine-ip,
Actually, I tried running the same thing for two metrics, correctness and hallucination, and it works perfectly, but when I try to include the other metrics like faithfulness and contextual similarity, it fails. This is weird: if it were a missing-value issue, those two metrics wouldn't have run successfully.
Oh I see, correctness and hallucination do not require retrieval_context, but the other metrics you mentioned do. I assume you only provided context but not retrieval_context. Are you building something RAG-based? If you are, don't use hallucination and switch to faithfulness altogether.
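For example, something along these lines should work for a RAG setup (a rough sketch assuming the evaluate(test_cases, metrics) entry point; the field values are hypothetical placeholders, swap in your own data):

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric

# For RAG, retrieval_context should hold the chunks your retriever actually
# returned for this query; faithfulness is judged against these chunks.
test_case = LLMTestCase(
    input="What is the refund window?",                        # hypothetical query
    actual_output="You can request a refund within 30 days.",  # hypothetical LLM answer
    retrieval_context=["Refunds are accepted within 30 days of purchase."],
)

# FaithfulnessMetric only needs input, actual_output, and retrieval_context.
evaluate([test_case], [FaithfulnessMetric(threshold=0.5)])
```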
I see, got your point! Thanks for the quick response.
Also, if a dataset is created inside a module and the metrics are passed as a list, do we explicitly specify the number of workers? If yes, where do I set it?
I see '-n' as a parameter on the command line, but I was wondering whether a similar parameter exists for the dataset.evaluate module.
Where are evaluation results stored? I see a temporary JSON file in the same folder, but it's not getting updated when running bulk eval.
@piseabhijeet You should be able to store results locally by following this instruction: https://docs.confident-ai.com/docs/getting-started#create-your-first-test-case
Let me know if it doesn't work. There's always the option to log in to Confident AI too with this one-liner in the CLI: deepeval login
You can also do it in Python code: https://docs.confident-ai.com/docs/confident-ai-introduction (scroll to the bottom)
Not yet. Right now we execute each test case individually, and then for each test case we execute each metric individually (so like a double for loop).
We're going to be releasing a big update tomorrow that allows you to run metrics concurrently on a test case (so a single for loop).
You don't have to specify workers for this; we'll handle everything under the hood with coroutines. If you are using a custom model (e.g., https://docs.confident-ai.com/docs/metrics-introduction#using-a-custom-llm), let me know, because I'll need to help you support async execution.
The -n flag in deepeval test run allows you to run multiple test cases at once using multiple processes in parallel. We're not supporting multiprocessing for the evaluate function for now.
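For reference, async support for a custom model roughly looks like this (a minimal sketch against the DeepEvalBaseLLM interface; the wrapper class name and underlying chat model are placeholders):

```python
from deepeval.models import DeepEvalBaseLLM

class MyAsyncCustomLLM(DeepEvalBaseLLM):  # hypothetical wrapper name
    def __init__(self, chat_model):
        # chat_model is assumed to expose a sync invoke() and an async
        # ainvoke(), e.g. a LangChain chat model.
        self.chat_model = chat_model

    def load_model(self):
        return self.chat_model

    def generate(self, prompt: str) -> str:
        return self.load_model().invoke(prompt).content

    # Implementing a_generate is what lets deepeval schedule metric calls
    # concurrently with coroutines instead of falling back to blocking calls.
    async def a_generate(self, prompt: str) -> str:
        res = await self.load_model().ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "My async custom LLM"
```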
@piseabhijeet also come join our Discord; there will be major updates over the weekend, and any breaking changes will be announced there: https://discord.com/invite/a3K9c8GRGt
thanks for the prompt response @penguine-ip !
Hey @piseabhijeet, updates are on Discord, but in short, the evaluate function now runs all metrics concurrently for each test case: https://docs.confident-ai.com/docs/evaluation-test-cases#evaluate-test-cases-in-bulk
It's a big change, so please let me know if you catch any bugs!
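Roughly something like this (a minimal sketch assuming evaluate takes a list of test cases and a list of metrics; swap in your own test cases and metrics):

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

test_cases = [
    LLMTestCase(
        input="How do I reset my password?",              # hypothetical
        actual_output="Use the 'Forgot password' link.",  # hypothetical
        retrieval_context=["Passwords can be reset via the 'Forgot password' link."],
    ),
]

# All metrics for a given test case now run concurrently via coroutines;
# no worker count needs to be specified.
evaluate(test_cases, [AnswerRelevancyMetric(threshold=0.5), FaithfulnessMetric(threshold=0.5)])
```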
This is amazing! Looking forward to trying it this week!
@penguine-ip I was trying to run it on my dataset with around 1,300 query-response pairs. I could see more than 1,300 API requests to the custom LLM. Can you explain this?
The code, FYI:
```python
import openai
import pandas as pd
import pytest

from azure.identity import AzureCliCredential
from langchain_openai import AzureChatOpenAI  # or langchain.chat_models, depending on your LangChain version

import deepeval
from deepeval import assert_test
from deepeval.models import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric, ContextualRelevancyMetric

file_path = "query-response-context.json"


class AzureManager:
    """Builds an Azure OpenAI chat model from a config dict and wraps it for deepeval."""

    def __init__(self, context):
        self.context = context
        self.openai_api_version = self.context['OPENAI_API_VERSION']
        self.azure_deployment = self.context['DEPLOYMENT_NAME']
        # self.azure_endpoint = self.context['azure_endpoint']
        self.model_name = self.context['MODEL_NAME']
        self.openai_api_type = self.context['OPENAI_API_TYPE']
        self.tenant_id = self.context['TENANT_ID']
        self.openai_scope = self.context['OPEN_AI_SCOPE']
        self.openai_api_base = self.context['OPENAI_API_BASE']

    def get_model_for_evaluation(self):
        # Authenticate via the Azure CLI credential and use the token as the API key.
        openai.api_type = self.openai_api_type
        token_credential = AzureCliCredential(tenant_id=self.tenant_id)
        token = token_credential.get_token(self.openai_scope)
        openai.api_key = token.token
        custom_model = AzureChatOpenAI(
            deployment_name=self.azure_deployment,
            model_name=self.model_name,
            openai_api_key=openai.api_key,
            azure_endpoint=self.openai_api_base,
            openai_api_version=self.openai_api_version,
        )
        return AzureOpenAI(model=custom_model)


class AzureOpenAI(DeepEvalBaseLLM):
    """Custom deepeval model wrapping a LangChain AzureChatOpenAI instance."""

    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    # def _call(self, prompt: str) -> str:
    #     chat_model = self.load_model()
    #     return chat_model.invoke(prompt).content

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Custom Azure OpenAI Model"


def create_dataset_for_llm_metrics(df):
    """Turn each dataframe row into an LLMTestCase and bundle them into an EvaluationDataset."""

    def create_llm_test_case(row):
        # Normalise the context column so it is always a non-empty list of strings.
        if not isinstance(row['context'], list):
            if row['context'] == "\n" or row['context'] is None:
                row['context'] = ["Not available"]
            else:
                row['context'] = [str(row['context'])]
        return LLMTestCase(
            input=row['jsonPayload.UserQuery'],
            actual_output=row['jsonPayload.Response'],
            expected_output="Not Applicable",
            retrieval_context=row['context'],
            context=row['context'],
        )

    df['test_cases'] = df.apply(create_llm_test_case, axis=1)
    test_cases_list = df['test_cases'].values.tolist()
    return EvaluationDataset(test_cases=test_cases_list)


df = pd.read_json(file_path)
dataset = create_dataset_for_llm_metrics(df)
# `params` (not shown here) is the dict of Azure settings consumed by AzureManager.
azure_openai = AzureManager(params).get_model_for_evaluation()


@pytest.mark.parametrize(
    "test_case",
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    # faith_metric = FaithfulnessMetric(threshold=0.3, model=azure_openai)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5, model=azure_openai)
    context_relevance_metric = ContextualRelevancyMetric(threshold=0.7, model=azure_openai)
    assert_test(test_case, [answer_relevancy_metric])


@deepeval.on_test_run_end
def function_to_be_called_after_test_run():
    print("Test finished!")
```
Also, I notice that some of the test cases are missing in the .json file output, and the code never stops executing. Last terminal output: (screenshot: Screenshot.2024-03-10.at.15.44.22.png)
Hey! I've limited web access right now, but I have Discord available. Do you mind copying and pasting it there so I can help out ASAP? (I can't view the code and the photo.)
When I create an evaluation dataset using EvaluationDataset and use it with LLMTestCase, I get this:
I am sure my dataset has all the input, actual output, and retrieval context values.