confident-ai / deepeval

The LLM Evaluation Framework
https://docs.confident-ai.com/
Apache License 2.0

Issue regarding consistent loading of source_files data (from goldens original values) #1173

Open CAW-nz opened 5 days ago

CAW-nz commented 5 days ago

I also want to report an issue regarding consistent propagation of source_files data through the file save and load functions.

synthesizer.generate_goldens_from_docs and generate_goldens_from_contexts capture source_files info in the data structure, and this is saved by synthesizer.save_as and dataset.save_as. Unfortunately, the corresponding load routines don't consume this data and bring it back in the EvaluationDataset.add_test_cases_from_json_file and add_test_cases_from_csv_file methods. There is a similar issue for the EvaluationDataset.add_goldens_from_json_file and add_goldens_from_csv_file methods, although these put the file_path into the source_files storage (which is fine if the original file didn't have the data, but otherwise I believe they should take the value from the source_files info saved in the actual file). These two methods could be enhanced to set this data based on the file contents.

In summary: the add_test_cases_from_xxx_file methods currently do not load this data even when it exists in the source files (from when the data was created as goldens), and the add_goldens_from_xxx_file methods use the file_path instead (even when the underlying files carry specific source_files info). I guess source_files was a recent addition, and while the save side got added, the load side got missed?
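To illustrate the round trip I mean, a minimal sketch (the keyword arguments and file names here are illustrative and may differ slightly by version):

from deepeval.synthesizer import Synthesizer
from deepeval.dataset import EvaluationDataset

# Generate goldens from documents - source_files is captured at this point
synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(document_paths=["my_doc.pdf"])

# Saving keeps the source_files info in the output file
synthesizer.save_as(file_type="json", directory="./synthetic_data")

# Re-loading does not bring it back: the goldens come back with source_file
# set to the JSON path (add_goldens_from_*) or not at all (add_test_cases_from_*)
dataset = EvaluationDataset()
dataset.add_goldens_from_json_file(file_path="./synthetic_data/goldens.json")  # placeholder path
print(dataset.goldens[0].source_file)  # the JSON path, not "my_doc.pdf"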


Related issue re test_case/llm_test_case.py, class LLMTestCase: LLMTestCase doesn't support source_file, so I had to comment out the line I'd added in the code below, even though the data is supported in a golden and could easily be part of the test_cases structure too (to preserve the source info).

# Here is a convert+populate actual_output routine I wrote (based on one of your
# examples), but I'm blocked from transferring the source_file data from golden to
# test_cases because it's not supported by the LLMTestCase class.
from typing import List

from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase

# my_test_llm is the LLM under test, defined elsewhere
def convert_goldens_to_test_cases(goldens: List[Golden]) -> List[LLMTestCase]:
    test_cases = []
    for golden in goldens:
        test_case = LLMTestCase(
            input=golden.input,
            actual_output=my_test_llm.generate(golden.input),  # **** Generate actual output using the 'input' ****
            expected_output=golden.expected_output,
            context=golden.context,
            retrieval_context=golden.retrieval_context,        # my_test_llm is not capable of returning this, just copy if already there from json
            additional_metadata=golden.additional_metadata,
            comments=golden.comments,
            tools_called=golden.tools_called,
            expected_tools=golden.expected_tools,
#            source_file=golden.source_file     # current version does not support this data even though it is part of golden
        )
        test_cases.append(test_case)
    return test_cases

FYI - the two supporting routines in dataset/utils.py - convert_test_cases_to_goldens() and convert_goldens_to_test_cases() - would also need to be enhanced to copy source_file between these two structures (a sketch of what I mean is below).
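For example, something along these lines (just a sketch, assuming LLMTestCase were given a source_file field; unchanged fields are elided):

# Sketch of the enhancement I mean for dataset/utils.py - copy source_file in
# both directions. Assumes LLMTestCase is extended with a source_file field.
from typing import List

from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase

def convert_goldens_to_test_cases(goldens: List[Golden]) -> List[LLMTestCase]:
    return [
        LLMTestCase(
            input=golden.input,
            actual_output=golden.actual_output,
            expected_output=golden.expected_output,
            # ... remaining fields copied as they are today ...
            source_file=golden.source_file,    # new: preserve the source info
        )
        for golden in goldens
    ]

def convert_test_cases_to_goldens(test_cases: List[LLMTestCase]) -> List[Golden]:
    return [
        Golden(
            input=tc.input,
            actual_output=tc.actual_output,
            expected_output=tc.expected_output,
            # ... remaining fields copied as they are today ...
            source_file=tc.source_file,        # new: preserve the source info
        )
        for tc in test_cases
    ]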

kritinv commented 3 days ago

Loading of source files when adding goldens to a dataset is now supported in PR: https://github.com/confident-ai/deepeval/pull/1178/files. In terms of test cases, a workaround is to store the source_file within the additional_metadata field, since most of the fields in LLMTestCase are parameters used for evaluation. Can I ask what you're trying to use the source file for when creating test cases?
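For example, the workaround could look roughly like this (a sketch only; llm stands for whatever model you're generating answers with):

# Workaround sketch: carry source_file inside additional_metadata
from typing import List

from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase

def goldens_to_test_cases(goldens: List[Golden], llm) -> List[LLMTestCase]:
    test_cases = []
    for golden in goldens:
        metadata = dict(golden.additional_metadata or {})
        metadata["source_file"] = golden.source_file  # stash it where the test case can keep it
        test_cases.append(
            LLMTestCase(
                input=golden.input,
                actual_output=llm.generate(golden.input),
                expected_output=golden.expected_output,
                additional_metadata=metadata,
            )
        )
    return test_cases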

CAW-nz commented 1 day ago

@kritinv Thanks for making the changes to the two add_goldens_from_*_file methods, but one part isn't quite right yet.

By the way - although this really belongs under Issue #1171, it's nice that you've parameterized the encoding string for the open statements, so that "utf-8" is the default but can be set to something else if necessary. Note that the two open statements in the dataset.save_as method still have no "utf-8" default - but that is for Issue #1171 to resolve.
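i.e. the pattern I mean, roughly (save_goldens here is just an illustrative stand-in, not the actual deepeval method):

# Sketch of the encoding-parameter pattern - not the actual deepeval code
import json
import os

def save_goldens(goldens: list, directory: str, encoding: str = "utf-8") -> str:
    # "utf-8" is the default but the caller can override it if necessary
    os.makedirs(directory, exist_ok=True)
    full_path = os.path.join(directory, "goldens.json")  # hypothetical file name
    with open(full_path, "w", encoding=encoding) as f:
        json.dump(goldens, f, ensure_ascii=False, indent=4)
    return full_path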


Regarding your question about the source file when creating test cases: I don't have a particular use case I'm specifically trying to address. I was really working through a few examples to get familiar with the capability of DeepEval, and it seemed strange to hold the data in a golden but not be able to retain it in a test case. I'm thinking of testing a RAG pipeline and generating test cases from my source files. Obviously I generate the goldens using DeepEval, but then I have to create the test case with my LLM's answer and want to save it in a 'ready to be evaluated' format. (Saving allows the test case to be modified if necessary - such as adding supplementary questions from an SME, plus their related answers - and, more importantly, reloaded without having to go through the full generate cycle each time.) It's odd that I can't include the source file of the question in my completed test case info that I'm then going to load and evaluate with DeepEval's metric analysis. It's not an issue when you only have one source file (as I currently have), but I can foresee an issue with lots of source files, at least when wanting to look back at the true source context for any given question/answer pair.

Yes, you're right that it could be put into the metadata (as long as that supports a custom structure - which I guess it does), but I wanted to use the standard save_as routines. Neither synthesizer.save_as nor dataset.save_as saves metadata (and both only support saving goldens, though I see that saving test cases is a TODO). So, for what I described above, rather than use metadata I'd simply change my populate-LLM-answer routine to populate the golden, and then save it directly. Now that you've added support for loading source_file data, we can cycle between save and load without loss - as long as it stays in the 'goldens' structure. But my workaround still wouldn't solve the loss of source_file if I want to save test_case info (such as metric evaluation results) after doing evaluate calls.
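i.e. something like this (a sketch; the save_as keyword arguments are from memory and the paths are placeholders):

# Workaround sketch: keep everything in the goldens structure so source_file survives
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_json_file(file_path="./synthetic_data/goldens.json")  # placeholder path

for golden in dataset.goldens:
    # my_test_llm is the model under test, defined elsewhere
    golden.actual_output = my_test_llm.generate(golden.input)

# Saving goldens (rather than test cases) now round-trips source_file
dataset.save_as(file_type="json", directory="./ready_to_evaluate")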