As part of our AI model validation initiative, I developed a pipeline to automate the daily testing and validation of large language models (LLMs), specifically Claude and GPT. The goal of this project was to evaluate the performance of prompt executions and ensure that results meet the predefined accuracy thresholds for each model. The pipeline is integrated into a workflow that runs automatically at 00:00 every day, with test results stored as CSV artifacts for further analysis.
Prompt Validation: The pipeline is designed to test the performance of prompts against both the Claude and GPT models. Each prompt is executed and evaluated for correctness and relevance, so that the results of both models can be compared against the same criteria.
CSV Artifact Generation: After each test run, the results are saved in CSV files, which are stored as artifacts of the workflow. This allows for easy access, monitoring, and future reference of test outcomes.
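A minimal sketch of how one test run could be serialized to a CSV artifact; the column names and the structure of `results` are assumptions for illustration, not the pipeline's actual schema.

```python
import csv
from datetime import date

def write_results_csv(results, path=None):
    """Write one test run's results to a CSV file so the workflow can
    upload it as an artifact. `results` is assumed to be a list of dicts
    with a model name, prompt id, expected/actual answers, and a pass flag."""
    path = path or f"results-{date.today().isoformat()}.csv"
    fieldnames = ["model", "prompt_id", "expected", "actual", "passed"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(results)
    return path
```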
Precision Calculation: The precision metric for each model is calculated by re-running tests multiple times. This is controlled by the RETRY_COUNT constant, which determines how many times each test is re-executed to ensure stable results. Precision values are derived from the repeated test runs, giving a reliable measurement of each model's performance.
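A sketch of the retry-based precision measurement described above; RETRY_COUNT comes from the description (its value here is assumed), while `run_prompt` and `is_correct` are hypothetical stand-ins for the pipeline's actual execution and evaluation functions.

```python
RETRY_COUNT = 5  # number of times each test is re-executed (value assumed)

def measure_precision(prompts, run_prompt, is_correct):
    """Re-run every prompt RETRY_COUNT times and return the fraction of
    correct completions, i.e. a stability-aware precision estimate."""
    correct = 0
    total = 0
    for prompt in prompts:
        for _ in range(RETRY_COUNT):
            answer = run_prompt(prompt)              # call the model under test
            correct += int(is_correct(prompt, answer))
            total += 1
    return correct / total if total else 0.0
```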
Accuracy Thresholds: Each model's results are checked against its predefined accuracy threshold, so runs in which a model falls below the expected level can be identified.
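To make the gating step concrete, a hedged sketch of a per-model threshold check; the threshold values are placeholders, not the project's configured ones.

```python
# Placeholder thresholds; the real values live in the pipeline's configuration.
ACCURACY_THRESHOLDS = {"claude": 0.90, "gpt": 0.90}

def models_below_threshold(scores):
    """Return the models whose measured precision falls below their threshold."""
    return {
        model: score
        for model, score in scores.items()
        if score < ACCURACY_THRESHOLDS.get(model, 1.0)
    }
```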
Automated Daily Testing: The pipeline is scheduled to execute every day at 00:00. This ensures continuous monitoring of model performance, allowing us to identify potential degradations in quality and to validate updates or changes in the models' behavior over time.
We need to validate model performance on prompts regularly (e.g., every week). As input, we should provide a config (with prompts and models) and a labeled dataset in CSV format. For validation, we will need to implement metrics such as precision or F1 score.
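For the metrics themselves, scikit-learn already provides precision and F1; a small sketch, assuming the labeled CSV yields parallel lists of expected and predicted labels.

```python
from sklearn.metrics import precision_score, f1_score

def score_run(expected, predicted):
    """Compute precision and F1 for one validation run.
    `expected` comes from the labeled CSV, `predicted` from the model."""
    return {
        "precision": precision_score(expected, predicted, average="macro"),
        "f1": f1_score(expected, predicted, average="macro"),
    }
```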
As a result of this task, a GitHub Actions (GA) workflow should be set up to run the validation automatically every week or on a manual trigger.
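A minimal GitHub Actions workflow sketch for the requested setup; `validate.py`, `config.yaml`, and the data path are assumed names rather than the repository's actual files, while the weekly schedule, manual trigger, and artifact upload mirror what is described above (the daily run described earlier would use `cron: "0 0 * * *"` instead).

```yaml
name: llm-validation
on:
  schedule:
    - cron: "0 0 * * 0"    # weekly, Sunday at 00:00 UTC
  workflow_dispatch: {}     # manual trigger
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      # validate.py and config.yaml are assumed names, not the repo's actual files
      - run: python validate.py --config config.yaml --labels data/labeled.csv --out results.csv
      - uses: actions/upload-artifact@v4
        with:
          name: validation-results
          path: results.csv
```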