As part of our AI model validation initiative, I developed a pipeline to automate the daily testing and validation of large language models (LLMs), specifically Claude and GPT. The goal of this project was to evaluate the performance of prompt executions and ensure that results meet the predefined accuracy thresholds for each model. The pipeline is integrated into a workflow that runs automatically at 00:00 every day, with test results stored as CSV artifacts for further analysis.
Prompt Validation: The pipeline is designed to test the performance of prompts against both the Claude and GPT models. Each prompt is executed and evaluated for correctness and relevance, so that the results of both models can be compared against the same criteria.
CSV Artifact Generation: After each test run, the results are saved in CSV files, which are stored as artifacts of the workflow. This allows for easy access, monitoring, and future reference of test outcomes.
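A minimal sketch of how one test run could be serialized to a CSV artifact; the column names and the structure of `results` are assumptions for illustration, not the pipeline's actual schema.

```python
import csv
from datetime import date

def write_results_csv(results, path=None):
    """Write one test run's results to a CSV file so the workflow can
    upload it as an artifact. `results` is assumed to be a list of dicts
    with a model name, prompt id, expected/actual answers, and a pass flag."""
    path = path or f"results-{date.today().isoformat()}.csv"
    fieldnames = ["model", "prompt_id", "expected", "actual", "passed"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(results)
    return path
```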
Precision Calculation: The precision metric for each model is calculated by re-running tests multiple times. This is controlled by the RETRY_COUNT constant, which determines how many times each test is re-executed to ensure stable results. Precision values are derived from the repeated test runs, giving a reliable measurement of each model's performance.
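A sketch of the retry-based precision measurement described above; RETRY_COUNT comes from the description (its value here is assumed), while `run_prompt` and `is_correct` are hypothetical stand-ins for the pipeline's actual execution and evaluation functions.

```python
RETRY_COUNT = 5  # number of times each test is re-executed (value assumed)

def measure_precision(prompts, run_prompt, is_correct):
    """Re-run every prompt RETRY_COUNT times and return the fraction of
    correct completions, i.e. a stability-aware precision estimate."""
    correct = 0
    total = 0
    for prompt in prompts:
        for _ in range(RETRY_COUNT):
            answer = run_prompt(prompt)              # call the model under test
            correct += int(is_correct(prompt, answer))
            total += 1
    return correct / total if total else 0.0
```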
Accuracy Thresholds: Each model's results are checked against its predefined accuracy threshold, so runs in which a model falls below the expected level can be identified.
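To make the gating step concrete, a hedged sketch of a per-model threshold check; the threshold values are placeholders, not the project's configured ones.

```python
# Placeholder thresholds; the real values live in the pipeline's configuration.
ACCURACY_THRESHOLDS = {"claude": 0.90, "gpt": 0.90}

def models_below_threshold(scores):
    """Return the models whose measured precision falls below their threshold."""
    return {
        model: score
        for model, score in scores.items()
        if score < ACCURACY_THRESHOLDS.get(model, 1.0)
    }
```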
Automated Daily Testing: The pipeline is scheduled to execute every day at 00:00. This ensures continuous monitoring of model performance, allowing us to identify potential degradations in quality and to validate updates or changes in the models' behavior over time.
We need to validate model performance on prompts regularly (e.g., every week). As input, we should provide a config (with prompts and models) and a labeled dataset in CSV format. For validation, we will need to implement metrics such as precision or F1 score.
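For the metrics themselves, scikit-learn already provides precision and F1; a small sketch, assuming the labeled CSV yields parallel lists of expected and predicted labels.

```python
from sklearn.metrics import precision_score, f1_score

def score_run(expected, predicted):
    """Compute precision and F1 for one validation run.
    `expected` comes from the labeled CSV, `predicted` from the model."""
    return {
        "precision": precision_score(expected, predicted, average="macro"),
        "f1": f1_score(expected, predicted, average="macro"),
    }
```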
As a result of this task, a GitHub Actions (GA) workflow should be set up to run the validation automatically every week or on a manual trigger.
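A minimal GitHub Actions workflow sketch for the requested setup; `validate.py`, `config.yaml`, and the data path are assumed names rather than the repository's actual files, while the weekly schedule, manual trigger, and artifact upload mirror what is described above (the daily run described earlier would use `cron: "0 0 * * *"` instead).

```yaml
name: llm-validation
on:
  schedule:
    - cron: "0 0 * * 0"    # weekly, Sunday at 00:00 UTC
  workflow_dispatch: {}     # manual trigger
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      # validate.py and config.yaml are assumed names, not the repo's actual files
      - run: python validate.py --config config.yaml --labels data/labeled.csv --out results.csv
      - uses: actions/upload-artifact@v4
        with:
          name: validation-results
          path: results.csv
```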