JohnSnowLabs / langtest

Deliver safe & effective language models
http://langtest.org/
Apache License 2.0

Fix/implement the multiple dataset support for accuracy tests #998

Closed chakravarthik27 closed 5 months ago

chakravarthik27 commented 5 months ago

Description

This pull request makes major enhancements to the langtest library, most notably by extending the Harness class to support testing across multiple datasets. As the demand for rigorous model evaluation grows, it is critical to ensure that models perform fairly and accurately across a variety of datasets. By expanding the Harness class to accommodate multiple datasets, this upgrade streamlines benchmarking and model comparison.

Key Features and Improvements:

Multi-Dataset Support: With the Harness class, users can now seamlessly integrate and test their models across several datasets. This feature enables comprehensive evaluation, ensuring that models meet high performance and fairness standards across multiple data sources.

Enhanced Fairness Testing: The Harness class now evaluates models across several datasets to identify and mitigate biases. This helps ensure that models perform consistently across varied demographics, furthering the goal of ethical and fair AI (see the configuration sketch after this list).

Improved Accuracy Testing: Multi-dataset support extends accuracy testing in the Harness class, enabling thorough performance evaluation across multiple datasets. This helps identify a model's strengths and weaknesses, supporting its reliability and effectiveness in real-world applications.
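For context, fairness tests follow the same configuration pattern as the accuracy tests configured below: once a harness is instantiated, a fairness section can be passed to harness.configure(). A minimal sketch, assuming the gender-based ROUGE fairness tests that langtest documents for question-answering tasks (verify the exact test names for your task and library version):

harness.configure(
    {
        "tests": {
            "defaults": {"min_pass_rate": 0.65},
            # Checks that ROUGE-1 scores stay within bounds for each gender group
            "fairness": {
                "min_gender_rouge1_score": {"min_score": 0.60},
                "max_gender_rouge1_score": {"max_score": 1.00},
            },
        }
    }
)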

Let's get started:

Instantiate the Harness class

from langtest import Harness

# Test one model against three QA benchmarks in a single harness
harness = Harness(
    task="question-answering",
    model={"model": "gpt-3.5-turbo-instruct", "hub": "openai"},
    data=[
        {"data_source": "NQ-open", "split": "test-tiny"},
        {"data_source": "MedQA", "split": "test-tiny"},
        {"data_source": "LogiQA", "split": "test-tiny"},
    ],
)
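Each entry in the data list uses the same schema as the single-dataset data argument, so existing configurations carry over directly. For comparison, the equivalent single-dataset call (same fields, a dict instead of a list):

# Single-dataset form: pass one dict instead of a list of dicts
harness_single = Harness(
    task="question-answering",
    model={"model": "gpt-3.5-turbo-instruct", "hub": "openai"},
    data={"data_source": "NQ-open", "split": "test-tiny"},
)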

Configure the accuracy tests in the Harness class

harness.configure(
    {
        "tests": {
            # Each test must reach this pass rate to be reported as passing
            "defaults": {"min_pass_rate": 0.65},

            # Accuracy tests, applied to every configured dataset
            "accuracy": {
                "llm_eval": {"min_score": 0.60},
                "min_rouge1_score": {"min_score": 0.60},
                "min_rouge2_score": {"min_score": 0.60},
                "min_rougeL_score": {"min_score": 0.60},
                "min_rougeLsum_score": {"min_score": 0.60},
            },
        }
    }
)

harness.generate() generates test cases, .run() executes them against the model, and .report() compiles the results:

harness.generate().run().report()
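The chained call is shorthand for running the three steps separately, which can be useful when inspecting intermediate results. A minimal sketch (the save directory name below is arbitrary):

harness.generate()            # build test cases for every configured dataset
harness.run()                 # execute the test cases against the model
report_df = harness.report()  # compile pass/fail results into a report

# Persist the configuration and generated test cases for later reuse
harness.save("saved_harness")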
