JohnSnowLabs / langtest

Deliver safe & effective language models
http://langtest.org/
Apache License 2.0

Fix/implement the multiple dataset support for accuracy tests #998

Closed chakravarthik27 closed 5 months ago

chakravarthik27 commented 5 months ago

Description

This pull request makes major enhancements to the langtest library, most notably by extending the Harness class to support testing across multiple datasets. As the demand for rigorous model evaluation grows, it is critical to ensure that models perform fairly and accurately across a variety of datasets. By expanding the Harness class to accommodate multiple datasets, this upgrade streamlines benchmarking and model comparison.

Key Features and Improvements:

Multi-Dataset Support: With the Harness class, users can now seamlessly integrate and test their models across several datasets. This feature enables comprehensive evaluation, ensuring that models meet high performance and fairness standards across multiple data sources.

Enhanced Fairness Testing: The Harness class now evaluates models across several datasets to identify and mitigate biases. This helps ensure that models perform consistently across varied demographics, furthering the goal of ethical and fair AI (see the configuration sketch after this list).

Improved Accuracy Testing: Multi-dataset support extends accuracy testing in the Harness class, enabling thorough performance evaluation across multiple datasets. This helps identify a model's strengths and weaknesses, supporting its reliability and effectiveness in real-world applications.
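For context, fairness tests follow the same configuration pattern as the accuracy tests configured below: once a harness is instantiated, a fairness section can be passed to harness.configure(). A minimal sketch, assuming the gender-based ROUGE fairness tests that langtest documents for question-answering tasks (verify the exact test names for your task and library version):

harness.configure(
    {
        "tests": {
            "defaults": {"min_pass_rate": 0.65},
            # Checks that ROUGE-1 scores stay within bounds for each gender group
            "fairness": {
                "min_gender_rouge1_score": {"min_score": 0.60},
                "max_gender_rouge1_score": {"max_score": 1.00},
            },
        }
    }
)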

Let's get started:

Instantiate the Harness class

from langtest import Harness

# Test one model against three QA benchmarks in a single harness
harness = Harness(
    task="question-answering",
    model={"model": "gpt-3.5-turbo-instruct", "hub": "openai"},
    data=[
        {"data_source": "NQ-open", "split": "test-tiny"},
        {"data_source": "MedQA", "split": "test-tiny"},
        {"data_source": "LogiQA", "split": "test-tiny"},
    ],
)
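Each entry in the data list uses the same schema as the single-dataset data argument, so existing configurations carry over directly. For comparison, the equivalent single-dataset call (same fields, a dict instead of a list):

# Single-dataset form: pass one dict instead of a list of dicts
harness_single = Harness(
    task="question-answering",
    model={"model": "gpt-3.5-turbo-instruct", "hub": "openai"},
    data={"data_source": "NQ-open", "split": "test-tiny"},
)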

Configure the accuracy tests in the Harness class

harness.configure(
    {
        "tests": {
            # Each test must reach this pass rate to be reported as passing
            "defaults": {"min_pass_rate": 0.65},

            # Accuracy tests, applied to every configured dataset
            "accuracy": {
                "llm_eval": {"min_score": 0.60},
                "min_rouge1_score": {"min_score": 0.60},
                "min_rouge2_score": {"min_score": 0.60},
                "min_rougeL_score": {"min_score": 0.60},
                "min_rougeLsum_score": {"min_score": 0.60},
            },
        }
    }
)

harness.generate() generates test cases, .run() executes them against the model, and .report() compiles the results:

harness.generate().run().report()
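The chained call is shorthand for running the three steps separately, which can be useful when inspecting intermediate results. A minimal sketch (the save directory name below is arbitrary):

harness.generate()            # build test cases for every configured dataset
harness.run()                 # execute the test cases against the model
report_df = harness.report()  # compile pass/fail results into a report

# Persist the configuration and generated test cases for later reuse
harness.save("saved_harness")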
