JohnSnowLabs / langtest

Deliver safe & effective language models
http://langtest.org/
Apache License 2.0

Feature/implement load & save for benchmark reports #999

Closed chakravarthik27 closed 4 months ago

chakravarthik27 commented 5 months ago

Description

This pull request introduces a significant upgrade to LangTest's evaluation capabilities, focusing on report management and leaderboards. These enhancements let you save benchmark reports, reload them later, and compare models on a leaderboard.

How it works:

First, create a parameter.json or parameter.yaml file in the working directory.

JSON Format

{
    "task": "question-answering",
    "model": {
        "model": "http://localhost:1234/v1/chat/completions",
        "hub": "lm-studio"
    },
    "data": [
        {
            "data_source": "MedMCQA"
        },
        {
            "data_source": "PubMedQA"
        },
        {
            "data_source": "MMLU"
        },
        {
            "data_source": "MedQA"
        }
    ],
    "config": {
        "model_parameters": {
            "max_tokens": 64
        },
        "tests": {
            "defaults": {
                "min_pass_rate": 1.0
            },
            "robustness": {
                "add_typo": {
                    "min_pass_rate": 0.70
                }
            },
            "accuracy": {
                "llm_eval": {
                    "min_score": 0.60
                }
            }
        }
    }
}

YAML Format

task: question-answering
model:
  model: http://localhost:1234/v1/chat/completions
  hub: lm-studio
data:
- data_source: MedMCQA
- data_source: PubMedQA
- data_source: MMLU
- data_source: MedQA
config:
  model_parameters:
    max_tokens: 64
  tests:
    defaults:
      min_pass_rate: 1
    robustness:
      add_typo:
        min_pass_rate: 0.7
    accuracy:
      llm_eval:
        min_score: 0.6
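Before invoking the CLI, it can help to sanity-check the file's shape. Below is a minimal sketch in plain Python; `validate_params` and `REQUIRED_KEYS` are illustrative helpers, not part of langtest, and the key names simply mirror the examples above:

```python
# Illustrative sanity check for a parameter file; plain Python, not a langtest API.
REQUIRED_KEYS = {"task", "model", "data", "config"}   # mirrors the examples above

def validate_params(params: dict) -> dict:
    """Check that a parsed parameter.json / parameter.yaml has the expected shape."""
    missing = REQUIRED_KEYS - params.keys()
    if missing:
        raise ValueError(f"parameter file is missing keys: {sorted(missing)}")
    if not all("data_source" in entry for entry in params["data"]):
        raise ValueError("every 'data' entry needs a 'data_source'")
    return params

# Usage: validate_params(json.load(open("parameter.json")))
# (for YAML, parse with yaml.safe_load instead of json.load)
```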

Then open a terminal (or Command Prompt on Windows) and run:

langtest eval --model <model name or endpoint> \
              --hub <model hub, e.g. huggingface, lm-studio, web> \
              -c <configuration file, e.g. parameter.json or parameter.yaml>

Finally, we can see the model's score and rank on the leaderboard.
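The leaderboard itself is built from benchmark reports that this PR saves to and loads from disk. As a rough sketch of the idea only (the directory layout, the `model`/`scores` field names, and the averaging are hypothetical, not the actual report schema langtest uses):

```python
import json
from pathlib import Path

def rank_models(report_dir: str) -> list[tuple[str, float]]:
    """Rank saved benchmark reports by average score (hypothetical schema)."""
    rows = []
    for path in Path(report_dir).glob("*.json"):
        report = json.loads(path.read_text())
        scores = report["scores"].values()   # e.g. {"MedMCQA": 0.71, "MMLU": 0.65}
        rows.append((report["model"], sum(scores) / len(scores)))
    # Highest average score ranks first
    return sorted(rows, key=lambda r: r[1], reverse=True)
```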


To visualize the leaderboard at any time, use the CLI command:

langtest show-leaderboard
