
Med LLM AutoEval

Overview

Med LLM AutoEval simplifies the process of evaluating LLMs using a convenient Colab notebook. This tool is ideal for developers who aim to assess the performance of LLMs quickly and efficiently.

Key Features

View a sample summary here.

Note: This project is in the early stages and primarily designed for personal use. Use it carefully and feel free to contribute.

Quick Start

Evaluation parameters

Tokens

Tokens are stored in Colab's Secrets tab. Create two secrets named "runpod" and "github" and add your RunPod API key and GitHub token, respectively.
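The notebook reads these secrets at runtime. As a minimal sketch, assuming it uses Colab's `google.colab.userdata` API for the Secrets tab (the actual notebook may fetch them differently):

    # Sketch: reading the "runpod" and "github" secrets inside Colab.
    # Assumes google.colab.userdata; enable "Notebook access" for both secrets.
    from google.colab import userdata

    runpod_token = userdata.get("runpod")   # RunPod API key
    github_token = userdata.get("github")   # GitHub personal access token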

Benchmark suites

Nous

You can compare your results with:

Open LLM

You can compare your results with those listed on the Open LLM Leaderboard.

Medical

Medical Openllm

Support for arbitrary dspy programs

Introduction

This project supports running arbitrary dspy programs. Users can pass different parameters to their dspy program through a simple interface.

Usage

To use this feature, follow these steps:

  1. Import the necessary modules:

    import argparse
    from benchmark import test
    from medprompt import MedpromptModule
  2. Define the parameters for your dspy program using the test function:

    results = test(
        model="MODEL_NAME",
        api_key="API_KEY",
        dspy_module=MedpromptModule,
        benchmark="arc",  # arc is used as an example; you can choose any of the benchmark suites above
        shots=5,
    )
  3. Execute your script.

Example

import argparse
from benchmark import test
from medprompt import MedpromptModule

if __name__ == "__main__":

    results = test(
        model="google/flan-t5-base",
        api_key="",
        dspy_module=MedpromptModule,
        benchmark="arc",
        shots=5,
    )
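Note that `argparse` is imported but not used in the example above. As an illustrative sketch only (the flag names below are hypothetical and not part of the project), the same parameters could be exposed on the command line:

    # Hypothetical sketch: driving test() from the command line with argparse.
    # Flag names and defaults are illustrative, not defined by the project.
    import argparse

    from benchmark import test
    from medprompt import MedpromptModule

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Run a dspy benchmark")
        parser.add_argument("--model", default="google/flan-t5-base")
        parser.add_argument("--api-key", default="")
        parser.add_argument("--benchmark", default="arc")
        parser.add_argument("--shots", type=int, default=5)
        args = parser.parse_args()

        results = test(
            model=args.model,
            api_key=args.api_key,
            dspy_module=MedpromptModule,
            benchmark=args.benchmark,
            shots=args.shots,
        )
        print(results)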

You may also refer to the video explaining the whole process.

Conclusion

Support for arbitrary dspy programs lets users integrate and evaluate different programs with varying parameters. By running tests with diverse configurations, users can obtain accuracy scores, compare performance, and identify the program that best suits their requirements and use cases.
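As a minimal sketch of such a comparison, assuming `test` returns a score-like value that can be printed (the exact return type is not documented here):

    # Illustrative sketch: comparing two configurations of the same dspy module.
    # Assumes test() returns a value that can be printed and compared.
    from benchmark import test
    from medprompt import MedpromptModule

    for shots in (0, 5):
        score = test(
            model="google/flan-t5-base",
            api_key="",
            dspy_module=MedpromptModule,
            benchmark="arc",
            shots=shots,
        )
        print(f"arc, {shots}-shot: {score}")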

Troubleshooting

Acknowledgements

Special thanks to EleutherAI for the lm-evaluation-harness, dmahan93 for his fork that adds agieval to the lm-evaluation-harness, NousResearch and Teknium for the Nous benchmark suite, and vllm for the additional inference speed.