https://github.com/epfLLM/meditron/blob/main/evaluation/benchmarks.py
https://github.com/epfLLM/meditron/blob/main/evaluation/README.md

Python code

Medical Competence

MedMCQA
PubMedQA
MedQA
MedicationQA
MMLU Medical

Reasoning

ARC
HellaSwag
Winogrande
GSM8K
TruthfulQA
BLURB: Biomedical Language Understanding and Reasoning Benchmark

The final code is a CLI that tests a given model on a given list of evals. The CLI is called by runpod.sh, which in turn is called by the RunPod template after the GPU instance boots up.
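A minimal sketch of what such a CLI could look like, assuming an argparse entry point. The benchmark names come from the lists above; the category keys (`medical`, `reasoning`), the flag names (`--model`, `--evals`), and every function name are hypothetical, and the evaluation step itself is only a placeholder.

```python
import argparse

# Benchmark names are taken from the lists above. The grouping keys and all
# function names in this sketch are hypothetical, not the repo's actual API.
BENCHMARKS = {
    "medical": ["MedMCQA", "PubMedQA", "MedQA", "MedicationQA", "MMLU Medical"],
    "reasoning": ["ARC", "HellaSwag", "Winogrande", "GSM8K", "TruthfulQA", "BLURB"],
}


def resolve_evals(requested):
    """Expand category names and validate benchmark names (case-insensitive)."""
    known = {name.lower(): name for names in BENCHMARKS.values() for name in names}
    resolved = []
    for item in requested:
        key = item.lower()
        if key in BENCHMARKS:
            candidates = BENCHMARKS[key]
        elif key in known:
            candidates = [known[key]]
        else:
            raise ValueError(f"unknown eval: {item!r}")
        # Keep first-seen order and drop duplicates.
        for name in candidates:
            if name not in resolved:
                resolved.append(name)
    return resolved


def build_parser():
    """Build the argument parser (flag names are assumptions)."""
    parser = argparse.ArgumentParser(description="Evaluate a model on a list of evals.")
    parser.add_argument("--model", required=True, help="model identifier to evaluate")
    parser.add_argument("--evals", nargs="+", required=True,
                        help="benchmark or category names")
    return parser


def main(argv=None):
    """Parse arguments, resolve the eval list, and report what would run."""
    args = build_parser().parse_args(argv)
    evals = resolve_evals(args.evals)
    for name in evals:
        # Placeholder: a real run would invoke the evaluation harness here.
        print(f"would evaluate {args.model} on {name}")
    return evals
```

Calling `main(["--model", "example-model", "--evals", "medical", "GSM8K"])` resolves the five medical benchmarks followed by GSM8K. To use this from a shell script such as runpod.sh, `main()` would be invoked under an `if __name__ == "__main__":` guard.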

Finally, if code like the following runs to completion in a Colab notebook, we then test it with RunPod in debug mode.

# Use %env, not !export: each ! command runs in its own subshell, so exported variables do not persist.
%env DEBUG=DEBUG
%env GITHUB_API_TOKEN=GITHUB_API_TOKEN
...

!git clone https://github.com/Technoculture/med-llm-autoeval.git
!bash ./med-llm-autoeval/runpod.sh
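Before launching the script, a short preflight check in the notebook can confirm that the environment variables are actually visible to child processes. The sketch below is illustrative only: the variable names come from the snippet above, while the helper names and the rule that any non-empty `DEBUG` value enables debug mode are assumptions, not documented behavior of runpod.sh.

```python
import os

# Name taken from the snippet above; the elided variables ("...") are not guessed.
REQUIRED_VARS = ("GITHUB_API_TOKEN",)


def missing_env(required=REQUIRED_VARS, environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


def is_debug(environ=None):
    """Assumption: any non-empty DEBUG value switches debug mode on."""
    env = os.environ if environ is None else environ
    return bool(env.get("DEBUG"))
```

Running `missing_env()` in a cell before the `!bash` line returns an empty list when everything required is set.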

Technoculture / med-llm-autoeval

Add medical evals #1
