JOHNNY-fans / MedOdyssey

Apache License 2.0

MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens

Introduction

Welcome to MedOdyssey, a medical long-context benchmark with seven length levels ranging from 4K to 200K tokens. MedOdyssey consists of two primary components: a medical-context "Needles in a Haystack" task and a series of tasks specific to medical applications, together comprising 10 datasets. The architecture of MedOdyssey is shown in the figure below.
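As a rough illustration of how a needle-in-a-haystack example can be constructed, the sketch below truncates a long context to one of the seven length levels and inserts a "needle" sentence at a chosen relative depth. This is only a minimal sketch using whitespace tokens as a proxy for model tokens; the function name, the level list, and the depth grid are assumptions, not the benchmark's actual construction code.

```python
def build_niah_example(haystack: str, needle: str, depth: float, max_tokens: int) -> str:
    """Insert a 'needle' sentence into a long context at a relative depth.

    depth=0.0 places the needle at the start, depth=1.0 at the end.
    Whitespace tokens are used here as a rough proxy for model tokens.
    """
    tokens = haystack.split()[:max_tokens]  # truncate to the target length level
    pos = int(len(tokens) * depth)          # insertion point determined by depth
    return " ".join(tokens[:pos] + [needle] + tokens[pos:])

# Hypothetical sweep: seven length levels crossed with five insertion depths,
# matching the 20x7x5 example count pattern in the statistics table.
levels = [4_000, 8_000, 16_000, 32_000, 64_000, 128_000, 200_000]
depths = [0.0, 0.25, 0.5, 0.75, 1.0]
```

Sweeping depth as well as length lets the evaluation distinguish models that only attend to the beginning or end of the context from those that retrieve information anywhere in it.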

Dataset Statistics

| Task | Annotation | # Examples | Avg. Len | MIC | NFI | CIR | Eval Metrics |
|---|---|---|---|---|---|---|---|
| En.NIAH | Auto & Human | 20×7×5 | 179.2k/32 | | | | Acc. |
| Zh.NIAH | Auto & Human | 20×7×5 | 45.6k/10.2 | | | | Acc. |
| En.Counting | Auto | 4×7 | 179.0k/13.6 | | | | Acc. |
| Zh.Counting | Auto | 4×7 | 45.6k/12.3 | | | | Acc. |
| En.KG | Auto & Human | 100 | 186.4k/68.8 | | | | P., R., F1 |
| Zh.KG | Auto & Human | 100 | 42.5k/2.0 | | | | P., R., F1 |
| En.Term | Auto | 100 | 183.1k/11.7 | | | | Acc. |
| Zh.Term | Auto | 100 | 32.6k/7.0 | | | | Acc. |
| Zh.Case | Auto & Human | 100 | 47.7k/1.3 | | | | Acc. |
| Zh.Table | Auto & Human | 100 | 53.6k/1.4 | | | | P., R., F1 |

The table above gives the dataset statistics, where "MIC" stands for Maximum Identical Context, "NFI" for Novel Facts Injection, and "CIR" for Counter-intuitive Reasoning.
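The KG and Table tasks are scored with precision, recall, and F1 over the extracted items. A minimal set-based sketch of that scoring is shown below; the function name and the set-of-triples formulation are assumptions, and the benchmark's exact matching rules may differ.

```python
def prf1(predicted: set, gold: set) -> tuple:
    """Set-based precision, recall, and F1 over extracted items (e.g. KG triples)."""
    tp = len(predicted & gold)                     # correctly extracted items
    p = tp / len(predicted) if predicted else 0.0  # precision: fraction of predictions that are correct
    r = tp / len(gold) if gold else 0.0            # recall: fraction of gold items recovered
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0   # harmonic mean of precision and recall
    return p, r, f1
```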

Baselines

We surveyed current state-of-the-art long-context LLMs and present the performance of two kinds of baselines on MedOdyssey. For closed-source commercial LLMs, we call the official APIs to obtain responses for each task; open-source models we deploy and run inference on ourselves. The LLMs and versions we selected are as follows:

Overall Evaluation Results

Main Results of Needles in a Haystack

Notes: By default, answers are scored with the exact string-matching strategy; SSM denotes the subset string-matching strategy.
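As we read the note above, the two strategies differ in how the gold answer is compared against the model response. The sketch below is a hedged interpretation, not the benchmark's scoring code: the function names and the lowercase/whitespace normalization are assumptions.

```python
def exact_match(response: str, answer: str) -> bool:
    """Default strategy: the normalized response must equal the gold answer exactly."""
    return response.strip().lower() == answer.strip().lower()

def subset_match(response: str, answer: str) -> bool:
    """SSM: the gold answer need only appear somewhere inside the response."""
    return answer.strip().lower() in response.strip().lower()
```

Subset matching is more lenient toward chatty models that wrap the answer in a full sentence, which is why reporting both strategies gives a fairer picture of retrieval accuracy.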