KwanWaiChung / M4LE

Code for M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models
MIT License
22 stars 0 forks source link

M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models


📃 Paper 🤗 Dataset

📢 Update

[2024/07/27] We released the dataset that supports up to 128K. We also released the code for constructing the test instances. See 5.2. Data Creation for more details.
[2024/05/16] Our paper is accepted to ACL 2024 main.
[2023/10/31] We released the M4LE paper.


📚 Content


📘 1. Introduction [Back to Top]

M4LE is a Multi-ability, Multi-range, Multi-task, bilingual benchmark for long-context evaluation. We categorize long-context understanding into five distinct abilities by considering whether it is required to identify single or multiple spans in long contexts based on explicit or semantic hints. Specifically, these abilities are explicit single-span, semantic single-span, explicit multiple-span, semantic multiple-span, and global. Different from previous long-context benchmarks that simply compile from a set of existing long NLP benchmarks, we introduce an automated method to transform short-sequence tasks into a comprehensive long-sequence scenario encompassing all these capabilities.

M4LE consists of 36 tasks, covering 11 task types and 12 domains. For each task, we construct 200 instances for each context length bucket (1K, 2K, 4K, 6K, 8K, 12K, 16K, 24K, 32K). Due to computation and cost constraints, our paper evaluated 11 well-established LLMs on instances up to the 8K context length bucket. For more details, please refer to the paper.


This figure compres M4LE with existing long-context benchmarks.

🏆 2. Leaderboard [Back to Top]

For each task, the score is normalized by the score of GPT-3.5-Turbo-16K on the 1K context. The equation for normalization can be found in the paper. We use the average normalized score obtained from context length buckets of 1K, 2K, 4K, 6K, and 8K to compare different models.

Model 1k 2k 4k 6k 8k Avg. Score
GPT-3.5-Turbo-16K 0.50 0.47 0.42 0.39 0.36 0.43
Vicuna-13B-v1.5-16K 0.48 0.45 0.40 0.33 0.23 0.38
LongChat-7B-v1.5-32K 0.41 0.37 0.33 0.27 0.24 0.32
LongChat-13B-16K 0.41 0.37 0.33 0.25 0.21 0.31
ChatGLM2-6B-32K 0.39 0.34 0.30 0.27 0.26 0.31
Vicuna-7B-v1.5-16K 0.43 0.39 0.32 0.26 0.14 0.31
LLaMA2-13B-Chat 0.44 0.37 0.28 0.23 0.21 0.31
LLaMA2-13B 0.44 0.38 0.28 0.22 0.19 0.30
ChatGLM2-6B 0.40 0.33 0.25 0.21 0.17 0.27
LLaMA2-7B-Chat 0.41 0.34 0.25 0.13 0.11 0.25
LLaMA2-7B 0.39 0.33 0.25 0.13 0.07 0.23

📊 3. Results [Back to Top]


The normalized scores of various models in different context lengths (left), accompanied by the slopes of the corresponding best-fit lines (right).

🛠 4.️ Setup [Back to Top]

The following command sets up a conda environment for inference and evaluation.

conda env create --name m4le --file environment.yml

🗂️ 5. Data [Back to Top]

5.1. Load Data

As described in the paper, we adopt different open-sourced datasets to construct 36 tasks in M4LE. Each task has a jsonl file containing all the testing instances. You can load the data from the huggingface hub:

from datasets import load_dataset
tasks = [
    "arxiv",
    "bigpatent_global_cls",
    "bigpatent_global_sum",
    "booksum",
    "c3",
    "cepsum",
    "clts+",
    "cnewsum",
    "cnnnews",
    "drcd_explicit-single",
    "drcd_semantic-single",
    "duorc",
    "dureader",
    "hotpotqa",
    "lcsts",
    "marc",
    "mnds-news_explicit-single",
    "mnds-news_explicit-multiple",
    "mnds-news_semantic-multiple",
    "ncls",
    "news-commentary-en2zh",
    "news-commentary-zh2en",
    "news2016",
    "newsqa",
    "nq-open",
    "online-shopping",
    "open-subtitles-en2zh",
    "open-subtitles-zh2en",
    "pubmed",
    "tedtalks-en2zh",
    "tedtalks-zh2en",
    "thucnews_explicit-single",
    "thucnews_explicit-multiple",
    "thucnews_semantic-multiple",
    "triviaqa",
    "wiki2019zh",
    "wikihow",
    "wikitext-103",
    "wow",
]

for task in tasks:
    data = load_dataset('wckwan/M4LE', task, split='test')

Each testing instance follows this format:

{
    "instruction": "<task description>",
    "input": "<task input with one-shot example>",
    "answers": ["<answer1>", "<answer2>"],
    "input_length": <int, number of words in instruction and input separated by space>,
    "total_length": <int, number of words in instruction, input and gold answer separated by space>,
    "length_bucket": <int, the length bucket to which this instance belongs>
}

5.2. Data Creation

We also provide the scripts to construct the instances from the raw datasets.
First, download and preprocess the data. It downloads many source datasets which takes lots of time and space. It is possible to encounter network errors during the download process where you have to edit the script to resume where you left off.

./raw_data/preprocess.sh
If the preprocessing script is successful, you should have the following folder structure ``` raw_data/ ├── classification/ │ ├── THUCNews/ │ ├── arxiv-dataset/ │ ├── marc/ │ └── online_shopping_10_cats/ ├── nli/ │ ├── wiki_zh/ │ └── wikitext-2-raw/ ├── qa/ │ ├── DRCD/ │ ├── __MACOSX/ │ │ └── newsqa-data-v1/ │ ├── c3/ │ ├── duorc/ │ │ ├── dataset/ │ │ └── preprocessing/ │ │ ├── data/ │ │ └── utils/ │ ├── dureader/ │ │ ├── devset/ │ │ ├── evaluation_metric/ │ │ ├── testset/ │ │ └── trainset/ │ ├── hotpotqa/ │ ├── natural_questions/ │ ├── newsqa/ │ │ └── stories/ │ └── triviaqa/ │ ├── evidence/ │ │ ├── web/ │ │ └── wikipedia/ │ ├── qa/ │ └── triviaqa-unfiltered/ ├── summarization/ │ ├── CEPSUM/ │ ├── CNewSum_v2/ │ │ └── final/ │ ├── NCLS-Data/ │ │ ├── EN2ZHSUM/ │ │ └── ZH2ENSUM/ │ ├── QMSum/ │ │ ├── data/ │ │ │ ├── ALL/ │ │ │ ├── Academic/ │ │ │ ├── Committee/ │ │ │ └── Product/ │ │ ├── extracted_span/ │ │ ├── figures/ │ │ └── model_output/ │ ├── arxiv-dataset/ │ ├── bigPatentData/ │ │ ├── test/ │ │ ├── train/ │ │ └── val/ │ ├── clts/ │ ├── cnn/ │ │ └── stories/ │ ├── gov-report/ │ │ ├── crs/ │ │ ├── gao/ │ │ └── split_ids/ │ ├── lcsts/ │ ├── news2016zh/ │ ├── pubmed-dataset/ │ └── wikihow/ ├── translation/ │ ├── News-Commentary_v16/ │ │ ├── en/ │ │ │ └── News-Commentary/ │ │ │ └── xml/ │ │ │ └── en/ │ │ └── zh/ │ │ └── News-Commentary/ │ │ └── xml/ │ │ └── zh/ │ ├── opensubtitles/ │ │ └── OpenSubtitles/ │ │ └── xml/ │ │ ├── en/ │ │ ├── zh_cn/ │ └── tedtalk/ └── wow/ ```

Construct the test instances with the following script. To customize the length buckets, edit the variable buckets in create_data.sh

./data/create_data.sh

📝 6. Task [Back to Top]

Ability Task Name Task Type Language Description
Explicit Single mnds-news_explicit-single CLS + RET En Classify a specified news article.
Explicit Single thucnews_explicit-single CLS + RET Zh Classify a specified news article.
Explicit Single newsqa QA + RET En Answer a question based on a specified news article.
Explicit Single c3 QA + RET Zh Answer a multi-choice question based on a textbook extract.
Explicit Single wow RET En Return the ID of the article related to a specified topic.
Explicit Single drcd_explicit-single RET Zh Return the ID of the article related to a specified topic.
Explicit Single cnnnews SUM + RET En Summarize a specified news article.
Explicit Single cepsum SUM + RET Zh Summarize a specified product description.
Explicit Single lcsts SUM + RET Zh Summarize a specified news article.
Explicit Single ncls SUM + RET En, Zh Summarize a specified news article.
Explicit Multiple mnds-news_explicit-multiple CLS + RET En Return the IDs of all the articles belong to a specified class.
Explicit Multiple thucnews_explicit-multiple CLS + RET Zh Return the IDs of all the articles belong to a specified class.
Explicit Multiple marc CLS + RET En, Zh Return the IDs of all the positive product reviews.
Explicit Multiple online-shopping CLS + RET Zh Return the IDs of all the positive product reviews.
Semantic Single wikitext-103 NLI + RET En Return the ID of the paragraph that continues a query paragraph.
Semantic Single wiki2019zh NLI + RET Zh Return the ID of the paragraph that continues a query paragraph.
Semantic Single duorc QA En Answer a question based on multiple movie plots.
Semantic Single nq-open QA En Answer a question based on multiple wikipedia paragraphs.
Semantic Single dureader QA Zh Answer a question based on multiple web snippets.
Semantic Single drcd_semantic-single QA Zh Answer a question based on multiple wikipedia paragraphs.
Semantic Single wikihow SUM + RET En Summarize an article based on a given topic.
Semantic Single news2016 SUM + RET Zh Summarize a news article based on a given title.
Semantic Single tedtalks-en2zh/tedtalks-zh2en TRAN + RET En, Zh Translate a Ted Talk transcript based on a given title.
Semantic Multiple mnds-news_semantic-multiple CLS + CNT En Return the number of news articles belonging to a specified class.
Semantic Multiple thucnews_semantic-multiple CLS + CNT Zh Return the number of news articles belonging to a specified class.
Semantic Multiple hotpotqa QA En Answer a question based on multiple wikipedia paragraphs.
Global bigpatent_global_cls CLS En Classify a patent document.
Global triviaqa QA En Answer a question based on a web snippet.
Global arxiv SUM En Summarize an academic paper.
Global bigpatent_global_sum SUM En Summarize a patent document.
Global pubmed SUM En Summarize a medical paper.
Global booksum SUM En Summarize one or more chapters of a book.
Global cnewsum SUM Zh Summarize a news article.
Global clts+ SUM Zh Summarize a news article.
Global open-subtitles-en2zh/open-subtitles-zh2en TRAN En, Zh Translate the movie subtitles.
Global news-commentary-en2zh/news-commentary-zh2en TRAN En, Zh Translate the movie subtitles.

🧠 7. Inference [Back to Top]

In the paper, we use the prompt format f"{instruction}\n{input}". We recommend editing or using our `inference.py`` code. To run the code, provide the HuggingFace model name and task name:

python inference.py \
--model_path lmsys/vicuna-13b-v1.5-16k \
--task nq-open

For LLaMA2 models, we recommend enabling dynamic NTK scaling:

python inference.py \
--model_path meta-llama/Llama-2-7b \
--task nq-open \
--load_model_args '{"rope_scaling": {"type": "dynamic", "factor": 2.0}}'

Additional arguments:

The predictions will be saved at outputs/{model}/{task}.jsonl


📈 8. Evaluation [Back to Top]

Run the following script to evaluate:

python evaluation.py

It will evaluate all the models saved at outputs/{model}/. The results will be saved at results/all_results.csv, where each row contains


📄 Citation

If you find our paper and resources useful, please consider citing our paper:

@misc{kwan_m4le_2023,
  title = {{{M4LE}}: {{A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark}} for {{Large Language Models}}},
  author = {Kwan, Wai-Chung and Zeng, Xingshan and Wang, Yufei and Sun, Yusen and Li, Liangyou and Shang, Lifeng and Liu, Qun and Wong, Kam-Fai},
  year = {2023},
}