
STBench: Assessing the Ability of Large Language Models in Spatio-Temporal Analysis

Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS)

🤗 Hugging Face Dataset • 📃 Paper

(Overview figure: see overview.png)

STBench is a benchmark to evaluate the ability of large language models in spatio-temporal analysis. This benchmark consists of 13 distinct tasks and over 60,000 question-answer pairs, covering four dimensions: knowledge comprehension, spatio-temporal reasoning, accurate computation and downstream applications.

All data samples in STBench are in the form of text completion. An instance is as follows:

Question: Below is the coordinate information and related comments of a point of interest: ... Please answer the category of this point of interest.
Options: (1) xxxx, (2) xxxx, (3) xxxx, ...
Please answer one option.
Answer: The answer is option (

The model is expected to complete the text, i.e., it should generate an option number. Therefore, to benchmark a model with STBench, it is necessary to use a text completion API rather than a chat completion API. For chat models that only provide a chat completion API, we suggest instructing the model to complete the text through the system prompt:

[{"role": "system", "content": "you are a helpful text completion assistant. Please continue writing the text entered by the human."}, {"role": "human", "content": "Question: Below is the coordinate information and related comments of a point of interest: ... Please answer the category of this point of interest.\nOptions: (1) xxxx, (2) xxxx, (3) xxxx, ...\nPlease answer one option.\nAnswer: The answer is option ("}]

Quick Start

We have benchmarked 13 distinct large language models and here we provide a simple guide to reproduce our experiments.

  1. Dependency Installation

    Run the following command to install dependencies:

    pip install -r requirements.txt
  2. Model Downloading

    Our experiments on open-source models are based on ModelScope, and these models can be downloaded with the following command:

    cd code
    python download_llms.py
  3. Basic Prompting

    Run the following command to benchmark all models on all 13 tasks:

    python basic_prompting.py
  4. In-Context Learning

    Run the following command to evaluate the performance of all models with in-context learning:

    python icl_prompting.py
  5. Chain-of-Thought Prompting

    To conduct experiments with chain-of-thought prompting for all models, run the following command:

    python cot_prompting.py
  6. Fine-tuning

    Run the following command to fine-tune the model and evaluate the fine-tuned model:

    python fine_tuning.py

Detailed Usage

This repository is organized as follows:

Project
  |-- LICENSE
  |-- overview.png
  |-- README.md
  |-- requirements.txt
  |-- datasets                  # all datasets can be found in this directory
      |-- basic                 # the main datasets of STBench, consisting of over 60,000 QA pairs
      |-- icl                   # two samples for each task to perform two-shot prompting
      |-- cot                   # two samples containing reasoning for each task to perform CoT prompting
      |-- sft                   # training and validation datasets for fine-tuning
  |-- code
      |-- model_inference       # calling the API of each large language model
      |-- model_finetuning      # fine-tuning code
      |-- download_llms.py      # downloading open-source models
      |-- basic_prompting.py    # running experiments with basic prompting
      |-- icl_prompting.py      # running experiments with ICL prompting
      |-- cot_prompting.py      # running experiments with CoT prompting
      |-- fine_tuning.py        # running experiments with fine-tuning
      |-- result_parser.py      # code for identifying the final answer of the model
      |-- config.py             # declarations of configuration, such as the file path for each task
  1. To benchmark a new model, namely NEW_MODEL

    a. Write your code for calling the API of this model in code/model_inference/new_model.py, and modify code/model_inference/__init__.py accordingly.

    b. Add the model to the model list in code/basic_prompting.py (see the sketch after this list).

  2. To include a new dataset, namely new_dataset.jsonl, for a task NEW_TASK

    a. Put your dataset here: datasets/basic/new_dataset.jsonl

    b. Modify code/result_parser.py and implement a function new_task_parser() to parse the final answer from the output of the LLMs

    c. Modify code/config.py to specify the mapping from NEW_TASK to the dataset path datasets/basic/new_dataset.jsonl and the mapping from NEW_TASK to the result parser new_task_parser()

    d. Add the task to the task list in code/basic_prompting.py
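As a rough illustration of the two extension steps above, the sketch below shows what a new model wrapper and a result parser might look like. The endpoint, function names, and signatures are assumptions for illustration only; the actual interface expected by code/model_inference/__init__.py and code/config.py should be checked in the repository.

```python
# Hypothetical sketch only: the exact interface expected by
# code/model_inference/__init__.py and code/config.py may differ.

# --- code/model_inference/new_model.py ---
import requests  # assuming the new model is served over a simple HTTP API


def new_model_inference(prompt: str) -> str:
    """Send a text-completion prompt to the (hypothetical) model endpoint
    and return the raw generated text."""
    response = requests.post(
        "http://localhost:8000/completion",        # placeholder endpoint
        json={"prompt": prompt, "max_tokens": 32},  # placeholder request schema
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["text"]                  # placeholder response field


# --- code/result_parser.py ---
import re


def new_task_parser(output: str) -> str:
    """Extract the option number from a completion such as '1) ...' or
    'The answer is option (3)'. Returns an empty string if nothing matches."""
    match = re.search(r"\(?\s*(\d+)\s*\)", output)
    return match.group(1) if match else ""
```

With these pieces in place, NEW_TASK is registered in code/config.py and NEW_MODEL is added to the model list in code/basic_prompting.py, as described in the steps above.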

Experimental Results

Columns are grouped by dimension: Knowledge Comprehension (PCR, PI, URFR, ARD), Spatio-temporal Reasoning (PTRD, PRRD, TRRD, TI), Accurate Computation (DD, TTRA), and Downstream Applications (TAD, TC, TP).

| Model | PCR | PI | URFR | ARD | PTRD | PRRD | TRRD | TI | DD | TTRA | TAD | TC | TP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ChatGPT | 0.7926 | 0.5864 | 0.3978 | 0.8358 | 0.7525 | 0.9240 | 0.0258 | 0.3342 | 0.1698 | 0.1048 | 0.5382 | 0.4475 | - |
| GPT-4o | 0.9588 | 0.7268 | 0.6026 | 0.9656 | - | 0.9188 | 0.1102 | 0.4416 | 0.5434 | 0.3404 | 0.6016 | - | - |
| ChatGLM2 | 0.2938 | 0.5004 | 0.2661 | 0.2176 | 0.2036 | 0.5216 | 0.2790 | 0.5000 | 0.1182 | 0.1992 | 0.5000 | 0.3333 | 231.2 |
| ChatGLM3 | 0.4342 | 0.5272 | 0.2704 | 0.2872 | 0.3058 | 0.8244 | 0.1978 | 0.6842 | 0.1156 | 0.1828 | 0.5000 | 0.3111 | 224.5 |
| Phi-2 | - | 0.5267 | - | 0.2988 | - | - | - | 0.5000 | 0.1182 | 0.0658 | 0.5000 | 0.3333 | 206.9 |
| Llama-2-7B | 0.2146 | 0.4790 | 0.2105 | 0.2198 | 0.2802 | 0.6606 | 0.2034 | 0.5486 | 0.1256 | 0.2062 | 0.5098 | 0.3333 | 189.3 |
| Vicuna-7B | 0.3858 | 0.5836 | 0.2063 | 0.2212 | 0.3470 | 0.7080 | 0.1968 | 0.5000 | 0.1106 | 0.1728 | 0.5000 | 0.2558 | 188.1 |
| Gemma-2B | 0.2116 | 0.5000 | 0.1989 | 0.1938 | 0.4688 | 0.5744 | 0.2014 | 0.5000 | 0.1972 | 0.2038 | 0.5000 | 0.3333 | 207.7 |
| Gemma-7B | 0.4462 | 0.5000 | 0.2258 | 0.2652 | 0.3782 | 0.9044 | 0.1992 | 0.5000 | 0.1182 | 0.1426 | 0.5000 | 0.3333 | 139.4 |
| DeepSeek-7B | 0.2160 | 0.4708 | 0.2071 | 0.1938 | 0.2142 | 0.6424 | 0.1173 | 0.4964 | 0.1972 | 0.1646 | 0.5000 | 0.3333 | 220.8 |
| Falcon-7B | 0.1888 | 0.5112 | 0.1929 | 0.1928 | 0.1918 | 0.4222 | 0.2061 | 0.7072 | 0.1365 | 0.2124 | 0.5000 | 0.3309 | 3572.8 |
| Mistral-7B | 0.3526 | 0.4918 | 0.2168 | 0.3014 | 0.4476 | 0.7098 | 0.0702 | 0.4376 | 0.1182 | 0.1094 | 0.5000 | 0.3333 | 156.8 |
| Qwen-7B | 0.2504 | 0.6795 | 0.2569 | 0.2282 | 0.2272 | 0.5762 | 0.1661 | 0.4787 | 0.1324 | 0.2424 | 0.5049 | 0.3477 | 205.2 |
| Yi-6B | 0.3576 | 0.5052 | 0.2149 | 0.1880 | 0.5536 | 0.8264 | 0.1979 | 0.5722 | 0.1284 | 0.2214 | 0.5000 | 0.3333 | 156.2 |