
STBench: Assessing the Ability of Large Language Models in Spatio-Temporal Analysis

Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS)

🤗 Hugging Face Dataset • 📃 Paper

(Overview figure: see overview.png)

STBench is a benchmark to evaluate the ability of large language models in spatio-temporal analysis. This benchmark consists of 13 distinct tasks and over 60,000 question-answer pairs, covering four dimensions: knowledge comprehension, spatio-temporal reasoning, accurate computation and downstream applications.

All data samples in STBench are in the form of text completion. An instance is as follows:

Question: Below is the coordinate information and related comments of a point of interest: ... Please answer the category of this point of interest.
Options: (1) xxxx, (2) xxxx, (3) xxxx, ...
Please answer one option.
Answer: The answer is option (

The model is expected to complete the text, i.e., it should generate an option number. Therefore, to benchmark a model with STBench, it is necessary to use a text completion API rather than a chat completion API. For chat models that only provide a chat completion API, we suggest instructing the model to complete the text through the system prompt:

[{"role": "system", "content": "you are a helpful text completion assistant. Please continue writing the text entered by the human."}, {"role": "human", "content": "Question: Below is the coordinate information and related comments of a point of interest: ... Please answer the category of this point of interest.\nOptions: (1) xxxx, (2) xxxx, (3) xxxx, ...\nPlease answer one option.\nAnswer: The answer is option ("}]

Quick Start

We have benchmarked 13 distinct large language models and here we provide a simple guide to reproduce our experiments.

  1. Dependency Installation

    Run the following command to install dependencies:

    pip install -r requirements.txt
  2. Model Downloading

    Our experiments on open-source models are based on ModelScope, and these models can be downloaded with the following command:

    cd code
    python download_llms.py
  3. Basic Prompting

    Run the following command to benchmark all models on all 13 tasks:

    python basic_prompting.py
  4. In-Context Learning

    Run the following command to evaluate the performance of all models with in-context learning:

    python icl_prompting.py
  5. Chain-of-Thought Prompting

    To conduct experiments with chain-of-thought prompting for all models, run the following command:

    python cot_prompting.py
  6. Fine-tuning

    Run the following command to fine-tune the model and evaluate the fine-tuned model:

    python fine_tuning.py

Detailed Usage

This repository is organized as follows:

Project
  |-- LICENSE
  |-- overview.png
  |-- README.md
  |-- requirements.txt
  |-- datasets                  # all datasets can be found in this directory
      |-- basic                 # the main datasets of STBench, consisting of over 60,000 QA pairs
      |-- icl                   # two samples for each task to perform two-shot prompting
      |-- cot                   # two samples containing reasoning for each task to perform CoT prompting
      |-- sft                   # training and validation datasets for fine-tuning
  |-- code
      |-- model_inference       # calling the API of each large language model
      |-- model_finetuning      # fine-tuning code
      |-- download_llms.py      # downloading open-source models
      |-- basic_prompting.py    # running experiments with basic prompting
      |-- icl_prompting.py      # running experiments with ICL prompting
      |-- cot_prompting.py      # running experiments with CoT prompting
      |-- fine_tuning.py        # running experiments with fine-tuning
      |-- result_parser.py      # code for identifying the final answer of the model
      |-- config.py             # declarations of configuration, such as the file path for each task
  1. To benchmark a new model, namely NEW_MODEL

    a. Write your code for calling the API of this model in code/model_inference/new_model.py, and modify code/model_inference/__init__.py accordingly.

    b. Add the model to the model list in code/basic_prompting.py (see the sketch after this list).

  2. To include a new dataset, namely new_dataset.jsonl, for a task NEW_TASK

    a. Put your dataset here: datasets/basic/new_dataset.jsonl

    b. Modify code/result_parser.py and implement a function new_task_parser() to parse the final answer from the output of the LLMs

    c. Modify code/config.py to specify the mapping from NEW_TASK to the dataset path datasets/basic/new_dataset.jsonl and the mapping from NEW_TASK to the result parser new_task_parser()

    d. Add the task to the task list in code/basic_prompting.py
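As a rough illustration of the two extension steps above, the sketch below shows what a new model wrapper and a result parser might look like. The endpoint, function names, and signatures are assumptions for illustration only; the actual interface expected by code/model_inference/__init__.py and code/config.py should be checked in the repository.

```python
# Hypothetical sketch only: the exact interface expected by
# code/model_inference/__init__.py and code/config.py may differ.

# --- code/model_inference/new_model.py ---
import requests  # assuming the new model is served over a simple HTTP API


def new_model_inference(prompt: str) -> str:
    """Send a text-completion prompt to the (hypothetical) model endpoint
    and return the raw generated text."""
    response = requests.post(
        "http://localhost:8000/completion",        # placeholder endpoint
        json={"prompt": prompt, "max_tokens": 32},  # placeholder request schema
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["text"]                  # placeholder response field


# --- code/result_parser.py ---
import re


def new_task_parser(output: str) -> str:
    """Extract the option number from a completion such as '1) ...' or
    'The answer is option (3)'. Returns an empty string if nothing matches."""
    match = re.search(r"\(?\s*(\d+)\s*\)", output)
    return match.group(1) if match else ""
```

With these pieces in place, NEW_TASK is registered in code/config.py and NEW_MODEL is added to the model list in code/basic_prompting.py, as described in the steps above.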

Experimental Results

Columns are grouped by dimension: Knowledge Comprehension (PCR, PI, URFR, ARD), Spatio-temporal Reasoning (PTRD, PRRD, TRRD, TI), Accurate Computation (DD, TTRA), and Downstream Applications (TAD, TC, TP).

| Model | PCR | PI | URFR | ARD | PTRD | PRRD | TRRD | TI | DD | TTRA | TAD | TC | TP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ChatGPT | 0.7926 | 0.5864 | 0.3978 | 0.8358 | 0.7525 | 0.9240 | 0.0258 | 0.3342 | 0.1698 | 0.1048 | 0.5382 | 0.4475 | - |
| GPT-4o | 0.9588 | 0.7268 | 0.6026 | 0.9656 | - | 0.9188 | 0.1102 | 0.4416 | 0.5434 | 0.3404 | 0.6016 | - | - |
| ChatGLM2 | 0.2938 | 0.5004 | 0.2661 | 0.2176 | 0.2036 | 0.5216 | 0.2790 | 0.5000 | 0.1182 | 0.1992 | 0.5000 | 0.3333 | 231.2 |
| ChatGLM3 | 0.4342 | 0.5272 | 0.2704 | 0.2872 | 0.3058 | 0.8244 | 0.1978 | 0.6842 | 0.1156 | 0.1828 | 0.5000 | 0.3111 | 224.5 |
| Phi-2 | - | 0.5267 | - | 0.2988 | - | - | - | 0.5000 | 0.1182 | 0.0658 | 0.5000 | 0.3333 | 206.9 |
| Llama-2-7B | 0.2146 | 0.4790 | 0.2105 | 0.2198 | 0.2802 | 0.6606 | 0.2034 | 0.5486 | 0.1256 | 0.2062 | 0.5098 | 0.3333 | 189.3 |
| Vicuna-7B | 0.3858 | 0.5836 | 0.2063 | 0.2212 | 0.3470 | 0.7080 | 0.1968 | 0.5000 | 0.1106 | 0.1728 | 0.5000 | 0.2558 | 188.1 |
| Gemma-2B | 0.2116 | 0.5000 | 0.1989 | 0.1938 | 0.4688 | 0.5744 | 0.2014 | 0.5000 | 0.1972 | 0.2038 | 0.5000 | 0.3333 | 207.7 |
| Gemma-7B | 0.4462 | 0.5000 | 0.2258 | 0.2652 | 0.3782 | 0.9044 | 0.1992 | 0.5000 | 0.1182 | 0.1426 | 0.5000 | 0.3333 | 139.4 |
| DeepSeek-7B | 0.2160 | 0.4708 | 0.2071 | 0.1938 | 0.2142 | 0.6424 | 0.1173 | 0.4964 | 0.1972 | 0.1646 | 0.5000 | 0.3333 | 220.8 |
| Falcon-7B | 0.1888 | 0.5112 | 0.1929 | 0.1928 | 0.1918 | 0.4222 | 0.2061 | 0.7072 | 0.1365 | 0.2124 | 0.5000 | 0.3309 | 3572.8 |
| Mistral-7B | 0.3526 | 0.4918 | 0.2168 | 0.3014 | 0.4476 | 0.7098 | 0.0702 | 0.4376 | 0.1182 | 0.1094 | 0.5000 | 0.3333 | 156.8 |
| Qwen-7B | 0.2504 | 0.6795 | 0.2569 | 0.2282 | 0.2272 | 0.5762 | 0.1661 | 0.4787 | 0.1324 | 0.2424 | 0.5049 | 0.3477 | 205.2 |
| Yi-6B | 0.3576 | 0.5052 | 0.2149 | 0.1880 | 0.5536 | 0.8264 | 0.1979 | 0.5722 | 0.1284 | 0.2214 | 0.5000 | 0.3333 | 156.2 |