Does A cause B? Or does B cause A?
Pairwise causal discovery is a fundamental open problem: given two variables, the task is to determine which one causes the other. As a key benchmark for this task, Mooij et al. (2016) released the Tuebingen cause-effect pairs dataset, comprising 108 pairs of real-world variables.
As a fun exploration, we present these pairs of variables as prompts to ChatGPT to study the ability of large language models to infer causality. ChatGPT performs significantly better than current SoTA algorithms on the Tuebingen benchmark: on the 74 pairs we have tried so far, it obtains an accuracy of 92.5%, whereas the best known accuracy using conventional discovery methods is 70-80% [Mooij et al. (2016), Tagasovska et al. (2020), Compton et al. (2020), Salem et al. (2022)].
Crucially, ChatGPT does not need access to the data for each variable. It can infer causality simply from the variable names. We use the following prompt for each cause-effect pair:
> Does changing [varA] cause a change in [varB]? Please answer in a single word: Yes or No.
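Instantiating this template and interpreting the reply can be done with two small helpers. This is a minimal sketch; the function names are ours and not part of the repository, and sending the prompt through a chat API is left out:

```python
def make_prompt(var_a: str, var_b: str) -> str:
    """Build the causal query for one direction of a cause-effect pair."""
    return (f"Does changing {var_a} cause a change in {var_b}? "
            "Please answer in a single word: Yes or No.")

def parse_answer(reply: str) -> bool:
    """Map ChatGPT's single-word reply to a boolean (True = 'Yes').

    Tolerates surrounding whitespace, a trailing period, and casing,
    since the model does not always answer with a bare 'Yes'/'No'.
    """
    return reply.strip().rstrip(".").lower() == "yes"
```

For each pair we would query both directions, `make_prompt(A, B)` and `make_prompt(B, A)`, and compare the parsed answers against the benchmark's ground truth.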
We adopt the following protocol: for each pair, we query ChatGPT in both directions (Does A cause B? and Does B cause A?) and record whether each answer matches the ground truth.
This repository contains four files:

- results.txt: A CSV file containing the results for each cause-effect pair. The first two columns record the result of "Does A cause B?" and "Does B cause A?", respectively: 1 means ChatGPT output the correct answer and 0 means it output the incorrect answer. This file is based on the README.txt file provided by the Tuebingen benchmark.
- prompts.txt: For reproducibility, the exact prompt used for each cause-effect pair.
- pairmeta.txt: The recommended weights to use when computing the overall accuracy on the benchmark.
- compute_benchmark_accuracy.ipynb: A simple notebook that uses results.txt and pairmeta.txt to compute the overall accuracy on the benchmark.

We'll soon be updating all 108 pairs! To add a new cause-effect pair,
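The weighted accuracy computed by compute_benchmark_accuracy.ipynb can be sketched as below. The data layout is an assumption for illustration (one correctness flag per direction and one weight per pair, as recommended by the benchmark); the notebook defines the actual parsing of results.txt and pairmeta.txt:

```python
def weighted_accuracy(results, weights):
    """Weight-averaged fraction of correct answers.

    results: {pair_id: (correct_a_to_b, correct_b_to_a)} with 1/0 entries.
    weights: {pair_id: float}, the per-pair weights from pairmeta.txt.
    Each pair's score averages the two query directions (an assumed
    convention for this sketch).
    """
    total = sum(weights[p] for p in results)
    score = sum(weights[p] * (ab + ba) / 2 for p, (ab, ba) in results.items())
    return score / total
```

For example, one pair correct in both directions and another correct in only one, with equal weights, gives an accuracy of 0.75.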
WARNING: ChatGPT is a large language model and provides no guarantee of returning the correct causal direction. Answers from ChatGPT or this repo should not be treated as established causal relationships; we provide these results only for the purpose of exploratory research. In practice, we expect that domain experts will need to verify such results before using the inferred causal relationships in any downstream application.