Does A cause B? Or does B cause A?
Pairwise causal discovery is a fundamental open problem: given two variables, the task is to determine which one causes the other. As a key benchmark for this task, Mooij et al. (2016) released the Tuebingen cause-effect pairs dataset, comprising 108 pairs of real-world variables.
As a fun exploration, we present these pairs of variables as prompts to ChatGPT to study the ability of large language models to infer causality. ChatGPT performs significantly better than current SoTA algorithms on the Tuebingen benchmark: on the 74 pairs we have tried so far, it obtains an accuracy of 92.5%, whereas the best known accuracy using conventional discovery methods is 70-80% [Mooij et al. (2016), Tagasovska et al. (2020), Compton et al. (2020), Salem et al. (2022)].
Crucially, ChatGPT does not need access to the data for each variable. It can infer causality simply from the variable names. We use the following prompt for each cause-effect pair:
> Does changing [varA] cause a change in [varB]? Please answer in a single word: Yes or No.
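Instantiating this template and interpreting the reply can be done with two small helpers. This is a minimal sketch; the function names are ours and not part of the repository, and sending the prompt through a chat API is left out:

```python
def make_prompt(var_a: str, var_b: str) -> str:
    """Build the causal query for one direction of a cause-effect pair."""
    return (f"Does changing {var_a} cause a change in {var_b}? "
            "Please answer in a single word: Yes or No.")

def parse_answer(reply: str) -> bool:
    """Map ChatGPT's single-word reply to a boolean (True = 'Yes').

    Tolerates surrounding whitespace, a trailing period, and casing,
    since the model does not always answer with a bare 'Yes'/'No'.
    """
    return reply.strip().rstrip(".").lower() == "yes"
```

For each pair we would query both directions, `make_prompt(A, B)` and `make_prompt(B, A)`, and compare the parsed answers against the benchmark's ground truth.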
We adopt the following protocol: for each pair, we query ChatGPT in both directions (Does A cause B? and Does B cause A?) and record whether each answer matches the ground truth.
This repository contains four files:

- results.txt: A CSV file containing the results for each cause-effect pair. The first two columns record the result of "Does A cause B?" and "Does B cause A?", respectively: 1 means ChatGPT output the correct answer and 0 means it output the incorrect answer. This file is based on the README.txt file provided by the Tuebingen benchmark.
- prompts.txt: For reproducibility, the exact prompt used for each cause-effect pair.
- pairmeta.txt: The recommended weights to use when computing the overall accuracy on the benchmark.
- compute_benchmark_accuracy.ipynb: A simple notebook that uses results.txt and pairmeta.txt to compute the overall accuracy on the benchmark.

We'll soon be updating all 108 pairs! To add a new cause-effect pair,
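The weighted accuracy computed by compute_benchmark_accuracy.ipynb can be sketched as below. The data layout is an assumption for illustration (one correctness flag per direction and one weight per pair, as recommended by the benchmark); the notebook defines the actual parsing of results.txt and pairmeta.txt:

```python
def weighted_accuracy(results, weights):
    """Weight-averaged fraction of correct answers.

    results: {pair_id: (correct_a_to_b, correct_b_to_a)} with 1/0 entries.
    weights: {pair_id: float}, the per-pair weights from pairmeta.txt.
    Each pair's score averages the two query directions (an assumed
    convention for this sketch).
    """
    total = sum(weights[p] for p in results)
    score = sum(weights[p] * (ab + ba) / 2 for p, (ab, ba) in results.items())
    return score / total
```

For example, one pair correct in both directions and another correct in only one, with equal weights, gives an accuracy of 0.75.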
WARNING: ChatGPT is a large language model and provides no guarantee of returning the correct causal direction. Answers from ChatGPT or this repo should not be treated as established causal relationships; we provide these results only for the purpose of exploratory research. In practice, we expect that domain experts will need to verify such results before using the inferred causal relationships in any downstream application.