amit-sharma / chatgpt-causality-pairs

Solving the causality pairs challenge (does A cause B) with ChatGPT

The experimental setup behind the 92.5% accuracy seems overly simplified. #11

Closed: ArrogantL closed this issue 1 year ago

ArrogantL commented 1 year ago

"In the 74 pairs we have tried so far, ChatGPT obtains an accuracy of 92.5%."

In this experimental setup, may I ask if all 74 pairs of events are causally related?

If that's the case, then the experiment seems to only focus on determining the direction of causality, rather than its existence. Do you have any experimental results that simultaneously consider both the direction and existence of causality?

For example (a rough prompt sketch follows below):

  Input: event1, event2
  Output: one of the three choices:

  1. cause to effect
  2. effect to cause
  3. non-causal
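To make the proposed setup concrete, here is a minimal sketch of such a three-way prompt using the OpenAI Python client. The prompt wording, label strings, and model name are illustrative assumptions, not anything taken from this repository.

```python
# Illustrative sketch only: the prompt wording, label strings, and model
# name are assumptions, not the repository's actual setup.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

LABELS = ["cause to effect", "effect to cause", "non-causal"]

def classify_pair(event1: str, event2: str, model: str = "gpt-4") -> str:
    """Ask the model to pick one of the three relationship labels."""
    prompt = (
        f"Event 1: {event1}\n"
        f"Event 2: {event2}\n"
        "Which option best describes the relationship between Event 1 and "
        "Event 2? Reply with exactly one of: " + ", ".join(LABELS)
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()
```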
amit-sharma commented 1 year ago

Good point, we have now added those experiments. The Tübingen benchmark did not contain any pairs without a causal relationship, so we added two new datasets: a neuropathic pain dataset and an Arctic atmospheric science dataset. Here we ask ChatGPT to decide between all three options: A->B, B->A, and no effect. We find that ChatGPT and GPT-4 do well on this task too, although the accuracy is less than 92%. You can find the results here: https://arxiv.org/abs/2305.00050
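For readers who want to reproduce this kind of three-way evaluation, a rough scoring sketch follows. The CSV file name and column layout are hypothetical, and the `classify_pair` helper is the sketch from the earlier comment; the actual datasets and prompts are described in the linked paper.

```python
# Rough scoring sketch, assuming a CSV with columns event1, event2, label
# where label is one of "cause to effect", "effect to cause", "non-causal".
# The file name and column layout are hypothetical, not from the paper.
import csv

def evaluate(path: str, predict) -> float:
    """Fraction of pairs for which the predicted label matches ground truth."""
    correct = total = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            pred = predict(row["event1"], row["event2"])
            correct += int(pred == row["label"].strip().lower())
            total += 1
    return correct / total if total else 0.0

# Example usage with the classify_pair sketch above:
# accuracy = evaluate("three_way_pairs.csv", classify_pair)
```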

ArrogantL commented 1 year ago
[screenshot of the relevant experiment in the paper]

Thank you for your patient reply! I have found this part of the experiment.