Scripts and data for cloze probability responses, predictability ratings and Transformer-based surprisal estimates for 205 sentences (1,726 words) from the UCL reading corpus.
A detailed description of the dataset can be found in the paper:
de Varda, A. G., Marelli, M., & Amenta, S. (2023). Cloze probability, predictability ratings, and computational estimates for 205 English sentences, aligned with existing EEG and reading time data. Behavior Research Methods, 1-24.
The resource we release is aligned with:
Our dataset of cloze probability and predictability ratings is in the file ratings_and_cloze.csv; it is obtained from the item set item-set.csv from the UCL reading corpus (Frank et al. 2013). This dataset is merged with the behavioral and neural measures described above in the dataframe all_measures.csv. The raw data (Prolific exports) can be found in the folders cp (cloze probability) and ratings.
We also release the cloze distributions (i.e., the probabilities of all the words that were produced in the cloze task, not only of the target words). They can be found in the cloze_distribution folder, in both .txt and .pkl format.
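To make the difference between a target-word cloze probability and a full cloze distribution concrete, here is a minimal Python sketch. The responses and the function name `cloze_distribution` are invented for illustration and are not taken from the dataset; the normalization step (stripping whitespace and lowercasing) is an assumption, not necessarily the preprocessing applied in the paper.

```python
from collections import Counter


def cloze_distribution(responses):
    """Map each produced word to its relative frequency among all responses."""
    # Assumed normalization (hypothetical): strip whitespace, lowercase,
    # and drop empty responses.
    cleaned = [r.strip().lower() for r in responses if r.strip()]
    counts = Counter(cleaned)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Invented responses for a single sentence context (not real data).
responses = ["dog", "dog", "cat", "Dog", "bird"]
distribution = cloze_distribution(responses)

# The cloze probability of a target word is its share of the responses;
# a word that nobody produced gets probability 0.
target_cloze = distribution.get("dog", 0.0)
print(distribution)
print(target_cloze)
```

The full distribution keeps the probability of every produced word, so it can also be used to look up alternatives to the target word.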
:heavy_exclamation_mark: Important note: If you use the neural and behavioral data, or the older probabilistic estimates (RNN, PSG, N-grams), please cite:
The code for our analyses is divided into four scripts:

- preprocessing.py, which performs data cleaning and aggregation of results.
- merge_with_behavioural_data, which merges our measurements with the neural and behavioural indexes of processing difficulty released by Frank et al. (2013, 2015).
- get_LM_surprisal.py, which extracts surprisal values (negative log-probabilities) for the words in our dataset from Transformer-based language models released on the HuggingFace Hub.
- plot.py, which performs descriptive and inferential analyses and plots the results.

In the folder supplementary_materials you can find the complete results of the analyses we reported in our paper, in a more searchable CSV format.
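As an illustration of what a surprisal value is, the sketch below computes the negative log-probability of a word from made-up probabilities. It is not the code in get_LM_surprisal.py: the function names, the choice of logarithm base, and the convention of summing surprisals over the subword tokens of a word are all assumptions made for this example.

```python
import math


def surprisal(probability, base=math.e):
    """Return the surprisal (negative log-probability) of an event."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(probability, base)


def word_surprisal(subword_probabilities, base=math.e):
    """Sum subword surprisals, i.e. the negative log of their product.

    Summing over subword tokens is an assumed convention for words that a
    tokenizer splits into several pieces.
    """
    return sum(surprisal(p, base) for p in subword_probabilities)


# Invented conditional probabilities for a word split into two subword tokens.
subword_probs = [0.5, 0.25]
bits = word_surprisal(subword_probs, base=2)
print(bits)
```

With base 2 the result is expressed in bits; the default natural logarithm gives nats. Less probable words receive higher surprisal.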
If you have any trouble with the resource, please do not hesitate to contact me at a.devarda@campus.unimib.it