Andrea-de-Varda / prediction-resource

Scripts and data for the cloze probability responses and predictability ratings for 205 sentences (1,726 words)

Cloze probability, ratings, and computational predictability estimates

Scripts and data for cloze probability responses, predictability ratings and Transformer-based surprisal estimates for 205 sentences (1,726 words) from the UCL reading corpus.

A detailed description of the dataset can be found in the paper:

de Varda, A. G., Marelli, M., & Amenta, S. (2023). Cloze probability, predictability ratings, and computational estimates for 205 English sentences, aligned with existing EEG and reading time data. Behavior Research Methods, 1-24.

The resource we release is aligned with:

Our dataset

Our dataset of cloze probability and predictability ratings is in the file ratings_and_cloze.csv; it was built from the item set item-set.csv of the UCL reading corpus (Frank et al. 2013). This dataset is merged with the behavioral and neural measures described above in the dataframe all_measures.csv. The raw data (Prolific exports) can be found in the folders cp (cloze probability) and ratings.

We also release the cloze distributions (i.e., the probabilities assigned not only to the target words, but to all the words that were produced in the cloze task). They can be found in the cloze_distribution folder, in both .txt and .pkl format.
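
To illustrate what a cloze distribution contains, here is a minimal sketch of how one can be computed from the raw responses to a single cloze item: each produced word is mapped to the proportion of participants who produced it, and the cloze probability of the target is its entry in that distribution (0 if nobody produced it). The function names, the example responses, and the lowercasing and whitespace stripping are assumptions made for illustration; they are not taken from the scripts in this repository, and the released files may use different preprocessing.

```python
from collections import Counter


def cloze_distribution(responses):
    """Map each produced word to its relative frequency among the responses.

    Normalization (strip + lowercase) is an illustrative assumption.
    """
    cleaned = [r.strip().lower() for r in responses if r.strip()]
    counts = Counter(cleaned)
    total = len(cleaned)
    # With no valid responses, counts is empty and an empty dict is returned.
    return {word: n / total for word, n in counts.items()}


def cloze_probability(responses, target):
    """Proportion of responses that match the target word (0.0 if none do)."""
    return cloze_distribution(responses).get(target.strip().lower(), 0.0)


# Hypothetical responses from five participants for one sentence context.
responses = ["dog", "dog", "cat", "Dog", "puppy"]
distribution = cloze_distribution(responses)
target_probability = cloze_probability(responses, "dog")
```

In this toy example three of the five responses match the target, so the target receives the largest share of the distribution, while the remaining probability mass is split across the other produced words.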

:heavy_exclamation_mark: Important note: If you use the neural and behavioral data, or the older probabilistic estimates (RNN, PSG, N-grams), please cite:

The code

The code for our analyses is divided into four scripts:

Supplementary materials

In the folder supplementary_materials you can find the complete results of the analyses we reported in our paper in a more searchable csv format.

Contact :envelope:

If you have any trouble with the resource, please do not hesitate to contact me at a.devarda@campus.unimib.it