Erin M. Buchanan, Simon De Deyne, & Maria Montefinese
Abstract: Semantic property listing tasks require participants to generate short propositions (e.g., \<barks>, \<has fur>) for a specific concept (e.g., dog). This task is the cornerstone of the creation of semantic property norms, which are essential for modelling, stimuli creation, and understanding similarity between concepts. However, despite the wide applicability of semantic property norms for a large variety of concepts across different groups of people, the methodological aspects of the property listing task have received less attention, even though the procedure and processing of the data can substantially affect the nature and quality of the measures derived from them. The goal of this paper is to provide a practical primer on how to collect and process semantic property norms. We will discuss the key methods to elicit semantic properties and compare different methods to derive meaningful representations from them. This will cover the role of instructions and test context, property pre-processing (e.g., lemmatization), property weighting, and relationship encoding using ontologies. With these choices in mind, we propose and demonstrate a processing pipeline that transparently documents these steps, resulting in improved comparability across different studies. The impact of these choices will be demonstrated using intrinsic measures (e.g., reliability, number of properties) and extrinsic measures (e.g., categorization, semantic similarity, lexical processing). This practical primer will offer potential solutions to several longstanding problems and allow researchers to develop new property listing norms, overcoming the constraints of previous studies.
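As a quick orientation for repository visitors, here is a minimal sketch in R of one of the steps mentioned in the abstract: weighting properties by production frequency (how many participants listed a property for a concept). The data frame and its column names (participant, concept, property) are illustrative assumptions, not the variable names used in the repository's actual data files or scripts.

```r
# Hypothetical long-format responses: one row per participant x concept x
# (already normalized) property.
responses <- data.frame(
  participant = c("p1", "p1", "p2", "p2", "p3"),
  concept     = c("dog", "dog", "dog", "dog", "dog"),
  property    = c("bark", "fur", "bark", "pet", "bark")
)

# Production frequency: number of distinct participants listing each
# property for each concept -- one common way to weight properties.
freq <- aggregate(participant ~ concept + property, data = responses,
                  FUN = function(x) length(unique(x)))
names(freq)[names(freq) == "participant"] <- "production_frequency"

# Proportion of the concept's sample producing each property,
# an alternative weighting that adjusts for sample size.
n_per_concept <- tapply(responses$participant, responses$concept,
                        function(x) length(unique(x)))
freq$proportion <- freq$production_frequency / n_per_concept[freq$concept]
freq
```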
Docs: Folder contains drafts of the manuscript and comments on those versions.
Manuscript: Folder contains all information necessary to create the PDF/Docx version of the manuscript. Scripts are written inline with the text.
Output_data: Data created by the scripts used in the processing pipeline.
Packrat: A compiled backup of the packages used in the manuscript and processing pipeline for reproducibility purposes.
R: R scripts detailed in the manuscript for individual use in the processing pipeline steps.
Raw_data: Data used to demonstrate the processing pipeline and the convergence with other similar projects.
Update: If you have issues with TreeTagger, please check out our discussion on udpipe here.
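As a rough illustration of how udpipe could stand in for TreeTagger when lemmatizing property responses, the sketch below annotates a few hypothetical responses. It is only a starting point under these assumptions, not the pipeline script in the R folder.

```r
library(udpipe)

# Download and load an English model (one-time download; a path is returned).
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(file = ud_model$file_model)

# Hypothetical property responses for the concept "dog".
properties <- c("barks loudly", "has fur", "is a pet")

# Annotate: tokenization, POS tagging, and lemmatization.
annotated <- as.data.frame(udpipe_annotate(ud_model, x = properties))

# Token-to-lemma mapping that can feed later property normalization steps.
annotated[, c("doc_id", "token", "lemma", "upos")]
```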