[new resource]: Corpus Analysis with spaCy

Title of the resource

Corpus Analysis with spaCy

Resource type

External Resource

Authors, editors and contributors

Megan S. Kane, Maria Antoniak, William Mattingly, John R. Ladd

Topics (keywords)

DH, Open Education, Open Access, data manipulation, distant reading, python

Learning outcomes

After completing this lesson, you will be able to:

- Upload a corpus of texts to a platform for Python analysis (using Google Colaboratory)
- Use spaCy to enrich the corpus through tokenization, lemmatization, part-of-speech tagging, dependency parsing and chunking, and named entity recognition
- Conduct frequency analyses using part-of-speech tags and named entities
- Download an enriched dataset for use in future NLP analyses
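
The frequency-analysis outcome above can be illustrated with a minimal sketch. It is not taken from the lesson itself: the `annotated` data and the `tag_counts` helper are hypothetical, and the annotations are written by hand to stand in for what spaCy would provide through `token.pos_` and `ent.label_`. Only the Python standard library is used, so the sketch runs without installing spaCy or a language model.

```python
from collections import Counter

# Hypothetical, hand-written annotations standing in for spaCy output.
# Each text maps to (token, coarse POS tag, entity label or "") triples.
annotated = {
    "text_a": [
        ("Darwin", "PROPN", "PERSON"),
        ("studied", "VERB", ""),
        ("finches", "NOUN", ""),
        ("in", "ADP", ""),
        ("Ecuador", "PROPN", "GPE"),
    ],
    "text_b": [
        ("The", "DET", ""),
        ("finches", "NOUN", ""),
        ("adapted", "VERB", ""),
        ("quickly", "ADV", ""),
    ],
}


def tag_counts(rows):
    """Count POS tags and non-empty entity labels for one annotated text."""
    pos = Counter(tag for _, tag, _ in rows)
    ents = Counter(label for _, _, label in rows if label)
    return pos, ents


# Aggregate the per-text counts into corpus-level totals.
corpus_pos = Counter()
corpus_ents = Counter()
for name, rows in annotated.items():
    pos, ents = tag_counts(rows)
    corpus_pos.update(pos)
    corpus_ents.update(ents)

print(corpus_pos.most_common(3))
print(dict(corpus_ents))
```

Keeping the counts per text before aggregating makes it straightforward to compare individual texts as well as the corpus as a whole.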

Abstract

This lesson demonstrates how to use the Python library spaCy to analyze large collections of texts. It details the process of enriching a corpus via lemmatization, part-of-speech tagging, dependency parsing, and named entity recognition. Readers will learn how the linguistic annotations spaCy produces can be analyzed to explore meaningful trends in language patterns across a set of texts.

DARIAH-ERIC / dariah-campus