The package implements a procedure described by Fritz Kliche, Andre
Blessing, Urlich Heid and Jonathan Sonntag in the paper “The eIdentity
Text ExplorationWorkbench” presented at LREC 2014 (see ). The main
function is detect_duplicates()
.
Near duplicate detection is a standard NLP task. There is a wide range of algorithms that are used for near duplicate detection and there is a broad set of implementations in the programming languages used for NLP tasks.
In the R context, the textreuse package is the point of reference for duplicate detection. The use case for the duplicates package is large corpora that have been indexed with the Corpus Workbench (CWB). The hashing step which is a selling point for the textreuse package is performed already, and requirements for tokenizing and hashing the data are not replicated. The scenario for using the duplicates package is large, CWB-indexed corpora.