Intro Text for Data Duplication Page

Replace

Data deduplication is an important preprocessing step where duplicated documents, or chunks within a document, are removed from the dataset. Removing duplicates can reduce the likelihood of memorizing undesirable pieces of information such as boilerplate text, copyrighted data, and personally identifiable information. Additionally, removing duplicated data improves training efficiency by reducing the total dataset size. Practitioners should always determine whether duplicated data will harm or help the model for their use case.

With

Removing data duplicates can 1) reduce the likelihood of memorizing undesirable pieces of information such as boilerplate text, copyrighted data, and personally identifiable information, 2) improves training efficiency by reducing the total dataset size. Practitioners should always determine whether duplicated data will harm or help the model for their use case.

allenai / fm-cheatsheet

Intro Text for Data Duplication Page #19