allenai / fm-cheatsheet

Website for hosting the Open Foundation Models Cheat Sheet.
https://fmcheatsheet.org

Intro Text for Data Duplication Page #19

Closed by danmcduff 3 days ago

danmcduff commented 1 month ago

Replace

Data deduplication is an important preprocessing step where duplicated documents, or chunks within a document, are removed from the dataset. Removing duplicates can reduce the likelihood of memorizing undesirable pieces of information such as boilerplate text, copyrighted data, and personally identifiable information. Additionally, removing duplicated data improves training efficiency by reducing the total dataset size. Practitioners should always determine whether duplicated data will harm or help the model for their use case.

With

Removing duplicate data can 1) reduce the likelihood of memorizing undesirable pieces of information such as boilerplate text, copyrighted data, and personally identifiable information, and 2) improve training efficiency by reducing the total dataset size. Practitioners should always determine whether duplicated data will harm or help the model for their use case.
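
For readers of the page, the document-level case described in the text could be illustrated with something like the sketch below. It is only a minimal example of exact deduplication by hashing, not the method used by any tool listed on the cheat sheet. The names `normalize` and `deduplicate` are hypothetical, and the normalization (lowercasing, collapsing whitespace) is an assumption. Near-duplicate and chunk-level deduplication would need a different technique.

```python
import hashlib


def normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace,
    # so trivially different copies hash to the same value.
    return " ".join(text.lower().split())


def deduplicate(documents):
    # Keep the first occurrence of each document, comparing by the
    # SHA-256 digest of its normalized text (exact-match deduplication).
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = [
    "Terms of service apply.",
    "terms  of service apply.",  # duplicate after normalization
    "A unique document.",
]
unique_docs = deduplicate(docs)
```

Whether the dropped copies should be removed at all depends on the use case, as the text says. The sketch only shows the mechanics.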

neural-loop commented 1 month ago

https://onm-demo.aimodels.org/foundation-model-resources/data-deduplication/