allenai / fm-cheatsheet

Website for hosting the Open Foundation Models Cheat Sheet.

Intro Text for Data Cleaning Page #18

Closed danmcduff closed 3 days ago

danmcduff commented 1 month ago


Data cleaning and filtering are crucial steps in curating a dataset. They remove unwanted data, which improves training efficiency and helps ensure desirable properties such as high information content, coverage of the desired languages, low toxicity, and minimal personally identifiable information. Every filter involves trade-offs, so weigh what each one removes against what it preserves, and consider how data sources are mixed when preparing the final dataset.


Data quality is crucial. Filtering removes unwanted data, which improves training efficiency and helps ensure desirable properties such as high information content, coverage of the desired languages, low toxicity, and minimal personally identifiable information. Consider the trade-offs each filter introduces, and understand the importance of the data mixture.
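
As a rough illustration of the filtering described above, here is a minimal sketch and not a recommended pipeline. Everything in it is an assumption made for the example: the `MIN_WORDS` threshold, the placeholder `BLOCKLIST`, the email-only `EMAIL_RE` pattern, and the function names are all hypothetical. Real pipelines typically rely on trained classifiers and language-identification models rather than toy rules like these. The retained fraction is returned so the trade-off of each filter (how much data it discards) stays visible.

```python
import re

# Hypothetical heuristics for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BLOCKLIST = {"badword"}  # placeholder standing in for a toxicity filter
MIN_WORDS = 5            # crude proxy for "high information content"


def redact_pii(text):
    """Replace email addresses with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)


def keep_document(text):
    """Return True if the document passes the length and blocklist checks."""
    words = text.lower().split()
    if len(words) < MIN_WORDS:
        return False
    if any(w.strip(".,!?") in BLOCKLIST for w in words):
        return False
    return True


def clean_corpus(docs):
    """Filter documents, redact emails, and report the fraction retained."""
    kept = [redact_pii(d) for d in docs if keep_document(d)]
    retained = len(kept) / len(docs) if docs else 0.0
    return kept, retained


docs = [
    "Contact me at jane@example.com for the full dataset details.",
    "too short",
    "This sentence contains badword and should be dropped entirely.",
]
kept, retained = clean_corpus(docs)
print(kept, retained)
```

Tracking the retained fraction alongside the cleaned output makes it easier to notice when a filter is more aggressive than intended.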

neural-loop commented 1 month ago