allenai / fm-cheatsheet

Website for hosting the Open Foundation Models Cheat Sheet.
https://fmcheatsheet.org

Intro Text for Data Cleaning Page #18

Closed (danmcduff closed 3 days ago)

danmcduff commented 1 month ago

Replace

Data cleaning and filtering are crucial steps in curating a dataset. They remove unwanted data, improving training efficiency and ensuring desirable properties like high information content, desired languages, low toxicity, and minimal personally identifiable information. Consider trade-offs when using filters and understand the importance of data mixing in preparation.

With

Data quality is crucial. Filtering can remove unwanted data, improving training efficiency and ensuring desirable properties like high information content, desired languages, low toxicity, and minimal personally identifiable information. Consider trade-offs when using filters and understand the importance of data mixtures.

neural-loop commented 1 month ago

https://onm-demo.aimodels.org/foundation-model-resources/data-cleaning-filtering-mixing/