allenai / fm-cheatsheet

Website for hosting the Open Foundation Models Cheat Sheet.
https://fmcheatsheet.org

Intro Text for Pretraining Page #15

Closed by danmcduff 11 hours ago

danmcduff commented 1 month ago

Replace:

Pretraining data consists of thousands, or even millions, of individual documents, often web scraped. Model knowledge and behavior will likely reflect a compression of this information and its communication qualities. It's important to carefully select the data composition. This decision should reflect choices in language coverage, mix of sources, and preprocessing decisions.

with:

Pretraining data is the fundamental ingredient of foundation models, shaping both their capabilities and their flaws. Corpora consist of millions of pieces of content, such as documents, images, videos, or speech recordings, often scraped from the web. It is important to select the data composition carefully; it should reflect choices in language coverage, the mixture of sources, and preprocessing decisions.

neural-loop commented 1 month ago

https://onm-demo.aimodels.org/foundation-model-resources/pretraining-data-sources/