Pretraining data consists of thousands, or even millions, of individual documents, often web scraped. Model knowledge and behavior will likely reflect a compression of this information and its communication qualities. It's important to carefully select the data composition. This decision should reflect choices in language coverage, mix of sources, and preprocessing decisions.
with:
Pretraining data is the fundamental ingredient of foundation models, shaping both their capabilities and their flaws. Corpora consist of millions of pieces of content, such as documents, images, videos, or speech recordings, often scraped from the web. It is important to select the data composition carefully; it should reflect deliberate choices about language coverage, the mixture of sources, and preprocessing.