JessicaWoods03 / Furby_Hack

Building a brain for Furby

Controlling the size of a Wikipedia Dump #3

Open · JessicaWoods03 opened 4 months ago

JessicaWoods03 commented 4 months ago

Identifying the steps for controlling the size of the data going into storage

Currently the system has 4 TB of storage, so it is important to control the size of the data used for the model when sorting through the dump. Distributional-semantics language models such as Word2Vec, GloVe, and BERT are built on word embeddings, and those embeddings can be used to keep the data taken from the dump to a manageable size while it is stored and processed for the language model; a sketch of this idea follows.
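A minimal sketch of that idea, assuming gensim 4.x is installed and using its downloadable `glove-wiki-gigaword-100` vectors as the pre-trained embedding; the function names and sample text here are illustrative, not part of the current code.

```python
# Sketch: keep only tokens that a pre-trained embedding already knows.
# Assumes gensim 4.x; the model name and sample text are illustrative.
import gensim.downloader as api

def load_pretrained_vocab(model_name="glove-wiki-gigaword-100"):
    """Download a pre-trained embedding and return its vocabulary as a set."""
    vectors = api.load(model_name)        # returns KeyedVectors
    return set(vectors.key_to_index)      # the tokens the embedding can represent

def shrink_article(text, vocab):
    """Drop tokens the embedding cannot represent, shrinking what we store."""
    tokens = text.lower().split()
    return " ".join(tok for tok in tokens if tok in vocab)

if __name__ == "__main__":
    vocab = load_pretrained_vocab()
    sample = "Furby is an electronic robotic toy released in 1998 by Tiger Electronics"
    print(shrink_article(sample, vocab))
```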

The 8 implementation steps needed:

1. Focus on High-Frequency Words (see the frequency-filtering sketch after this list)
2. Leverage Pre-trained Embeddings (a sketch using gensim appears above)
3. Domain-Specific Extraction
4. Text Segmentation and Chunking (see the chunking-and-compression sketch after this list)
5. Filtering Based on Text Length and Quality (also covered in the frequency-filtering sketch)
6. Utilize External Corpora
7. Compression and Storage Optimization (see the chunking-and-compression sketch after this list)
8. Parallelization and Distributed Computing (see the multiprocessing sketch after this list)
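
For steps 1 and 5, one possible approach is a two-pass filter: count token frequencies over the whole dump, then keep only articles that are long enough and reduce them to their frequent tokens. The thresholds and the plain-text input format below are assumptions, not decisions already made in this repo.

```python
# Sketch for "Focus on High-Frequency Words" and "Filtering Based on Text
# Length and Quality". Thresholds and the plain-text input are assumptions.
from collections import Counter

MIN_ARTICLE_TOKENS = 50      # drop stubs and redirect pages
MIN_WORD_FREQUENCY = 5       # drop rare words that bloat the vocabulary

def count_word_frequencies(articles):
    """First pass: count how often every token appears across the dump."""
    counts = Counter()
    for text in articles:
        counts.update(text.lower().split())
    return counts

def filter_articles(articles, counts):
    """Second pass: keep long-enough articles, reduced to frequent tokens."""
    keep = {w for w, c in counts.items() if c >= MIN_WORD_FREQUENCY}
    for text in articles:
        tokens = [t for t in text.lower().split() if t in keep]
        if len(tokens) >= MIN_ARTICLE_TOKENS:
            yield " ".join(tokens)

if __name__ == "__main__":
    corpus = ["Furby speaks Furbish and slowly learns English phrases over time " * 10,
              "short stub"]
    freqs = count_word_frequencies(corpus)
    for article in filter_articles(corpus, freqs):
        print(article[:80], "...")
```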
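For steps 4 and 7, the cleaned text could be split into fixed-size chunks and written straight into compressed files so the 4 TB budget stretches further. This sketch uses only the standard library; the 200-token chunk size and output path are assumptions.

```python
# Sketch for "Text Segmentation and Chunking" and "Compression and Storage
# Optimization": write fixed-size chunks into a gzip file. Standard library
# only; the chunk size and output path are assumptions.
import gzip

CHUNK_TOKENS = 200  # roughly paragraph-sized pieces for the model

def chunk_text(text, size=CHUNK_TOKENS):
    """Split one article into fixed-size token chunks."""
    tokens = text.split()
    for start in range(0, len(tokens), size):
        yield " ".join(tokens[start:start + size])

def store_compressed(articles, path="wiki_chunks.txt.gz"):
    """Write one chunk per line; gzip typically shrinks plain text severalfold."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for text in articles:
            for chunk in chunk_text(text):
                out.write(chunk + "\n")

if __name__ == "__main__":
    store_compressed(["word " * 1000])   # toy article -> 5 chunks of 200 tokens
```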

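For step 8, the per-article cleaning can be fanned out across CPU cores with the standard-library multiprocessing module. The `clean_article` worker below is a placeholder for whichever filtering step is being applied, not existing project code.

```python
# Sketch for "Parallelization and Distributed Computing": fan the per-article
# cleaning out over all CPU cores. clean_article is a stand-in worker.
from multiprocessing import Pool

def clean_article(text):
    """Stand-in worker: lowercase and strip very short tokens."""
    return " ".join(t for t in text.lower().split() if len(t) > 2)

def clean_dump_parallel(articles, processes=None):
    """Process articles in parallel; processes=None uses every available core."""
    with Pool(processes=processes) as pool:
        return pool.map(clean_article, articles, chunksize=64)

if __name__ == "__main__":
    cleaned = clean_dump_parallel(["An example Wikipedia article about Furby toys"] * 4)
    print(cleaned[0])
```
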
JessicaWoods03 commented 3 months ago

In progress.