Sorry for the size of this PR; it's hard to change something like this without touching every part of the pipeline.
This PR replaces all usage of the HuggingFace Datasets library in our pipeline with Dask DataFrames. Learn more about Dask here. In summary, Dask lets us process data in parallel and out of core, which speeds up all of the preprocessing steps, and it lets us load data lazily during training, keeping memory usage low.
I want to note that Dask isn't necessarily the best or ideal tool for these tasks; it's just a working one. Dask is actually slower for smaller datasets that fit comfortably in memory. The best way to process data depends heavily on the data itself, and we should keep this in mind as we develop the code and train models.
This PR implements the following:
- Organizes the code under a `src` folder.
- Splits the "download data" portion of the pipeline into two different methods:
  - Data can be downloaded from HF programmatically through a Python script, `download_data.py`, which uses the HF file system. This works well for small datasets but does not scale.
  - Data can also be downloaded by cloning the HF repo directly. I have added a script, `download_c4.sh`, to do this specifically for C4. I have attempted to document both of these routes properly, and my top priority once this is merged is to update the README (this branch doesn't have the updated README).
- Adds an additional preprocessing script for filtering/splitting the data.
- Reworks the `tokenize_data` and `train_tokenizer` scripts to use Dask.
- Extends `dataset.py` to load the data from Dask via an `IterableDataset`.