DRAGNLabs / 301r_retnet


Improved data pipeline #39

Closed JayOrten closed 3 months ago

JayOrten commented 3 months ago

Sorry this is so many changes; it's hard to change something like this without impacting every part of the pipeline.

This PR effectively replaces all usage of the HuggingFace Datasets library in our pipeline with Dask DataFrames. Learn more about Dask here. In summary, Dask allows us to process data efficiently, speeding up all of the preprocessing steps. It also allows us to load data lazily during training, which keeps memory usage low.

I want to note that using Dask isn't the best or ideal way to do these tasks; it's just a way that works. Dask is actually less efficient for smaller datasets that fit in memory. The best way to process data depends heavily on the data itself, and we should keep this in mind as we develop the code and train models.

This PR implements the following: