ariG23498 opened 4 days ago
Check out the https://github.com/ayulockin/llm_scratch/tree/main/src/llm/datasets directory, especially the datasets.py and utils.py files. The download logic lives in utils.py. The datasets.py file contains a class that downloads the WMT14 dataset and, for now, returns raw strings. It can be wrapped in a Dataloader class.
I was thinking of incorporating streaming into the data-loading process -- https://huggingface.co/docs/datasets/en/stream
WDYT?
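For reference, a minimal sketch of what streaming could look like with the Hugging Face `datasets` library (the hub id "wmt14" with config "de-en" is an assumption; adjust to whichever language pair we settle on):

```python
# Sketch: streaming WMT14 with Hugging Face `datasets`.
# With streaming=True, load_dataset returns an IterableDataset:
# nothing is downloaded up front, examples are fetched lazily as you iterate.
from itertools import islice

def take(stream, n):
    """Materialize the first n examples from a (possibly very large) stream."""
    return list(islice(iter(stream), n))

if __name__ == "__main__":
    from datasets import load_dataset

    ds = load_dataset("wmt14", "de-en", split="train", streaming=True)
    for ex in take(ds, 3):
        # WMT examples come as {"translation": {"de": ..., "en": ...}}
        print(ex["translation"]["en"])
```

This would also pair naturally with the benchmarking idea below, since streamed and fully downloaded pipelines can be timed over the same first-N examples.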
I will work with the PyTorch DataLoader system to build a data pipeline for an encoder-decoder setup. I also think it would be a good idea to incorporate datasets so that we can easily download and upload (or maybe even stream) the data and push it into a pipeline. An interesting setup later would be to time and benchmark the pipeline with different simple tricks. Also, what tests should we hold ourselves accountable to when we create the data pipeline?
Note: I would be happy to take this on.
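To make the DataLoader idea concrete, here is a hedged sketch of wrapping raw (src, tgt) string pairs for an encoder-decoder batch. The pair format and the `collate_raw` helper are illustrative assumptions, not the repo's actual API:

```python
# Sketch: batching raw translation pairs with a PyTorch DataLoader.
# `collate_raw` is a hypothetical collate_fn; tokenization and padding
# into encoder/decoder tensors would happen here (or in a later stage).

def collate_raw(batch):
    """Split a list of (src, tgt) string pairs into parallel lists."""
    src, tgt = zip(*batch)
    return {"src": list(src), "tgt": list(tgt)}

if __name__ == "__main__":
    from torch.utils.data import DataLoader

    pairs = [("ein Haus", "a house"), ("zwei Hunde", "two dogs")]
    loader = DataLoader(pairs, batch_size=2, collate_fn=collate_raw)
    for batch in loader:
        print(batch["src"], batch["tgt"])
```

A custom collate_fn is also a natural seam for the tests mentioned above: batch shape, source/target alignment, and determinism under a fixed seed can all be asserted at this boundary.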