ariG23498 opened 4 days ago
Check out the https://github.com/ayulockin/llm_scratch/tree/main/src/llm/datasets directory, especially the datasets.py and utils.py files. The download logic lives in utils.py. The datasets.py file contains a class that downloads the WMT14 dataset and, for now, returns raw strings. It can be wrapped in a Dataloader class.
I was thinking of incorporating streaming into the data-loading process -- https://huggingface.co/docs/datasets/en/stream
WDYT?
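For reference, a minimal sketch of what streaming could look like with the Hugging Face `datasets` library (the hub id "wmt14" with config "de-en" is an assumption; adjust to whichever language pair we settle on):

```python
# Sketch: streaming WMT14 with Hugging Face `datasets`.
# With streaming=True, load_dataset returns an IterableDataset:
# nothing is downloaded up front, examples are fetched lazily as you iterate.
from itertools import islice

def take(stream, n):
    """Materialize the first n examples from a (possibly very large) stream."""
    return list(islice(iter(stream), n))

if __name__ == "__main__":
    from datasets import load_dataset

    ds = load_dataset("wmt14", "de-en", split="train", streaming=True)
    for ex in take(ds, 3):
        # WMT examples come as {"translation": {"de": ..., "en": ...}}
        print(ex["translation"]["en"])
```

This would also pair naturally with the benchmarking idea below, since streamed and fully downloaded pipelines can be timed over the same first-N examples.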
I will work with the PyTorch DataLoader system to build a data pipeline for an encoder-decoder setup. I also think it would be a good idea to incorporate datasets so that we can easily download and upload (or maybe even stream) the data and push it into a pipeline. An interesting setup later would be to time and benchmark the pipeline with different simple tricks. Also, what tests should we hold ourselves accountable to when we create the data pipeline?
Note: I would be happy to take this on.
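To make the DataLoader idea concrete, here is a hedged sketch of wrapping raw (src, tgt) string pairs for an encoder-decoder batch. The pair format and the `collate_raw` helper are illustrative assumptions, not the repo's actual API:

```python
# Sketch: batching raw translation pairs with a PyTorch DataLoader.
# `collate_raw` is a hypothetical collate_fn; tokenization and padding
# into encoder/decoder tensors would happen here (or in a later stage).

def collate_raw(batch):
    """Split a list of (src, tgt) string pairs into parallel lists."""
    src, tgt = zip(*batch)
    return {"src": list(src), "tgt": list(tgt)}

if __name__ == "__main__":
    from torch.utils.data import DataLoader

    pairs = [("ein Haus", "a house"), ("zwei Hunde", "two dogs")]
    loader = DataLoader(pairs, batch_size=2, collate_fn=collate_raw)
    for batch in loader:
        print(batch["src"], batch["tgt"])
```

A custom collate_fn is also a natural seam for the tests mentioned above: batch shape, source/target alignment, and determinism under a fixed seed can all be asserted at this boundary.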