Currently the script hardcodes the gpt2 tokenizer and loads the dataset from a local file. We'll want to allow loading different tokenizers and datasets from Hugging Face.
In general I think we'll need to support:
The dataset and tokenizer will be hosted on Hugging Face.
The pre-tokenized dataset will be hosted on Hugging Face (so we don't have to tokenize it on the fly every time we train).
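A minimal sketch of what the configurable-tokenizer side could look like, assuming we use `transformers.AutoTokenizer` to replace the hardcoded gpt2 load. The helper names and the `text` column key are placeholders, not anything in the current script:

```python
def load_tokenizer(name="gpt2"):
    """Load any Hugging Face tokenizer by name; gpt2 is just the current default."""
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(name)

def tokenize_batch(tokenizer, batch, text_key="text"):
    """On-the-fly tokenization for untokenized datasets.

    For a pre-tokenized dataset (one that already has an "input_ids"
    column) this step would be skipped entirely.
    """
    return tokenizer(batch[text_key])
```

With this shape, switching tokenizers is just a config string, and the pre-tokenized path avoids `tokenize_batch` altogether.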
I think we can just get away with using Hugging Face's load_dataset with streaming=True. An example is here, which supports loading either tokenized or untokenized datasets. Then we'd just need to set it up to work under DDP. I'm not sure of the easiest way; there are probably standard setups for this, maybe a distributed sampler, or sharding the stream by rank.
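One standard setup for the DDP part: the `datasets` library ships `split_dataset_by_node`, which shards a streaming dataset so each rank sees a disjoint slice (a distributed sampler doesn't apply cleanly to iterable datasets). A sketch, assuming rank/world size come from the env vars torchrun sets, and with the dataset name as a placeholder:

```python
import os

def ddp_rank_world_size():
    """Read rank/world size from the env vars torchrun sets (defaults for a single process)."""
    return int(os.environ.get("RANK", 0)), int(os.environ.get("WORLD_SIZE", 1))

def load_streaming_dataset(name, split="train"):
    """Stream a dataset from the Hub and shard it across DDP ranks."""
    from datasets import load_dataset
    from datasets.distributed import split_dataset_by_node
    rank, world_size = ddp_rank_world_size()
    ds = load_dataset(name, split=split, streaming=True)
    # Each rank iterates a disjoint shard of the stream, so no sampler is needed.
    return split_dataset_by_node(ds, rank=rank, world_size=world_size)
```

This works the same whether the hosted dataset is pre-tokenized or raw text; in the raw case we'd chain a `.map(...)` tokenization step onto the streamed dataset before sharding.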