danbraunai / simple_stories_train

Trains small LMs. Designed for training on SimpleStories.

Allow for training on custom dataset and tokenizer #8

Open danbraunai opened 2 months ago

danbraunai commented 2 months ago

Currently the script hardcodes loading the gpt2 tokenizer and loads the dataset from file. We'll want to allow loading different tokenizers and datasets from Hugging Face.

In general I think we'll need to support two cases (sketched below):

  1. The dataset and tokenizer will be hosted on Hugging Face.
  2. The pre-tokenized dataset will be hosted on Hugging Face (so we don't have to tokenize it on the fly every time we train).

I think we can just get away with using Hugging Face's `load_dataset` with `streaming=True`. An example is here, which supports loading tokenized or untokenized datasets. Then we would just need to set it up to work for DDP. Not sure of the easiest way; there are probably standard setups here, maybe using a distributed sampler.

lennart-finke commented 1 week ago

Saving weights is being addressed in #13