delphi-suite / delphi

small language models training made easy
Apache License 2.0

replaced sentencepiece with byte-level BPE #118

Closed: jettjaniak closed this 2 months ago

jettjaniak commented 2 months ago

https://huggingface.co/delphi-suite/stories-tokenizer is the result of running `scripts/train_tokenizer.py -i delphi-suite/stories -f story -s "train" -o delphi-suite/stories-tokenizer -v 4096 -t hf_...`, and https://huggingface.co/datasets/delphi-suite/stories-tokenized was tokenized with it (see #106 for details).
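For readers unfamiliar with byte-level BPE, here is a toy sketch of the idea in plain Python. It is not the code in `scripts/train_tokenizer.py`, whose internals are not shown here. The function names (`train_byte_level_bpe`, `merge_pair`, `encode`, `decode`) are made up for illustration. The sketch also skips the pre-tokenization and byte-to-character mapping that production byte-level BPE tokenizers typically use. What it does show is the key property: the base vocabulary is the 256 byte values, so any UTF-8 string can be encoded without an unknown token.

```python
from collections import Counter


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_byte_level_bpe(texts, vocab_size):
    """Learn BPE merges over raw UTF-8 bytes (hypothetical helper, toy scale)."""
    if vocab_size < 256:
        raise ValueError("vocab_size must be at least 256 (one id per byte value)")
    # Every text starts as a list of byte values in 0..255.
    sequences = [list(text.encode("utf-8")) for text in texts]
    merges = {}  # (left_id, right_id) -> new_id, kept in learned order
    next_id = 256
    while next_id < vocab_size:
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # nothing left that is worth merging
        merges[best_pair] = next_id
        sequences = [merge_pair(seq, best_pair, next_id) for seq in sequences]
        next_id += 1
    return merges


def encode(text, merges):
    """Encode text by replaying the merges in the order they were learned."""
    seq = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        seq = merge_pair(seq, pair, new_id)
    return seq


def decode(ids, merges):
    """Expand token ids back to bytes, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


if __name__ == "__main__":
    corpus = ["once upon a time", "once there was a time"]
    merges = train_byte_level_bpe(corpus, vocab_size=260)
    ids = encode("once upon a time ünseen", merges)
    print(ids)
    print(decode(ids, merges))
```

The `vocab_size` argument plays the same role as the `-v 4096` flag in the command above: 256 ids are reserved for single bytes and the rest are learned merges.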