delphi-suite / delphi

small language models training made easy
Apache License 2.0

tokenizer training script #103

Closed jettjaniak closed 2 months ago

jettjaniak commented 3 months ago

Looks like you need to set up black; CI is failing on the formatting check.

jettjaniak commented 3 months ago

Also, could you think about some localized unit tests? For example, we use a pre-defined string as the training text and check that the resulting tokenizer has the expected vocab and tokenizes text the way we expect.
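
A rough sketch of what such a test could look like. The PR's actual training function isn't shown in this thread, so `train_bpe`, `apply_merge`, and `tokenize` below are hypothetical stand-ins (a toy character-level BPE with a deterministic tie-break), not the script's real API. The point is only the shape of the test: a fixed training string, a fixed merge budget, and exact assertions on the vocab and on the tokenization of a known input.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Toy BPE trainer (hypothetical stand-in for the real training script)."""
    words = [list(w) for w in text.split()]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Deterministic tie-break: highest count first, then smallest pair.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        words = [apply_merge(w, best) for w in words]
    base_chars = set("".join(text.split()))
    vocab = sorted(base_chars | {a + b for a, b in merges})
    return vocab, merges


def tokenize(word, merges):
    """Tokenize a single word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


TRAIN_TEXT = "aa aa ab"  # small enough to work out the expected result by hand


def test_tokenizer_on_fixed_text():
    vocab, merges = train_bpe(TRAIN_TEXT, num_merges=1)
    assert merges == [("a", "a")]
    assert vocab == ["a", "aa", "b"]
    assert tokenize("aab", merges) == ["aa", "b"]
```

With the real script, the same test would call the actual training entry point instead of `train_bpe`. Keeping the training text tiny means the expected vocab can be verified by hand, and a deterministic tie-break keeps the test from being flaky.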