delphi-suite / delphi

small language models training made easy
Apache License 2.0

`scripts/tokenize_dataset.py` is using too much memory #117

Closed jettjaniak closed 2 months ago

jettjaniak commented 2 months ago

When I tokenize the training split of delphi-suite/stories, I end up with more than 20 GB written to swap. I left some `# FIXME`s in `tokenize_and_upload_split`.