HomebrewNLP / Olmax

HomebrewNLP in JAX flavour for maintainable TPU-Training
BSD 2-Clause "Simplified" License

"Resume" option for tokenizers #23

Open ClashLuke opened 2 years ago

ClashLuke commented 2 years ago

Currently, our tokenisers are long-running tasks that cannot be interrupted. If the process is stopped for even a minute (for example, because GPU or CPU resources are needed elsewhere), the tokenisation has to be restarted from scratch. Instead of requiring a process that can take multiple weeks to run in one go, we should implement an option to "resume" from an earlier checkpoint. This could be done by, for example, skipping the documents or videos that have already been processed. This issue tracks the progress of implementing such a scheme.
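To make the "skip what was already processed" idea concrete, here is a minimal sketch in plain Python. It is not taken from the Olmax code base: `tokenize_with_resume`, `load_checkpoint`, `save_checkpoint`, the JSON checkpoint layout and the `every` parameter are all hypothetical names chosen for illustration. It assumes the documents arrive in the same order on every run and that the tokeniser returns `bytes` that are appended to a single output file.

```python
import itertools
import json
import os


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"documents_done": 0, "output_bytes": 0}


def save_checkpoint(path, state):
    """Write to a temporary file and rename it, so a crash cannot leave a half-written checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def tokenize_with_resume(documents, tokenize, output_path, checkpoint_path, every=1000):
    """Tokenise `documents` in order, checkpointing every `every` documents (hypothetical helper)."""
    state = load_checkpoint(checkpoint_path)
    mode = "r+b" if os.path.exists(output_path) else "wb"
    with open(output_path, mode) as out:
        # Discard output written after the last checkpoint, so nothing is duplicated.
        out.truncate(state["output_bytes"])
        out.seek(state["output_bytes"])
        # Skip the documents that were already tokenised in an earlier run.
        # They are still read from the iterable, but not tokenised again.
        remaining = itertools.islice(documents, state["documents_done"], None)
        for doc in remaining:
            out.write(tokenize(doc))
            state["documents_done"] += 1
            if state["documents_done"] % every == 0:
                out.flush()
                state["output_bytes"] = out.tell()
                save_checkpoint(checkpoint_path, state)
        # Final checkpoint once the input is exhausted.
        out.flush()
        state["output_bytes"] = out.tell()
        save_checkpoint(checkpoint_path, state)
    return state
```

With this layout, an interrupted run loses at most `every` documents of work. The skipped documents still have to be read, so for very large datasets a real implementation would probably also want to store a file or shard offset rather than only a document count.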