alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
MIT License

Hangs with PyTorch data loaders when `num_workers > 0` #34

Open ntoxeg opened 3 months ago

ntoxeg commented 3 months ago

- OS: Ubuntu 22.04
- Python version: 3.11.8
- PyTorch version: 2.2.1
- Tokenmonster package version: 1.1.12
- Other libraries: lightning==2.2.1, datasets==2.18.0

As in the title: I load the tokenizer with `load_multiprocess_safe`, and the dataset is just a set of plain text files to load and tokenize. I have tested each stage of loading and there are no problems until I wrap the dataset in a `DataLoader` with `num_workers > 0`; then it hangs forever.
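For anyone trying to narrow this down, here is a minimal sketch of the setup described above, arranged so that the tokenizer is loaded lazily inside whichever process first touches it rather than in the parent. This is a diagnostic aid under stated assumptions, not a confirmed fix: the cause of the hang is not established in this report. The class name `LazyTokenizedTextFiles`, the `load_tokenizer` callable, and the `"<vocabulary>"` placeholder are hypothetical; the only thing assumed of the tokenizer is a `tokenize(text)` method.

```python
import os
from pathlib import Path


class LazyTokenizedTextFiles:
    """Map-style dataset sketch: one plain-text file per item, tokenized on access.

    Hypothetical helper for illustration. It only relies on __len__ and
    __getitem__, which is what a map-style PyTorch DataLoader needs.
    """

    def __init__(self, paths, load_tokenizer):
        self.paths = [Path(p) for p in paths]
        # Zero-argument callable returning an object with .tokenize(text).
        # It must be picklable if workers are started with "spawn".
        self._load_tokenizer = load_tokenizer
        self._tokenizer = None
        self._owner_pid = None

    def _get_tokenizer(self):
        # Load (or reload) the tokenizer the first time each process needs it,
        # so a forked worker never reuses an instance created in the parent.
        pid = os.getpid()
        if self._tokenizer is None or self._owner_pid != pid:
            self._tokenizer = self._load_tokenizer()
            self._owner_pid = pid
        return self._tokenizer

    def __getstate__(self):
        # Never ship a live tokenizer to another process when pickled.
        state = self.__dict__.copy()
        state["_tokenizer"] = None
        state["_owner_pid"] = None
        return state

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        text = self.paths[index].read_text(encoding="utf-8")
        return list(self._get_tokenizer().tokenize(text))


# Hypothetical usage (needs torch and tokenmonster installed; "<vocabulary>"
# is a placeholder, not a real vocabulary name):
#   import functools, tokenmonster
#   from torch.utils.data import DataLoader
#   loader_fn = functools.partial(tokenmonster.load_multiprocess_safe, "<vocabulary>")
#   ds = LazyTokenizedTextFiles(text_file_paths, loader_fn)
#   loader = DataLoader(ds, batch_size=None, num_workers=2)
```

If the hang disappears with this arrangement, that would suggest the problem is tied to tokenizer state created in the parent process before the workers start; if it persists, the tokenizer load itself inside a worker is the more likely place to look.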