As in the title: I load the tokenizer with `load_multiprocess_safe`, and the dataset is just a bunch of plain text files that get loaded and tokenized. I have tested each stage of loading and there are no problems until I wrap the dataset in a `DataLoader` and set `num_workers > 0`. Then it hangs forever.
OS: Ubuntu 22.04
Python version: 3.11.8
PyTorch version: 2.2.1
Tokenmonster package version: 1.1.12
Other libraries: `lightning==2.2.1`, `datasets==2.18.0`
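For reference, the setup described above can be sketched roughly as below. This is only a skeleton under stated assumptions, not the actual code from the report: `PlainTextDataset`, `WhitespaceTokenizer` and `make_demo_dataset` are hypothetical names, the stand-in tokenizer replaces the real one so the sketch runs without tokenmonster, and creating the tokenizer lazily on first access is a choice made for this sketch rather than something the report states.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Sequence


class WhitespaceTokenizer:
    """Hypothetical stand-in tokenizer so the sketch runs without tokenmonster."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()


class PlainTextDataset:
    """Map-style dataset: one plain text file per item, tokenized on access.

    `load_tokenizer` is any zero-argument callable returning an object with a
    `tokenize(text)` method. In the reported setup this would presumably wrap
    the `load_multiprocess_safe` call (an assumption, not shown here).
    """

    def __init__(self, paths: Sequence[Path], load_tokenizer: Callable[[], object]):
        self.paths = list(paths)
        self._load_tokenizer = load_tokenizer
        self._tokenizer = None  # created on first __getitem__ in each process

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, index: int):
        if self._tokenizer is None:
            self._tokenizer = self._load_tokenizer()
        text = self.paths[index].read_text(encoding="utf-8")
        return self._tokenizer.tokenize(text)


def make_demo_dataset() -> PlainTextDataset:
    """Write two tiny text files to a temp directory and wrap them in a dataset."""
    tmp = Path(tempfile.mkdtemp())
    for i, text in enumerate(["hello world", "a b c"]):
        (tmp / f"{i}.txt").write_text(text, encoding="utf-8")
    return PlainTextDataset(sorted(tmp.glob("*.txt")), WhitespaceTokenizer)
```

To mirror the report, the same object would be handed to a `DataLoader` with `num_workers > 0` and a loader callable built around `load_multiprocess_safe`. Whether loading the tokenizer inside each worker instead of in the main process changes the hang is untested here.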