alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript
MIT License
528 stars · 20 forks

"vocab.load_multiprocess_safe" doesn't work while multi-processing. #33

Closed ElleLeonne closed 4 months ago

ElleLeonne commented 4 months ago

I have two datasets, a train_set and an eval_set.

When a single tokenizer instance, loaded with vocab.load_multiprocess_safe, is passed to both datasets, the tokenizer simply refuses to function, regardless of whether it is frozen or whether the datasets are active at the same time.

I am able to work around the issue by using vocab.load instead, but then I get warnings about multiprocessing, so I need to debug further by passing a separate tokenizer instance to each dataset. This is not ideal, but it is at least functional.

Simply as an FYI. I appreciate the work you've done so far, it's always nice to see independent coders and researchers doing cool things.

ElleLeonne commented 4 months ago

Resolved. TokenMonster expects to be given a list to tokenize, since it handles the batching internally. Calling tokenize on individual items inside a list comprehension causes it to freeze.

Simply load the vocabulary with vocab.load_multiprocess_safe and call vocab.tokenize(list_of_texts) on the full batch to resolve the issue.
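The fix above can be sketched as follows. This is a minimal illustration, not a definitive TokenMonster usage guide: the vocabulary name is a placeholder, and a stand-in class is used so the pattern can be demonstrated even when the tokenmonster package (or a vocabulary file) is unavailable.

```python
try:
    import tokenmonster
    # Assumed vocabulary name; substitute your own .vocab file or prebuilt vocab.
    vocab = tokenmonster.load_multiprocess_safe("englishcode-32000-consistent-v1")
except Exception:
    # Stand-in so the pattern itself is runnable without the package installed.
    class _StubVocab:
        def tokenize(self, texts):
            # Mimics the batch interface: one token sequence per input text.
            return [t.split() for t in texts]
    vocab = _StubVocab()

texts = ["hello world", "ungreedy subword tokenization"]

# Correct: pass the whole list; the tokenizer handles the batching itself.
batch_tokens = vocab.tokenize(texts)

# Anti-pattern (reported to freeze under multiprocessing):
# per_item = [vocab.tokenize([t]) for t in texts]

print(len(batch_tokens))  # one result per input text
```

The key point is that the batching happens inside the library, so the caller should hand over the full list rather than looping over it in Python.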