alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript
MIT License
528 stars · 20 forks

"vocab.load_multiprocess_safe" doesn't work while multi-processing. #33

Closed ElleLeonne closed 4 months ago

ElleLeonne commented 4 months ago

I have two datasets, a train_set and an eval_set.

When a single tokenizer instance, loaded with vocab.load_multiprocess_safe, is passed to both datasets, the tokenizer simply refuses to function, regardless of whether it is frozen or whether the datasets are active at the same time.

I am able to work around the issue by using vocab.load instead, but then I get warnings about multiprocessing, so I need to debug further by passing a separate tokenizer instance to each dataset. This is not ideal, but it is at least functional.

Simply as an FYI. I appreciate the work you've done so far, it's always nice to see independent coders and researchers doing cool things.

ElleLeonne commented 4 months ago

Resolved. TokenMonster expects to be given a list to tokenize, since it handles the batching internally. Calling tokenize on individual items inside a list comprehension causes it to freeze.

Simply load the vocabulary with vocab.load_multiprocess_safe and call vocab.tokenize(list_of_texts) on the full batch to resolve the issue.
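The fix above can be sketched as follows. This is a minimal illustration, not a definitive TokenMonster usage guide: the vocabulary name is a placeholder, and a stand-in class is used so the pattern can be demonstrated even when the tokenmonster package (or a vocabulary file) is unavailable.

```python
try:
    import tokenmonster
    # Assumed vocabulary name; substitute your own .vocab file or prebuilt vocab.
    vocab = tokenmonster.load_multiprocess_safe("englishcode-32000-consistent-v1")
except Exception:
    # Stand-in so the pattern itself is runnable without the package installed.
    class _StubVocab:
        def tokenize(self, texts):
            # Mimics the batch interface: one token sequence per input text.
            return [t.split() for t in texts]
    vocab = _StubVocab()

texts = ["hello world", "ungreedy subword tokenization"]

# Correct: pass the whole list; the tokenizer handles the batching itself.
batch_tokens = vocab.tokenize(texts)

# Anti-pattern (reported to freeze under multiprocessing):
# per_item = [vocab.tokenize([t]) for t in texts]

print(len(batch_tokens))  # one result per input text
```

The key point is that the batching happens inside the library, so the caller should hand over the full list rather than looping over it in Python.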