VKCOM / YouTokenToMe

Unsupervised text tokenizer focused on computational efficiency
MIT License

Tokenizing large corpus #80

Open quetz opened 4 years ago

quetz commented 4 years ago

Right now the tokenizer loads the whole corpus into memory, which becomes an issue for large files.

Is it possible to read the corpus file line by line, or split it in some other way, while still training on the corpus as a whole?

xbelonogov commented 4 years ago

No, there is no easy way to do it.

If the training data is so large that it does not fit into memory, you can most likely subsample random sentences; this should not significantly affect the quality.
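A minimal sketch of that suggestion, using reservoir sampling to pick a fixed number of random sentences from a large corpus and then training on the smaller file. The file names and sample size are placeholders, not part of the library:

```python
# Reservoir-sample SAMPLE_SIZE lines from a large corpus into a smaller file,
# then train YouTokenToMe on that file as usual.
import random

import youtokentome as yttm

SAMPLE_SIZE = 1_000_000  # number of sentences to keep (assumption)

reservoir = []
with open("full_corpus.txt", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i < SAMPLE_SIZE:
            reservoir.append(line)
        else:
            # Replace an existing element with decreasing probability,
            # so every line has an equal chance of being kept.
            j = random.randint(0, i)
            if j < SAMPLE_SIZE:
                reservoir[j] = line

with open("corpus_sample.txt", "w", encoding="utf-8") as f:
    f.writelines(reservoir)

yttm.BPE.train(data="corpus_sample.txt", vocab_size=30000, model="bpe.model")
```

The sample still has to fit in memory here, but it is bounded by `SAMPLE_SIZE` rather than by the corpus size.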

rrrepsac commented 3 years ago

Are you planning to add encoding directly from a file dataset? Currently, bpe.encode on a list of sentences takes longer than bpe.train on a file, which seems odd. bpe.train also uses less memory than bpe.encode does with the full list loaded.
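Until a file-based encode API exists, one workaround for the memory part of this is to stream the corpus and call bpe.encode on fixed-size batches. A sketch assuming a trained model file and placeholder paths and batch size:

```python
# Encode a large file in batches instead of loading the whole corpus
# into one Python list.
import youtokentome as yttm

BATCH_SIZE = 100_000  # sentences per encode() call (assumption)

bpe = yttm.BPE(model="bpe.model")


def flush(batch, fout):
    """Encode one batch to token IDs and write them as space-separated lines."""
    for ids in bpe.encode(batch, output_type=yttm.OutputType.ID):
        fout.write(" ".join(map(str, ids)) + "\n")


with open("full_corpus.txt", "r", encoding="utf-8") as fin, \
        open("encoded_corpus.txt", "w", encoding="utf-8") as fout:
    batch = []
    for line in fin:
        batch.append(line.rstrip("\n"))
        if len(batch) == BATCH_SIZE:
            flush(batch, fout)
            batch = []
    if batch:  # flush the remainder
        flush(batch, fout)
```

This keeps peak memory proportional to the batch size, though it does not address the speed gap the comment describes.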