quetz opened this issue 4 years ago (status: Open)
No, there is no easy way to do it.
If the training data is so large that it does not fit into memory, you can most likely subsample random sentences without significantly affecting quality.
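For reference, here is a minimal sketch of that kind of subsampling, using reservoir sampling so the full corpus never has to fit in memory. The file names and sample size are placeholders, not anything from this repo:

```python
import random

def subsample_lines(corpus_path: str, out_path: str, k: int, seed: int = 0) -> None:
    """Keep k uniformly random lines from a file of unknown size in one pass (O(k) memory)."""
    rng = random.Random(seed)
    reservoir = []
    with open(corpus_path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line)
            else:
                j = rng.randint(0, i)  # line i+1 survives with probability k / (i + 1)
                if j < k:
                    reservoir[j] = line
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(reservoir)

subsample_lines("corpus.txt", "corpus_sample.txt", k=1_000_000)
```

The subsample can then be passed to bpe.train as a regular training file.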
Are you planning to add encoding from a file dataset? Right now bpe.encode on a list takes longer than bpe.train on a file, which seems odd. And bpe.train uses less memory than bpe.encode with the full list loaded in memory.
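Until file-based encoding exists, one workaround is to stream the corpus in fixed-size batches so the full list of sentences is never held in memory. A hedged sketch, assuming `bpe` is the trained tokenizer from this thread and that `encode` accepts a list of strings; the exact signature (output types, extra options) may differ by library version:

```python
from itertools import islice

def encode_file(bpe, corpus_path: str, batch_size: int = 10_000):
    """Yield encoded sentences one at a time, reading the corpus in batches."""
    with open(corpus_path, "r", encoding="utf-8") as f:
        while True:
            batch = [line.rstrip("\n") for line in islice(f, batch_size)]
            if not batch:
                break
            # Encode one batch at a time; peak memory is bounded by batch_size.
            yield from bpe.encode(batch)

for ids in encode_file(bpe, "corpus.txt"):
    ...  # write ids to disk, feed a model, etc.
```

Peak memory then depends on batch_size rather than corpus size, at the cost of some per-batch call overhead.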
Right now the tokenizer loads the whole corpus into memory, which becomes an issue for large files.
Is it possible to read the corpus file line by line, or split it in some other way, while still training on it as a whole?