Closed: L0SG closed this issue 5 years ago.
Thanks for your suggestion! (`pretrain.py` is still being developed :D)
Input file format is the same as in the original Google BERT repo: a plain text file with one sentence per line, and a blank line between documents.
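For example, a minimal input file in that format looks like this (the sentences are made up purely for illustration):

```
He was a young man of twenty-five.
He had just arrived in the city that morning.

The second document starts after the blank line.
Each line holds exactly one sentence.
```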
I'm working with the Toronto Book Corpus, and my previous code worked fine on the entire corpus, but your suggestion makes sense to me since UTF-8 is the de facto standard, so I've updated my code (not fully tested yet).
Regarding reproducibility: to be honest, my code doesn't have exactly the same functionality as the original code, in favor of a simple and efficient implementation. For example, the original loads all documents at once, shuffles them, and prepares TFRecord files with duplication, which needs a lot of memory and takes a long time to pre-process. My code instead just works on file pointers and does the pre-processing at training time; I don't think this adds much overhead during training. In any case, my goal is for the code to reach the same performance after fine-tuning, not to reproduce the original code exactly.
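For anyone curious what "working on file pointers" amounts to in practice, here is a rough sketch of the idea, based only on the calls quoted later in this thread; it is an illustration, not the exact code in `pretrain.py`:

```python
import os
from random import randint

def seek_random_offset(f):
    """Jump the file pointer to a random byte offset, then throw away the
    (likely truncated) line so the next readline() starts on a line boundary."""
    max_offset = os.fstat(f.fileno()).st_size
    f.seek(randint(0, max_offset), 0)  # random position anywhere in the file
    f.readline()                       # ditch the incomplete sequence

# Sampling a sentence without loading the whole corpus into memory:
# f = open("corpus.txt", "r", encoding="utf-8")
# seek_random_offset(f)
# sentence = f.readline()
```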
Hi, thank you very much for the implementation!
I'm trying to compare your implementation head-to-head with the official TF BERT on the Gutenberg dataset (since the BookCorpus dataset is no longer available).
I assume that the text input file format is the same as huggingface's implementation. Is that correct? A direct clarification of the text dataset format would be great for new users.
There might be a corner case in `seek_random_offset()` when using a UTF-8 text dataset (like the above) for pre-training. When doing `f.seek(randint(0, max_offset), 0)`, if the call happens to truncate a multi-byte UTF-8 character (i.e. from `\xe2\x80\x99` into something like `\x99`), `pretrain.py` will raise a `UnicodeDecodeError` on the next read.
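A small standalone script reproducing the same kind of failure (the file contents and the offset are chosen purely to force the error):

```python
import tempfile

# Write a tiny UTF-8 file containing a multi-byte character (\u2019 encodes to \xe2\x80\x99).
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt",
                                  encoding="utf-8", delete=False)
tmp.write("it\u2019s a test\n")
tmp.close()

f = open(tmp.name, "r", encoding="utf-8")
f.seek(3, 0)   # byte 3 falls inside \xe2\x80\x99, leaving an orphaned continuation byte
f.readline()   # raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 ...
```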
The error could be mitigated by opening the file with decoding errors ignored (e.g. `open(file, 'r', errors='ignore')`) instead of the current `self.f_pos = open(file, 'r')` in `SentPairDataLoader`, but half-silently dropping some characters might itself lead to reproducibility issues (I guess the chances are minimal, since the `f.readline()` right after `f.seek(randint(0, max_offset), 0)` is there to ditch the incomplete sequence anyway).

I'd like to hear your opinions, and thanks again for the contribution!