dhlee347 / pytorchic-bert

Pytorch Implementation of Google BERT
Apache License 2.0

Pretraining data format and possible corner case of seek_random_offset() #1

Closed L0SG closed 5 years ago

L0SG commented 5 years ago

Hi, thank you very much for the implementation!

I'm trying to compare your implementation head-to-head with the official TF BERT on the Gutenberg dataset (since the BookCorpus dataset is no longer available).

  1. I assume that the text input file format is the same as huggingface's implementation. Is that correct? A direct clarification of the text dataset format would be great for new users.

  2. There might be a corner case in seek_random_offset() when using a UTF-8 text dataset (like the above) for pre-training. When doing f.seek(randint(0, max_offset), 0), if the seek happens to land in the middle of a multi-byte UTF-8 character such as ’ (i.e. \xe2\x80\x99 gets truncated into something like \x99), pretrain.py will raise an error like the following:

    File "/home/tkdrlf9202/PycharmProjects/pytorchic-bert/pretrain.py", line 88, in __iter__
    seek_random_offset(self.f_neg)
    File "/home/tkdrlf9202/PycharmProjects/pytorchic-bert/pretrain.py", line 41, in seek_random_offset
    f.readline() # throw away an incomplete sentence
    File "/home/tkdrlf9202/anaconda3/envs/p36/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 0: invalid start byte

    The error could be mitigated if we use

    self.f_pos = open(file, "r", encoding='utf-8', errors='ignore')
    self.f_neg = open(file, "r", encoding='utf-8', errors='ignore')

    instead of self.f_pos = open(file, 'r') in SentPairDataLoader, but silently dropping some characters might lead to reproducibility issues (I guess the chances are minimal, since the f.readline() right after f.seek(randint(0, max_offset), 0) is there to ditch the incomplete sentence anyway; see also the sketch below).
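A possible alternative (just a sketch, not the repo's actual code; it assumes the loader could keep a handle opened with open(file, 'rb')) is to do the random seek on a binary handle, so the truncated byte never reaches the UTF-8 decoder and no characters need to be dropped:

    import os
    from random import randint

    def seek_random_offset_binary(f_bin):
        # f_bin is assumed to be opened in binary mode: open(path, "rb").
        # Seeking into the middle of a multi-byte UTF-8 character is then harmless,
        # because nothing is decoded until we are back on a line boundary.
        max_offset = os.fstat(f_bin.fileno()).st_size
        f_bin.seek(randint(0, max_offset), 0)
        f_bin.readline()  # throw away the incomplete sentence (and any partial UTF-8 bytes)

    # usage sketch: fetch one random, fully decodable sentence
    with open("corpus.txt", "rb") as f_neg:
        seek_random_offset_binary(f_neg)
        sentence = f_neg.readline().decode("utf-8").strip()

This keeps every character of the corpus intact, at the cost of decoding lines manually instead of relying on a text-mode handle.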

I'd like to hear your opinions and thanks again for the contribution!

dhlee347 commented 5 years ago

Thanks for your suggestion! (pretrain.py is still being developed :D)

The input file format is as follows (from the original Google BERT repo; a toy example is shown after the list):

  1. One sentence per line. These should ideally be actual sentences, not entire paragraphs or arbitrary spans of text. (Because we use the sentence boundaries for the "next sentence prediction" task).
  2. Blank lines between documents. Document boundaries are needed so that the "next sentence prediction" task doesn't span between documents.
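Concretely, a tiny (purely illustrative) input file would look like this, with one sentence per line and a blank line separating the two documents:

    The cat sat on the mat.
    Then it fell asleep in the afternoon sun.

    This line starts a second, unrelated document.
    Its sentences are never paired with the first document for next sentence prediction.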

I'm working with the Toronto Book Corpus, and my previous code works well for the entire corpus, but your suggestion makes sense to me because UTF-8 is the de facto standard, so I updated my code (not fully tested yet).

Regarding reproducibility: to be honest, my code doesn't have exactly the same functionality as the original code, in favor of a simple and efficient implementation. For example, the original loads all documents at once, shuffles them, and prepares TFRecord files with duplication, which needs a lot of memory and takes a long time to pre-process. My code instead just works on file pointers and pre-processes at training time, which I don't think adds much overhead during training (see the sketch below). Anyway, I want my code to reach the same performance after fine-tuning, not to reproduce the original code exactly.
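As a rough illustration of the idea (a minimal sketch, not the actual SentPairDataLoader code), positive pairs can be taken from consecutive lines of one file handle, while negative sentences are drawn by seeking a second handle to a random offset:

    import os
    from random import randint, random

    def seek_random_offset(f):
        # jump to a random position and discard the (possibly incomplete) first line
        max_offset = os.fstat(f.fileno()).st_size
        f.seek(randint(0, max_offset), 0)
        f.readline()  # throw away an incomplete sentence

    # two independent handles on the same corpus file (format as described above);
    # EOF wrap-around and blank document-boundary lines are omitted for brevity
    f_pos = open("corpus.txt", "r", encoding="utf-8", errors="ignore")
    f_neg = open("corpus.txt", "r", encoding="utf-8", errors="ignore")

    def next_sentence_pair(prob_next=0.5):
        sent_a = f_pos.readline().strip()
        if random() < prob_next:
            return sent_a, f_pos.readline().strip(), True    # IsNext: consecutive sentence
        seek_random_offset(f_neg)
        return sent_a, f_neg.readline().strip(), False       # NotNext: random sentence

No TFRecord-style pre-processing pass is needed; each training step just pulls the next pair directly from the file pointers.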

L0SG commented 5 years ago

I see, and I agree that online data generation is more straightforward. Thank you for the detailed explanation! I think we can close the issue, but anyone should feel free to reopen it if something breaks because of the fix.