Open kiukchung opened 1 year ago
This requires a text file format in which documents are separated by empty lines. The following example consists of two documents, where each line is treated as one sentence of a document.
We use wikipedia corpus.
Wikipedia is great.

We also use bookcorpus.
Bookcorpus is also helpful.
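As a minimal sketch of this format (not the script's actual code), the example can be split into documents by treating empty lines as separators. The helper name `split_documents` is hypothetical and is introduced here only for illustration.

```python
def split_documents(text):
    """Group lines into documents; a line that is empty after strip() ends a document."""
    documents, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # Empty line: close the current document if it holds any sentences.
            if current:
                documents.append(current)
                current = []
        else:
            current.append(line)
    # Flush the last document, which has no trailing empty line.
    if current:
        documents.append(current)
    return documents


corpus = (
    "We use wikipedia corpus.\n"
    "Wikipedia is great.\n"
    "\n"
    "We also use bookcorpus.\n"
    "Bookcorpus is also helpful.\n"
)
docs = split_documents(corpus)
print(len(docs))  # 2
```

Each inner list holds the sentences of one document, so the empty line is what separates the Wikipedia sentences from the BookCorpus ones.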
So when the `tokenize_lines` function reaches the end of the file, the first `if not line` is triggered and we break the loop. When the function reads the empty line between two documents, the second `if not line` is triggered after `line.strip()` on `line = '\n'`.
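The two checks can be illustrated with a short sketch. It is an assumption about the shape of the logic, not a copy of `tokenize_lines`: the name `tokenize_lines_sketch`, the whitespace `split()` stand-in for the tokenizer, and the use of an empty string as an end-of-input marker (the value `file.readline()` returns at end of file) are all hypothetical choices made for illustration.

```python
def tokenize_lines_sketch(lines):
    """Illustrative only: mirrors the two `if not line` checks discussed in this thread."""
    results = []
    for line in lines:
        if not line:
            # First check: "" (or None) is falsy, so stop processing here.
            break
        line = line.strip()
        if not line:
            # Second check: a whitespace-only line such as "\n" is empty after strip().
            results.append([])  # an empty list marks a document boundary
        else:
            results.append(line.split())  # stand-in tokenizer (assumption)
    return results


lines = [
    "We use wikipedia corpus.\n",
    "Wikipedia is great.\n",
    "\n",                          # empty line between two documents
    "We also use bookcorpus.\n",
    "",                            # end-of-input marker
    "never reached\n",
]
result = tokenize_lines_sketch(lines)
print(result)
```

Note that `"\n"` is a non-empty string and therefore passes the first check; it only becomes empty after `strip()`, which is why the two checks behave differently.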
Description
In the function `scripts.pretraining.bert.create_pretraining_data.tokenize_lines()`, the code suggests that empty or null lines (e.g. `""` or `None`) break the for-loop, returning only the lines that have been processed so far, whereas lines that are empty after stripping (e.g. `" "`) are used as document delimiters.

Could someone shed light on what the (empty line + break-from-loop) is meant to accomplish? Are empty/null lines used as delimiters?