Closed JoeyTPChou closed 2 years ago
hi @JoeyTPChou, we are reviewing this now.
hi @JoeyTPChou, thanks for your feedback here. We have made updates to the page to address your comments.
Thanks for the update @greg-serochi !
Hi @JoeyTPChou, can we close this one?
Yes, I wasn't aware that I needed to close it, thanks for the reminder.
In the **(3) Partition the xz files obtained in step-2 in to three directories** section, all of the `mkdir` commands create `openwebtext-train`. They should be `mkdir openwebtext-valid` and `mkdir openwebtext-test` for the valid and test datasets.

In the **(4) In each of the openwebtext-train, openwebtext-valid and openwebtext-test directories, ...** section, we need to untar the `.xz` files first. The command could be changed to the following, using `openwebtext-train` as an example:

```
(cd openwebtext-train && find -name "*.xz" -exec tar -xf {} \; && find -name \*.txt -exec sh -c 'cat {} >> train.raw' \;)
```
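Putting the two fixes together, steps (3) and (4) could look like the sketch below. The sample archive it creates is a hypothetical stand-in so the sketch is runnable; the real inputs are the OpenWebText `.xz` tarballs moved into each directory:

```shell
set -e
# Step (3), corrected: one directory per split, not openwebtext-train three times.
mkdir -p openwebtext-train openwebtext-valid openwebtext-test

# Hypothetical sample archive (stand-in for the real OpenWebText .xz tarballs).
echo "sample document" > doc1.txt
tar -cJf openwebtext-train/sample_data.xz doc1.txt && rm doc1.txt

# Step (4), per directory: untar each .xz file, then concatenate the
# extracted .txt files into a single raw file for that split.
for split in train valid test; do
  ( cd "openwebtext-${split}" \
      && find . -name '*.xz' -exec tar -xf {} \; \
      && find . -name '*.txt' -exec sh -c 'cat "$1" >> "$2"' _ {} "${split}.raw" \; )
done
```

Passing the filename as a positional argument to `sh -c` (instead of substituting `{}` into the script text) avoids breakage on filenames containing quotes or spaces.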
In the **(6) Tokenizing openwebtext** section, the `examples` directory is missing under `PyTorch/nlp/GPT2/`, so `Model-References/PyTorch/nlp/GPT2/examples/roberta/multiprocessing_bpe_encoder.py` cannot be accessed.
We need to clone the original fairseq repo in order to use it.

In the **(7) Binarizing the train, valid, test gpt2 tokenized files** section, instead of `mkdir /data/OpenWebText_gpt2` we could do `mkdir -p /data/OpenWebText_gpt2`, which also succeeds if the directory already exists.
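For the tokenization point in (6), a hedged sketch of the workaround is below. The vocab URLs and encoder flags follow fairseq's RoBERTa preprocessing README, and the `openwebtext-*` input/output paths are assumptions; it is written to a script here rather than executed, since cloning and downloading need network access:

```shell
# Saved to a file instead of run, because it needs network access.
cat <<'EOF' > tokenize_openwebtext.sh
#!/bin/sh
set -e
# Clone the upstream fairseq repo to get the missing examples/ directory.
git clone https://github.com/pytorch/fairseq.git
cd fairseq
pip install --editable .

# GPT-2 BPE vocab files (URLs as in fairseq's RoBERTa preprocessing docs).
wget https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe

# Tokenize each split; the ../openwebtext-* paths are assumptions.
for SPLIT in train valid test; do
  python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json encoder.json \
    --vocab-bpe vocab.bpe \
    --inputs "../openwebtext-${SPLIT}/${SPLIT}.raw" \
    --outputs "../openwebtext-${SPLIT}/${SPLIT}.bpe" \
    --keep-empty \
    --workers 16
done
EOF
chmod +x tokenize_openwebtext.sh
```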