Closed JoeyTPChou closed 2 years ago
hi @JoeyTPChou, we are reviewing this now.
hi @JoeyTPChou, thanks for your feedback here. We have made updates to the page to address your comments.
Thanks for the update @greg-serochi !
Hi @JoeyTPChou, can we close this one?
Yes, I wasn't aware that I needed to close it, thanks for the reminder.
In the **(3) Partition the xz files obtained in step-2 in to three directories** section, all of the `mkdir` commands create `openwebtext-train`. They should be `mkdir openwebtext-valid` and `mkdir openwebtext-test` for the valid and test datasets.

In the **(4) In each of the openwebtext-train, openwebtext-valid and openwebtext-test directories, ...** section, we need to untar the `.xz` files first. The command could be changed to the following, using `openwebtext-train` as an example:

```
(cd openwebtext-train && find -name "*.xz" -exec tar -xf {} \; && find -name \*.txt -exec sh -c 'cat {} >> train.raw' \;)
```
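Putting the two fixes together, steps (3) and (4) could look like the sketch below. The sample archive it creates is a hypothetical stand-in so the sketch is runnable; the real inputs are the OpenWebText `.xz` tarballs moved into each directory:

```shell
set -e
# Step (3), corrected: one directory per split, not openwebtext-train three times.
mkdir -p openwebtext-train openwebtext-valid openwebtext-test

# Hypothetical sample archive (stand-in for the real OpenWebText .xz tarballs).
echo "sample document" > doc1.txt
tar -cJf openwebtext-train/sample_data.xz doc1.txt && rm doc1.txt

# Step (4), per directory: untar each .xz file, then concatenate the
# extracted .txt files into a single raw file for that split.
for split in train valid test; do
  ( cd "openwebtext-${split}" \
      && find . -name '*.xz' -exec tar -xf {} \; \
      && find . -name '*.txt' -exec sh -c 'cat "$1" >> "$2"' _ {} "${split}.raw" \; )
done
```

Passing the filename as a positional argument to `sh -c` (instead of substituting `{}` into the script text) avoids breakage on filenames containing quotes or spaces.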
In the **(6) Tokenizing openwebtext** section, the `examples` directory is missing under `PyTorch/nlp/GPT2/`, so `Model-References/PyTorch/nlp/GPT2/examples/roberta/multiprocessing_bpe_encoder.py` cannot be accessed.
We need to clone the original fairseq repo in order to use it.

In the **(7) Binarizing the train, valid, test gpt2 tokenized files** section, instead of `mkdir /data/OpenWebText_gpt2` we could do `mkdir -p /data/OpenWebText_gpt2`, which also succeeds if the directory already exists.
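For the tokenization point in (6), a hedged sketch of the workaround is below. The vocab URLs and encoder flags follow fairseq's RoBERTa preprocessing README, and the `openwebtext-*` input/output paths are assumptions; it is written to a script here rather than executed, since cloning and downloading need network access:

```shell
# Saved to a file instead of run, because it needs network access.
cat <<'EOF' > tokenize_openwebtext.sh
#!/bin/sh
set -e
# Clone the upstream fairseq repo to get the missing examples/ directory.
git clone https://github.com/pytorch/fairseq.git
cd fairseq
pip install --editable .

# GPT-2 BPE vocab files (URLs as in fairseq's RoBERTa preprocessing docs).
wget https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe

# Tokenize each split; the ../openwebtext-* paths are assumptions.
for SPLIT in train valid test; do
  python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json encoder.json \
    --vocab-bpe vocab.bpe \
    --inputs "../openwebtext-${SPLIT}/${SPLIT}.raw" \
    --outputs "../openwebtext-${SPLIT}/${SPLIT}.bpe" \
    --keep-empty \
    --workers 16
done
EOF
chmod +x tokenize_openwebtext.sh
```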