VinAIResearch / PhoBERT

PhoBERT: Pre-trained language models for Vietnamese (EMNLP-2020 Findings)
MIT License

creating a pre-trained model #3

Closed · vr25 closed this issue 4 years ago

vr25 commented 4 years ago

Hi,

Thank you for releasing the language-specific model along with the instructions.

I want to create a similar language-specific pre-trained model. I was wondering if you could share the pre-training scripts and toy data (and maybe a short write-up) so that it is easier to pre-train similar BERT-based models in another language.

I just have one important question: how do you chunk documents where the text is longer than 512 tokens? Do you simply split at 512 tokens even if the sentence hasn't ended, and start the next 512-token chunk where the previous one left off? Does this take a lot of memory?

Thanks!

datquocnguyen commented 4 years ago

Hi, we follow the tutorial here: https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md

Each line in the pretraining dataset corresponds to a sentence. If a sentence is longer than the maximum sentence length (e.g. 256 tokens), it is simply ignored during training.
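For illustration, here is a minimal sketch of that filtering step (not the actual PhoBERT preprocessing script), assuming a hypothetical one-sentence-per-line corpus file `corpus.txt` and a 256-token limit. In the real fairseq pipeline the length is measured in BPE sub-word tokens after encoding; plain whitespace tokens are used here only to keep the example self-contained:

```python
# Hypothetical filtering sketch: keep only sentences that fit the max length.
# File names and the whitespace-based token count are assumptions for the demo.

MAX_TOKENS = 256  # assumed max sentence length, as mentioned above

with open("corpus.txt", encoding="utf-8") as src, \
        open("corpus.filtered.txt", "w", encoding="utf-8") as dst:
    for line in src:                      # one sentence per line
        sentence = line.strip()
        if not sentence:
            continue                      # skip blank lines
        if len(sentence.split()) <= MAX_TOKENS:
            dst.write(sentence + "\n")    # sentence fits, keep it
        # longer sentences are simply dropped, mirroring the
        # "ignored during training" behaviour described above
```

The filtered file can then be BPE-encoded and binarized following the fairseq RoBERTa pretraining tutorial linked above.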

Cheers, Dat.

vr25 commented 4 years ago

Hi,

I followed the instructions at https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md

But I am running into an error (screenshot of the error attached).

I am using the wikitext-103-raw dataset. Do you think it could be due to different PyTorch or TensorFlow versions?

I am using this configuration: 4x NVIDIA Tesla V100 GPUs with 16 GiB of memory each.

Thanks, again!