Another follow-up question on this:
I tried to pretrain a BERT-base-uncased model with a slightly different version of wikitext-2 (https://github.com/alipay/StructuredLM_RTDT/tree/r2d2/data/en_wiki). It contains roughly 700 fewer sentences, and I uploaded it here: https://huggingface.co/datasets/zhengxuanzenwu/wikitext-2-split-128.
Using this dataset, the following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10.0
I was able to train a BERT model with a much lower PPPL, something like 2^4.8121 ~= 28.09, which differs from the 103.54 reported in the paper. Am I missing something here? Is BERT-base-uncased the model you are using? Thanks!
Note: I am using this single script from Hugging Face to pretrain BERT: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py
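For context, here is a rough sketch (my assumption of the mapping, not the exact command I ran) of how the hyperparameters above correspond to the transformers TrainingArguments that run_mlm.py builds from its command-line flags; the output_dir name is just a placeholder:

```python
# Sketch only: run_mlm.py parses these values from CLI flags into TrainingArguments.
# output_dir is a hypothetical name, not the directory I actually used.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-base-uncased-wikitext2-mlm",  # placeholder
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    gradient_accumulation_steps=4,   # 8 * 4 = 32 effective train batch size on a single GPU
    lr_scheduler_type="linear",
    num_train_epochs=10.0,
    # Optimizer: the Trainer default, AdamW with betas=(0.9, 0.999) and eps=1e-8.
)
```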
Sorry, it's the weekend in my local time, so I didn't respond in time. We are glad to discuss the details with you~
For question 1: the code outputs the log result, but to align with prior works we convert it to e^x in the paper. Since torch.log is the natural log (log_e(x)), the exponential should be taken with base e.
For question 2: the linked data is the test set for en_wiki2, as the training set is a little big. We just downloaded it from an official site and split it by sentences. Let me find the download URL for you later.
For the second post, following the answer to Q1, the value should actually be estimated as e^4.812 ≈ 122.98, so it seems close to the reported result.
Please feel free to discuss if you have any further questions~ If you are interested, we are also glad to share some failed attempts :)
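In other words, a quick sanity check of the base conversion with the number quoted above (just a sketch, not part of our evaluation code):

```python
import math

log_pppl = 4.8121           # the log-space PPPL reported in the previous post
print(2 ** log_pppl)        # ~28.1  -> base-2 interpretation
print(math.exp(log_pppl))   # ~123.0 -> base-e interpretation, i.e. e^4.812 ≈ 122.98
```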
Thanks for the clarification. Do you mind if I follow up with an email asking about some of your failed attempts?
Sure, no problem. My private email address is imhuim982AT126DOTcom. You can contact me through either my company email or my private email.
I have a few minor questions about the R2D2 PPPL measurements and their implementation.
Q1: In the paper, PPPL is defined as exp(-(1/N) * sum(L(S))).
This makes sense. But the evaluation code here outputs PPPL without taking the exponential. I assume the numbers in the paper are actually 2^{PPPL}, right? (assuming base 2). If I simply load a randomly initialized BERT model, the PPPL output here is around 10.4, and 2^{10.4} ~= 1351, which is about right.
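For concreteness, this is roughly the pseudo-log-likelihood / PPPL computation I have in mind (my own sketch with bert-base-uncased and base e, not your evaluation code):

```python
# Sketch: pseudo-perplexity (PPPL) of a masked LM, i.e.
# PPPL = exp(-(1/N) * sum_i log p(w_i | S with w_i masked)).
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_perplexity(sentences):
    total_log_prob, n_tokens = 0.0, 0
    with torch.no_grad():
        for sent in sentences:
            input_ids = tokenizer(sent, return_tensors="pt")["input_ids"][0]
            # Mask one position at a time, skipping [CLS] and [SEP].
            for i in range(1, input_ids.size(0) - 1):
                masked = input_ids.clone()
                masked[i] = tokenizer.mask_token_id
                logits = model(masked.unsqueeze(0)).logits[0, i]
                log_probs = torch.log_softmax(logits, dim=-1)
                total_log_prob += log_probs[input_ids[i]].item()
                n_tokens += 1
    # Exponentiate the negative average log-likelihood (base e).
    return float(torch.exp(torch.tensor(-total_log_prob / n_tokens)))

print(pseudo_perplexity(["The cat sat on the mat."]))
```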
Q2: For pretraining the BERT baseline, are you loading the same dataset as in the link below, or some default Hugging Face dataset?
https://github.com/alipay/StructuredLM_RTDT/tree/r2d2/data/en_wiki
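For what it's worth, this is the distinction I mean (a sketch; the first name is the split I uploaded above, the second is the stock Hugging Face wikitext-2 config, which may or may not match your setup):

```python
from datasets import load_dataset

# The re-split version I uploaded (see the link earlier in the thread).
ds_resplit = load_dataset("zhengxuanzenwu/wikitext-2-split-128")

# The default Hugging Face wikitext-2 configuration.
ds_default = load_dataset("wikitext", "wikitext-2-raw-v1")
```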
Sorry to throw random questions at you, but this would be very helpful for me as I build something on top of this.
Thanks.