ant-research / StructuredLM_RTDT

A library for building hierarchical text representation and corresponding downstream applications.
Apache License 2.0

question about perplexity measures with R2D2 original model #3

Closed · frankaging closed this issue 2 years ago

frankaging commented 2 years ago

I have a few minor questions about the R2D2 PPPL measurements and their implementation.

Q1: In the paper, PPPL is defined as exp(-(1/N) Σ_S L(S)).

This makes sense. But in the evaluation code here,

                log_p_sums, b_c, pppl = self.predictor(ids, self.bucket_size, self.get_bucket_id)
                PPPL += (pppl - PPPL) / counter
                print(PPPL, file=f_out)

We output PPPL without taking the exponential. I assume the numbers in the paper are actually 2^{PPPL}, right? (assuming base 2). If I simply load a random BERT model, the PPPL printed here is around 10.4, and 2^{10.4} ≈ 1351, which seems about right.
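
For context, the `PPPL += (pppl - PPPL) / counter` line is just an incremental running mean over batches, so the printed value is the average log-space quantity. Below is a minimal sketch of what I understand is being computed, following the standard masked-LM pseudo-log-likelihood recipe (my own rough version, not the repo's predictor; the helper name is made up), with the exponentiation applied only at the end:

```python
# A minimal sketch (not the repo's predictor) of pseudo-perplexity for a
# masked LM: mask one position at a time, sum the log-probs of the true
# tokens, and exponentiate the negative per-token average at the end.
import math

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_log_likelihood(sentence):
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, ids.size(0) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total, ids.size(0) - 2                # PLL and token count

sentences = ["The quick brown fox jumps over the lazy dog."]
pll, n = (sum(x) for x in zip(*(pseudo_log_likelihood(s) for s in sentences)))
log_pppl = -pll / n          # the log-space value, like the one printed above
pppl = math.exp(log_pppl)    # exponentiation to get the actual pseudo-perplexity
```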

Q2: For pretraining the BERT baseline, are you loading the same dataset as in the link below, or some default Hugging Face dataset? https://github.com/alipay/StructuredLM_RTDT/tree/r2d2/data/en_wiki

Sorry to throw random questions at you, but the answers would be very helpful for building something on top of this work.

Thanks.

frankaging commented 2 years ago

Another follow-up question on this:

I tried to pretrain a bert-base-uncased model with a slightly different version of wikitext-2 than https://github.com/alipay/StructuredLM_RTDT/tree/r2d2/data/en_wiki; it contains ~700 fewer sentences, and I uploaded it here: https://huggingface.co/datasets/zhengxuanzenwu/wikitext-2-split-128.

Using this dataset, I pretrained with the following hyperparameters:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10.0

I was able to train a BERT model with a much lower PPPL, something like 2^4.8121 ≈ 28.09, which is quite different from the 103.54 reported in the paper. Am I missing something here? Is bert-base-uncased the model you are using? Thanks!

Note: I am using this script from Hugging Face to pretrain BERT: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py
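
For reference, here is roughly how those hyperparameters map onto Hugging Face `TrainingArguments` (just an illustration of the configuration, not the exact command line I ran; `output_dir` is a placeholder):

```python
# Illustrative only: the hyperparameters above expressed as TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-mlm-wikitext2",  # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,    # effective train batch size: 8 * 4 = 32
    num_train_epochs=10.0,
    lr_scheduler_type="linear",
    seed=42,
    adam_beta1=0.9,                   # Adam betas=(0.9, 0.999)
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```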

imhuim982 commented 2 years ago

Sorry for the slow reply; it was the weekend in my time zone. We are glad to discuss the details with you~

For question 1: we record the log result first, but to align with prior work we report e^x, i.e. we exponentiate with base e. Since torch.log is the natural logarithm (log_e), that should make sense.

For question 2: that link is the test set of en_wiki2, since the training set is a little large. We just downloaded it from an official site and split it by sentences. Let me find the download URL for you later.

Regarding your second post: following the answer to Q1, the value should be estimated as e^4.812 ≈ 122.98, which seems close to the reported result.
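
To make the base conversion concrete, a quick check with the 4.8121 figure from your post:

```python
import math

log_pppl = 4.8121                # the log-space value from your training run
print(math.exp(log_pppl))        # ≈ 122.99  (base e, matching torch.log)
print(2 ** log_pppl)             # ≈ 28.09   (base 2, which understates PPPL)
```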

Please feel free to ask if you have any further questions~ If you are interested, we are also glad to share some of our failed attempts :)

frankaging commented 2 years ago

Thanks for the clarification. Do you mind if I follow up with an email asking about some of your failed attempts?

imhuim982 commented 2 years ago

Sure, no problem. My private email address is imhuim982AT126DOTcom. You can contact me through either my company email or my private email.