wormyu opened this issue 1 year ago

Hi, thanks for the nice work.
I'm trying to reproduce the paper's results, but I noticed that the hyperparameters provided in this repository (in the pre-training script and config.json) differ slightly from those in the paper (e.g., learning rate, gradient accumulation steps). I'm wondering which version should be used to reproduce the paper's results, and which version of the hyperparameters you used to obtain the checkpoint you provide?
Thanks for reading!
You could try the hyperparameters in this repo.
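For instance, you can read the values that ship with the repo straight from config.json. This is just a quick sketch; the key names below are illustrative, so match them against the actual field names in the file:

```python
import json

# Inspect the hyperparameters shipped with the repo. The key names below are
# illustrative -- check the actual config.json for the exact field names.
with open("config.json") as f:
    cfg = json.load(f)

for key in ("learning_rate", "gradient_accumulation_steps", "max_seq_length"):
    print(key, "=", cfg.get(key, "<not present>"))
```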
Thank you for your response!
I also wanted to confirm whether the pre-training in this work follows the two-phase approach of the original BERT paper and NVIDIA/BERT, where 90% of the training steps use a sequence length of 128 (phase 1) and the remaining 10% use a sequence length of 512 (phase 2). In the pre-training script provided in the PlugLM repository, I noticed phase-2 pre-training with max_train_step=8000, but no explicit mention of phase-1 pre-training.
Could you please clarify whether phase-1 pre-training is conducted in this work, and what the total pre-training time cost is? I appreciate your assistance!
All the baselines and PlugLM are pre-trained with only stage-2.
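To make the two schedules concrete, here is a minimal sketch; only the single-stage settings (sequence length 512, max_train_step=8000) reflect the script in this repo, while the 90/10 split is the BERT/NVIDIA-style recipe mentioned above:

```python
# Minimal sketch contrasting the two pre-training schedules discussed above.
# Only single_stage_schedule() reflects the script in this repo; the two-phase
# numbers are the original BERT / NVIDIA recipe.

def two_phase_schedule(total_steps):
    # BERT / NVIDIA recipe: ~90% of steps at seq len 128 (phase 1),
    # the remaining ~10% at seq len 512 (phase 2).
    phase1_steps = int(0.9 * total_steps)
    return [
        {"phase": 1, "max_seq_length": 128, "steps": phase1_steps},
        {"phase": 2, "max_seq_length": 512, "steps": total_steps - phase1_steps},
    ]

def single_stage_schedule():
    # What PlugLM and all baselines use: stage 2 only.
    return [{"phase": 2, "max_seq_length": 512, "steps": 8000}]

print(single_stage_schedule())
```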
Thanks for your kind reply. I have another question: do you remember the training time of the pre-training stage using 8 A100 GPUs?
Sorry for bothering you again: I want to make sure I'm using the right knowledge corpus for Amazon reviews. According to your README.md, the Amazon review dataset should be downloaded using Hugging Face datasets, but there are several datasets relevant to Amazon reviews there. Is this the one you used in the domain adaptation task, or did you download it from https://cseweb.ucsd.edu/~jmcauley/datasets/amazon/links.html ?
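For context, this is how I've been listing the candidates on the Hub, just to illustrate that there are several; none of these names come from your repo:

```python
from huggingface_hub import HfApi

# List Amazon-review-related datasets on the Hugging Face Hub. This only shows
# that there are several candidates; none of them is necessarily the one used
# in the paper.
api = HfApi()
for ds in api.list_datasets(search="amazon reviews", limit=10):
    print(ds.id)
```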
Again, many thanks for taking the time to answer my questions.
Hi, please see the don't stop pre-training paper.
Really thanks for your reply! According to this issue, it seems all the corpus data should be downloaded from their original sources.
Sorry to bother you again, but I have another question. On their GitHub page, the link you provided offers three options for the PubMed dataset. Could you kindly specify which of those links was used as the knowledge base for the in-domain pre-training task? I'm also curious whether any preprocessing was applied to the downloaded raw data.
Thanks for clarifying all this for me!
Hi, sorry for the late reply; I've been busy recently. If I remember correctly, it was the PubMed Central Full Texts.
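In case it helps as a starting point, here is a minimal sketch of pulling plain text out of PMC full-text XML; this is only illustrative and not necessarily the preprocessing used for the paper (file and directory names are hypothetical):

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def pmc_xml_to_text(path):
    """Extract paragraph text from one PMC full-text XML (.nxml) file.
    Illustrative only; the paper's actual preprocessing may differ."""
    root = ET.parse(path).getroot()
    body = root.find(".//body")
    if body is None:
        return ""
    paragraphs = ["".join(p.itertext()).strip() for p in body.iter("p")]
    return "\n".join(p for p in paragraphs if p)

# Example: flatten a directory of downloaded .nxml files into one plain-text corpus.
if __name__ == "__main__":
    with open("pubmed_corpus.txt", "w") as out:
        for nxml in Path("pmc_fulltexts").glob("*.nxml"):
            text = pmc_xml_to_text(nxml)
            if text:
                out.write(text + "\n\n")
```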
Hi, thanks again for replying; that solves my question.
I'm wondering what fine-tuning steps you take for all the downstream tasks. I can only find that run_classification.py trains for 10 epochs in the script you provide, but for the other tasks I can't find relevant information in the README.md file or the paper. Can you give me some hints about this? Maybe I missed some parts of the code.
Thanks again for helping me!
Hi, for tasks other than classification, you could write your own scripts, since you can simply treat PlugLM as a BERT with the same interface for downstream tasks. For biomed-related tasks, you could refer to this: https://github.com/dmis-lab/biobert
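Concretely, fine-tuning then looks like any standard BERT fine-tuning loop. Here is a short sketch using the Hugging Face BERT classes as a stand-in; you would swap in the PlugLM model, tokenizer, and checkpoint path from this repo, and the learning rate and epoch count below are illustrative rather than the paper's settings:

```python
import torch
from torch.optim import AdamW
from transformers import BertForSequenceClassification, BertTokenizer

# Stand-in example with HF BERT classes; PlugLM exposes the same interface for
# downstream tasks, so the loop is identical once the PlugLM classes are swapped in.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)  # lr is illustrative, not the paper's setting

texts = ["an example sentence", "another example"]
labels = torch.tensor([0, 1])

model.train()
for epoch in range(3):  # epoch count is illustrative; see run_classification.py for classification
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```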
Hi, thanks for replying, and sorry for my misleading question: I meant to ask "how many" fine-tuning steps you take, not "what" fine-tuning steps. I'm trying to compare the model performance reported in your paper with mine, and the comparison only makes sense under the same training parameters. I know you have kindly shared Python files for the other downstream tasks, and thanks for clarifying the source for the biomed-related tasks. I appreciate it a lot!