VisualJoyce / ChengyuBERT

[COLING 2020] BERT-based Models for Chengyu
MIT License

About parameters #9

Closed WinniyGD closed 3 years ago

WinniyGD commented 3 years ago

I used the parameters shown in your paper:

pre-trained BERT: Chinese with Whole Word Masking (WWM)
maximum length: 128
batch size: 40 (4 × 10 GPU cards)
initial learning rate: 0.00005
warm-up steps: 1000
optimizer: AdamW
scheduler: WarmupLinearSchedule
epochs: 5 (num_train_steps about 80800)

Because of my device (1 × RTX 2080 Ti), I set train_batch_size = 6000 and num_train_steps to about 80800. The experiment still runs for just 5 epochs, and the batch size is still just 40.

But I cannot reach your accuracy; the following screenshot shows my experiment's accuracy. [screenshot of results] That's a gap of nearly 3~6%. [screenshot]

That's my training config JSON:

    {
        "train_txt_db": "official_train.db",
        "val_txt_db": "official_dev.db",
        "test_txt_db": "official_test.db",
        "out_txt_db": "official_out.db",
        "sim_txt_db": "official_sim.db",
        "ran_txt_db": "official_ran.db",
        "pretrained_model_name_or_path": "hfl/chinese-bert-wwm-ext",
        "model": "chengyubert-dual",
        "dataset_cls": "chengyu-masked",
        "eval_dataset_cls": "chengyu-masked-eval",
        "output_dir": "storage",
        "candidates": "combined",
        "len_idiom_vocab": 3848,
        "max_txt_len": 128,
        "train_batch_size": 6000,
        "val_batch_size": 20000,
        "gradient_accumulation_steps": 1,
        "learning_rate": 0.00005,
        "valid_steps": 100,
        "num_train_steps": 80800,
        "optim": "adamw",
        "betas": [0.9, 0.98],
        "adam_epsilon": 1e-08,
        "dropout": 0.1,
        "weight_decay": 0.01,
        "grad_norm": 1.0,
        "warmup_steps": 1000,
        "seed": 77,
        "fp16": true,
        "n_workers": 0,
        "pin_mem": true,
        "location_only": false
    }

What's wrong with the parameters?

VisualJoyce commented 3 years ago

I am not sure if this is due to gradient harvesting across multiple GPUs.

My suggestion for a single card is to fill the GPU memory completely and set gradient_accumulation_steps=5.
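
In case it helps, the general pattern is roughly the sketch below (a minimal PyTorch illustration of gradient accumulation, not the actual training loop in this repo; the model, optimizer, and data here are placeholders):

    import torch

    # Placeholder model, optimizer, and data; not the repo's objects.
    model = torch.nn.Linear(768, 3848)        # stands in for the real classifier
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    loader = [(torch.randn(8, 768), torch.randint(0, 3848, (8,)))
              for _ in range(20)]

    gradient_accumulation_steps = 5
    optimizer.zero_grad()

    for step, (inputs, labels) in enumerate(loader, start=1):
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        # Scale the loss so the accumulated gradient matches one large batch.
        (loss / gradient_accumulation_steps).backward()

        if step % gradient_accumulation_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # grad_norm
            optimizer.step()   # one optimizer update per 5 forward/backward passes
            optimizer.zero_grad()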

Let's see if that works.

WinniyGD commented 3 years ago

Fill the GPU memory completely? Does that mean 'train_batch_size' can be set to a larger value than the one I'm using now?

WinniyGD commented 3 years ago

Could you please provide the JSON parameter file you used for training? I'll be grateful to you.

VisualJoyce commented 3 years ago

I have been updating this repo for a while, but the original JSON config has not changed much.

But I do recommend the following settings for a single GPU:

    "train_batch_size": 11000,
    "gradient_accumulation_steps": 5,
    "num_train_steps": 18000,

I'm sorry about the reproduction issues you encountered; I hope we can find the cause through these trials.

WinniyGD commented 3 years ago

I appreciate your kind help. I'll try the experiment again. ♥

WinniyGD commented 3 years ago

With num_train_steps = 18000 and train_batch_size = 11000, training only covers about 2 epochs, not 5. Does that matter?

VisualJoyce commented 3 years ago

If we use gradient_accumulation_steps, each step uses five times as many examples.
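
Concretely (whether train_batch_size is counted in examples or tokens does not change the arithmetic):

    # Quick sketch of the effective batch per optimizer update.
    train_batch_size = 11000
    gradient_accumulation_steps = 5

    effective_batch = train_batch_size * gradient_accumulation_steps
    print(effective_batch)   # 55000: five mini-batches folded into each update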

WinniyGD commented 3 years ago

OK, I see. Thanks. I will try again; hopefully it will achieve the desired score.

VisualJoyce commented 3 years ago

Yes, I also feel nervous about reproduction, although I have run my code several times.

I hope the results can be reproduced without difficulty, and I will update the parameters for the benefit of all.

WinniyGD commented 3 years ago

Congratulations! The new experiment achieves the desired score (approaching it, though not exceeding it). [screenshot of results]

It needs gradient_accumulation_steps = 5. But why? Could you explain the principle behind this?

VisualJoyce commented 3 years ago

Glad that works!

I think a larger batch size converges better; this is mainly because the accumulated stochastic gradient is closer to the full-batch gradient when fitting the dataset.
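
As a toy illustration of why (a synthetic least-squares problem, nothing specific to this repo or model), the mini-batch gradient's deviation from the full-batch gradient shrinks as the batch grows:

    import numpy as np

    # Synthetic least-squares problem, purely for illustration.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 20))
    y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=10000)
    w = np.zeros(20)

    def grad(idx):
        """Mini-batch gradient of 0.5 * mean((X w - y)^2) over the rows in idx."""
        Xb, yb = X[idx], y[idx]
        return Xb.T @ (Xb @ w - yb) / len(idx)

    full = grad(np.arange(len(X)))
    for batch_size in (40, 200, 1000):
        draws = [grad(rng.choice(len(X), batch_size, replace=False))
                 for _ in range(200)]
        noise = np.mean([np.linalg.norm(g - full) ** 2 for g in draws])
        print(batch_size, noise)   # mean squared deviation drops as batch_size grows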

WinniyGD commented 3 years ago

Well, that's amazing! I've learned something new. Also, the two-stage training that appears in your code isn't described in detail in your paper. What is the two-stage training about? Will it get a higher score?

VisualJoyce commented 3 years ago

For the two-stage model, you can directly try Stage-Two. If you are interested in the paper, here is the link: Two-Stage.

WinniyGD commented 3 years ago

Glad to hear that!

I have another question. You used 'valid_steps' to pick the best-scoring checkpoint, but in some cases that can be coincidental or fortuitous. From my observations, the dev-set accuracy during training was mostly stable around 79. Would you consider using K-fold cross validation to get a more convincing score?
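
(By "the best-scoring checkpoint" I mean a pattern roughly like the sketch below; the function names here are placeholders I made up, not your actual code.)

    import random
    import torch

    def train_one_step(model, batch, optimizer):
        pass                      # forward / backward / optimizer step would go here

    def evaluate(model, val_loader):
        return random.random()    # stand-in for dev-set accuracy

    def train(model, train_batches, val_loader, optimizer,
              valid_steps=100, out_path="best.pt"):
        best_acc = 0.0
        for step, batch in enumerate(train_batches, start=1):
            train_one_step(model, batch, optimizer)
            if step % valid_steps == 0:
                acc = evaluate(model, val_loader)
                if acc > best_acc:                      # keep only the best dev score
                    best_acc = acc
                    torch.save(model.state_dict(), out_path)
        return best_acc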

VisualJoyce commented 3 years ago

This dataset is large, so cross validation requires much more computation. If the goal is to get the best possible performance, it is one way to do it.

In most cases, if the results are enough to show that the method works, we follow the train-dev-test split used in most large-scale QA tasks.

Biases of the dataset can be a separate research topic.

WinniyGD commented 3 years ago

Oh, I see.

Due to my limited experience, this is the first time I've seen this kind of training setup, so it seemed surprising to me. My teachers have always asked me to use cross validation. Thank you very much; I'm learning a new training approach from your code.

I appreciate your work and kind help; I've learned a lot.

VisualJoyce commented 3 years ago

Thank you for saying so!

I also learned a lot from the repos in the acknowledgements. I recommend you try their code as well.

WinniyGD commented 3 years ago

I'm sorry to bother you again.

I'd like to know whether the code for the two-stage paper ('A BERT-based two-stage model for Chinese Chengyu recommendation') only uses 'train_pretrain.py' and 'train_official.py'. What's the difference between stage-1 pre-training and using 'train_pretrain.py'?

Also, what's the difference among w/o Pre-Training, w/o Fine-Tuning, w/o L_V, and w/o L_A? (I don't quite understand what you're showing in your paper.)

Could you describe them in more detail? Thanks very much.

VisualJoyce commented 3 years ago

How about starting a new issue for each question, so that I can answer them one by one?

I'm suggesting this because it may help others who have similar questions.

WinniyGD commented 3 years ago

OK!