Nardien / NMG

Official Code Repository for the paper "Neural Mask Generator: Learning to Generate Adaptive Word Maskings for Language Model Adaptation (EMNLP 2020)"

Which one to choose for meta-test: mask or mask_base? #4

Closed gghhoosstt closed 2 years ago

gghhoosstt commented 2 years ago

Hi, it's me! I have successfully run your code, and I got a series of checkpoints in the directories "./results/squad/bert/2021-neural/mask" and "./results/squad/bert/2021-neural/mask_base". I wonder which one to choose for meta-test, or should I use both? What is the meaning of ../mask versus .../mask_base? Thank you!

Nardien commented 2 years ago

As described in the paper, we use self-play to train the neural mask generator (NMG) model. mask_base holds the checkpoints from the opponent's neural model. There is not much difference between the two; however, the opponent model has been trained for one fewer episode than the player model. In conclusion, I recommend using the checkpoints from mask rather than mask_base.
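For example, a minimal sketch of selecting the newest player checkpoint (the glob pattern and file layout here are my assumptions, not guaranteed to match the repository's actual naming):

```python
import glob
import os

# Pick the most recent checkpoint from the player ("mask") directory,
# rather than the opponent ("mask_base") directory.
ckpt_dir = "./results/squad/bert/2021-neural/mask"
checkpoints = sorted(glob.glob(os.path.join(ckpt_dir, "*")), key=os.path.getmtime)
latest_ckpt = checkpoints[-1]  # newest player checkpoint
```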

gghhoosstt commented 2 years ago

OK, thank you very much! I saw the self-play algorithm. I wonder why you chose such a complicated self-play algorithm to update the NMG; have you tested other, simpler algorithms? Also, since the NMG is itself a meta-learning model, have you tested MAML or other meta-learning algorithms to update it?

Nardien commented 2 years ago

Thank you for the great question and sorry for the late response due to the holidays.

We introduce the self-play algorithm to make the reward more accurate, so that the NMG model learns better actions from the relative comparison between two different policies.
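As a rough sketch of what that relative comparison looks like (the function and score names below are hypothetical placeholders, not the repository's actual reward code):

```python
def self_play_reward(player_score: float, opponent_score: float) -> float:
    """Hypothetical sketch of a relative (self-play) reward.

    Each score could be, e.g., dev-set accuracy of a BERT model that was
    further pre-trained with that policy's masks and then fine-tuned.
    Comparing two policies on the same episode denoises the reward:
    shared task difficulty cancels out, leaving their relative merit.
    """
    return player_score - opponent_score
```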

As you mentioned, the use of reinforcement learning can be complicated and hard, as it needs several additional techniques (self-play, a replay buffer, entropy regularization, ...) to train the RL model well. However, as discussed in Section 3.2 (Justification of Reinforcement Learning) of our paper, it is really hard to train the NMG model with a gradient-based meta-learning algorithm (MAML, ...), since such algorithms commonly need second-order derivatives with respect to the BERT parameters.

In addition, please also refer to this paper (https://aclanthology.org/2020.emnlp-main.497.pdf) if you want to see a different approach to a similar objective (learning a better masking policy).

gghhoosstt commented 2 years ago

OK, I have learned a lot, thank you!

gghhoosstt commented 2 years ago

Do you mean that BERT is not second-order differentiable? But I found that this paper updates BERT with MAML. Or do you mean there are other reasons MAML cannot be used?

Nardien commented 2 years ago

Thank you for the question and for suggesting the related work. I believe you already understand our reasons, explained on page 4 of our main paper (Justification of Reinforcement Learning). As written there, the first-order approximation of a gradient-based meta-learning algorithm leads to a meaningless objective for the NMG parameters.
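To unpack that claim a bit (this is my compressed sketch, with illustrative notation: $\theta$ for the BERT parameters, $\phi$ for the NMG parameters, $\alpha$ an inner learning rate, and a single differentiable pre-training step):

$$\theta' = \theta - \alpha \nabla_\theta \mathcal{L}_{\text{pre}}(\theta; \phi)
\quad\Longrightarrow\quad
\nabla_\phi \mathcal{L}_{\text{meta}}(\theta') = -\alpha \, \nabla_\phi \nabla_\theta \mathcal{L}_{\text{pre}}(\theta; \phi) \, \nabla_{\theta'} \mathcal{L}_{\text{meta}}(\theta')$$

The only path from $\phi$ to the meta-loss is the mixed second-derivative term $\nabla_\phi \nabla_\theta \mathcal{L}_{\text{pre}}$, and that is exactly the term a first-order approximation discards, so the approximate gradient with respect to $\phi$ is identically zero.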

To be precise, BERT is second-order differentiable; it is just expensive. Computing second-order derivatives in meta-learning generally requires a large amount of GPU memory, since it requires retaining the computation graphs and gradients from the inner-loop steps. This means the number of inner-loop iterations may be restricted by the memory budget for storing those gradients. (At the time of our research, we only had two 12GB GPUs for our experiments.)
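For instance, in PyTorch the inner-loop graph has to be kept alive with `create_graph=True`; this generic MAML-style sketch (not code from this repository) shows why memory grows with the number of inner steps:

```python
import torch

def inner_adapt(params, loss_fn, data, lr=1e-2, steps=5):
    """Generic second-order inner loop (illustrative sketch only).

    create_graph=True keeps each inner step's computation graph alive so
    the outer meta-gradient can differentiate through the updates; GPU
    memory therefore grows roughly linearly with `steps`.
    """
    for _ in range(steps):
        loss = loss_fn(params, data)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params
```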

In addition, in our setting, we have to run two sequential training stages: further pre-training and then fine-tuning. This means we need to keep track of gradients from both stages to properly update the NMG model parameters, since the NMG can only affect the further pre-training stage, while the (meta-)loss for the NMG is only obtained after fine-tuning. This requires at least twice the memory of a fine-tuning-only setup such as X-MAML.
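Here is a toy end-to-end illustration of that point (all tensors and losses below are stand-ins I made up; soft masks are used only so the example stays differentiable):

```python
import torch

# Stand-ins: nmg_logits plays the role of NMG parameters, theta of BERT.
nmg_logits = torch.zeros(8, requires_grad=True)
theta = torch.randn(8, requires_grad=True)
lr = 0.1

# Stage 1: further pre-training -- the only stage the NMG influences.
mask = torch.sigmoid(nmg_logits)                     # soft mask, for differentiability
pre_loss = ((theta * mask) ** 2).sum()               # toy "masked LM" loss
theta = theta - lr * torch.autograd.grad(pre_loss, theta, create_graph=True)[0]

# Stage 2: fine-tuning -- the NMG does not act here.
ft_loss = ((theta - 1.0) ** 2).sum()                 # toy fine-tuning loss
theta = theta - lr * torch.autograd.grad(ft_loss, theta, create_graph=True)[0]

# The NMG's learning signal only appears now, so backprop must traverse
# the graphs of BOTH stages -- hence roughly twice the memory.
meta_loss = ((theta - 1.0) ** 2).sum()
meta_loss.backward()
print(nmg_logits.grad)                               # nonzero only via the stage-1 path
```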

In conclusion, I want to confirm that it is not impossible to use a gradient-based meta-learning algorithm (MAML) to update the NMG parameters if you have enough GPU resources. I think this approach is worth trying, but I cannot be sure it would be successful.

gghhoosstt commented 2 years ago

Thank you for your detailed explanation! Meanwhile, the paper I mentioned updates the mask policy with a gradient-based method (Figure 2 on page 5 and Algorithm 1 on page 13). Could you please take a look at it? Thank you!

Nardien commented 2 years ago

I see. That work is a notable example of using a large computational budget to meta-learn a masking policy with a gradient-based algorithm (see their Appendix D.2: they use 32 Quadro or V100 GPUs in parallel to meta-train BART-base).

gghhoosstt commented 2 years ago

OK, thank you!