facebookresearch / XLM

PyTorch original implementation of Cross-lingual Language Model Pretraining.

finetune on GLUE task ends up with same probability #332

Open TingchenFu opened 3 years ago

TingchenFu commented 3 years ago

Hi guys, first of all, thanks for your great model! I fine-tuned the pretrained model mlm_tlm_xnli15_1024.pth on the MNLI-m task (to be specific, a two-class classification task), setting the hyperparameters as recommended:

python glue-xnli.py
--exp_name test_xnli_mlm_tlm             # experiment name
--dump_path ./dumped/                    # where to store the experiment
--model_path mlm_tlm_xnli15_1024.pth     # model location
--data_path ./data/processed/XLM15       # data location
--transfer_tasks XNLI,SST-2              # transfer tasks (XNLI or GLUE tasks)
--optimizer_e adam,lr=0.000025           # optimizer of projection (lr \in [0.000005, 0.000025, 0.000125])
--optimizer_p adam,lr=0.000025           # optimizer of projection (lr \in [0.000005, 0.000025, 0.000125])
--finetune_layers "0:_1"                 # fine-tune all layers
--batch_size 8                           # batch size (\in [4, 8])
--n_epochs 250                           # number of epochs
--epoch_size 20000                       # number of sentences per epoch
--max_len 256                            # max number of words in sentences
--max_vocab 95000                        # max number of words in vocab
After several epochs I got EXACTLY the same probability output for all the validation cases:

-0.27187905
-0.27174124
-0.27346167
-0.27336964
-0.27150354
-0.27345833
-0.2712339
-0.2730249
-0.2720655
-0.2718483

Each number is the probability of being classified as the positive case, as given by the model. Could anyone tell me what happened, and is there a possible solution for that?
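For context, a quick sanity check (this assumes the reported values are log-probabilities of the positive class, which the numbers suggest but the post does not confirm): exponentiating them shows every example maps to essentially the same score, i.e. the classifier has collapsed to a near-constant output.

```python
import math

# Values copied from the report above; assuming they are log-probabilities,
# exponentiating shows every example gets essentially the same score (~0.761).
for lp in [-0.27187905, -0.27174124, -0.27346167, -0.27336964, -0.27150354]:
    print(f"{lp:.8f} -> {math.exp(lp):.4f}")
```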

TingchenFu commented 3 years ago

I found that after the first embedding layer in TransformerModel.fwd

tensor = self.embeddings(x)

the tensor is the same for all the different cases. self.embeddings is defined as:

self.embeddings = Embedding(self.n_words, self.dim, padding_idx=self.pad_index)

where self.n_words=95000 and self.dim=1024, as in the pretrained_params. Is there anything wrong?
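One way to narrow this down (a minimal sketch, assuming `model` is the loaded TransformerModel and `x` is the `(seq_len, batch_size)` LongTensor of word ids passed to `fwd`; the helper name is hypothetical, not part of XLM):

```python
import torch

def inspect_embedding_collapse(model, x):
    """Hypothetical helper: distinguish identical *inputs* (a data bug)
    from identical *embeddings* (a weight/checkpoint bug)."""
    with torch.no_grad():
        tensor = model.embeddings(x)  # (seq_len, batch_size, dim)
    # 1) Are the token ids themselves the same for every example in the batch?
    print("inputs identical across batch:", bool((x == x[:, :1]).all()))
    # 2) Are the embedded vectors the same for every example?
    print("embeddings identical across batch:",
          bool(torch.allclose(tensor, tensor[:, :1].expand_as(tensor))))
    # 3) Have the embedding matrix rows themselves collapsed to one vector?
    w = model.embeddings.weight
    print("embedding rows (near-)identical:",
          bool(torch.allclose(w, w[:1].expand_as(w), atol=1e-6)))
```

If check (1) prints True, the collapse happens before the model (e.g. every sentence is preprocessed to the same ids); if only (2) or (3) do, the checkpoint loading or the fine-tuning itself would be the more likely culprit.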

TingchenFu commented 3 years ago

The train log is here: https://paste.ubuntu.com/p/SbDw33JPjN/ and the complete probability results on the valid dataset are here: https://paste.ubuntu.com/p/xXD9FfGdcT/