VinAIResearch / MISCA

MISCA: A Joint Model for Multiple Intent Detection and Slot Filling with Intent-Slot Co-Attention (EMNLP 2023 - Findings)
GNU Affero General Public License v3.0

Cannot reproduce performance #8

Closed BillKiller closed 1 month ago

BillKiller commented 3 months ago

Hello, I tried to reproduce the model performance both by training from scratch and from the given base model checkpoint. However, neither way produces the performance claimed in the paper. Could you give me more details on how to reproduce it?

I downloaded the given checkpoint into the best_model folder and ran the evaluation script:

```bash
python main.py --token_level word-level \
    --model_type roberta \
    --model_dir misca \
    --task mixatis \
    --data_dir data \
    --attention_mode label \
    --do_train \
    --do_eval \
    --model_dir best_model \
    --num_train_epochs 100 \
    --learning_rate 1e-5 \
    --num_intent_detection \
    --use_crf \
    --intent_slot_attn_type coattention
```

However, I got much lower performance.

BillKiller commented 3 months ago

The performance when trained from the given base checkpoint (downloaded from the given Google Drive):

```
03/31/2024 04:46:48 - INFO - trainer - Eval results
03/31/2024 04:46:48 - INFO - trainer - epoch = -1
03/31/2024 04:46:48 - INFO - trainer - intent_acc = 0.7729468599033816
03/31/2024 04:46:48 - INFO - trainer - intent_f1 = 0.8870732419025387
03/31/2024 04:46:48 - INFO - trainer - loss = 7.091539016136756
03/31/2024 04:46:48 - INFO - trainer - mean_intent_slot = 0.8245242334768335
03/31/2024 04:46:48 - INFO - trainer - num_acc = 0.9963768115942029
03/31/2024 04:46:48 - INFO - trainer - semantic_frame_acc = 0.49033816425120774
03/31/2024 04:46:48 - INFO - trainer - slot_acc = 0.5289855072463768
03/31/2024 04:46:48 - INFO - trainer - slot_f1 = 0.8761016070502852
03/31/2024 04:46:48 - INFO - trainer - slot_precision = 0.8596134282807731
03/31/2024 04:46:48 - INFO - trainer - slot_recall = 0.8932346723044398
```

Running from scratch, step 1:

```bash
python main.py --token_level word-level \
    --model_type roberta \
    --model_dir dir_base \
    --task mixatis \
    --data_dir data \
    --attention_mode label \
    --do_train \
    --do_eval \
    --num_train_epochs 100 \
    --intent_loss_coef 0.5 \
    --learning_rate 1e-5 \
    --train_batch_size 32 \
    --num_intent_detection \
    --use_crf
```

Step 2:

```bash
python main.py --token_level word-level \
    --model_type roberta \
    --model_dir misca \
    --task mixatis \
    --data_dir data \
    --attention_mode label \
    --do_train \
    --do_eval \
    --model_dir best_model \
    --num_train_epochs 100 \
    --learning_rate 1e-5 \
    --num_intent_detection \
    --use_crf \
    --base_model base_dir \
    --intent_slot_attn_type coattention
```

The results when trained from scratch:

```
03/31/2024 03:18:59 - INFO - trainer - Eval results
03/31/2024 03:18:59 - INFO - trainer - epoch = -1
03/31/2024 03:18:59 - INFO - trainer - intent_acc = 0.7693236714975845
03/31/2024 03:18:59 - INFO - trainer - intent_f1 = 0.884155237817333
03/31/2024 03:18:59 - INFO - trainer - loss = 7.854695576887864
03/31/2024 03:18:59 - INFO - trainer - mean_intent_slot = 0.821632243779036
03/31/2024 03:18:59 - INFO - trainer - num_acc = 0.9963768115942029
03/31/2024 03:18:59 - INFO - trainer - semantic_frame_acc = 0.48188405797101447
03/31/2024 03:18:59 - INFO - trainer - slot_acc = 0.538647342995169
03/31/2024 03:18:59 - INFO - trainer - slot_f1 = 0.8739408160604876
03/31/2024 03:18:59 - INFO - trainer - slot_precision = 0.8623617185490096
03/31/2024 03:18:59 - INFO - trainer - slot_recall = 0.8858350951374208
```

BillKiller commented 3 months ago

We ran the code in the same way as mentioned in the readme, but we did not get the results from the paper.

thinhphp commented 3 months ago

Thanks for your interest in our work. Due to some stochastic factors, it is necessary to slightly tune the hyperparameters using grid search. In our experiments, we carefully tuned the hyperparameters, including mixture_weight/learning_rate/... Hope it is helpful for you.

BillKiller commented 3 months ago

Could you provide a set of hyperparameters with a specific random seed to reproduce the results in the paper?

thinhphp commented 3 months ago

You can perform a grid search with the following hyperparameter settings: learning rate in [1e-6, 5e-6], mixture weight in [0.1, 0.2, ..., 0.9]. Hope it will help you.
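A minimal grid-search sketch over these ranges, assuming the mixture weight corresponds to the --intent_loss_coef flag used in the step-1 command earlier in this thread (not confirmed by the authors), and reusing the other flags from the step-2 command:

```bash
# Sketch only: sweep the suggested learning rates and mixture weights.
# Assumes --intent_loss_coef is the mixture weight; all other flags are
# copied from the commands posted above in this issue.
for lr in 1e-6 2e-6 3e-6 4e-6 5e-6; do
  for mw in 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9; do
    python main.py --token_level word-level \
        --model_type roberta \
        --model_dir "misca_lr${lr}_mw${mw}" \
        --task mixatis \
        --data_dir data \
        --attention_mode label \
        --do_train --do_eval \
        --num_train_epochs 100 \
        --learning_rate "${lr}" \
        --intent_loss_coef "${mw}" \
        --num_intent_detection \
        --use_crf \
        --base_model base_dir \
        --intent_slot_attn_type coattention
  done
done
```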

BillKiller commented 3 months ago

I tried the range of hyperparameters you mentioned, but there is still a huge gap between the scores I got and the scores claimed in the paper. Can you provide the detailed training setup you used at the time, such as the GPU model, the exact hyperparameter combination, and the random seed value?

SJY8460 commented 3 months ago

I cannot reproduce the performance either. I hope the authors can provide more detailed information.

thinhphp commented 1 month ago

You can try searching with a smaller step size and/or training via an intermediate stage (i.e., freezing the RoBERTa part first, then fine-tuning the whole model).

BillKiller commented 1 month ago

How many seeds did you run your experiments with? Is this result obtained from just a single run? What is the p-value for your improvements, and is the improvement significant for every metric? I've tried many times but cannot reproduce the results in your paper. Please do not close the issue before we can reproduce the scores claimed in the paper. Sincerely seeking your guidance.

BillKiller commented 1 month ago

Also, anyone who has successfully reproduced the results is welcome to provide me with some guidance. I would be very grateful!

leevisin commented 3 weeks ago

I can't reproduce the results either, especially for the jointlstm: the overall accuracy is too low on both datasets.

thinhphp commented 2 weeks ago

We have checked and updated the instructions with more detail. In general, for the model with a PLM, after obtaining the "base" model, we load it and freeze the PLM encoder (simply add .detach() after the encoder output). The final stage is fine-tuning the full model; remember to perform a grid search to make sure it achieves the best performance. In our experiments, we use this checkpoint for MixATIS and this checkpoint for MixSNIPS as the base models. In the case of MixATIS, you could try learning rate 3e-5 (freezing) and 3e-6 (after freezing).
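As a sketch, the staged recipe could look like the commands below. Flag names are copied from the commands earlier in this thread; the freezing itself is a code edit (adding .detach() after the encoder output), not a flag, and pointing --base_model at the previous stage's directory is an assumption rather than the authors' exact script:

```bash
# Stage 2 (sketch): load the base model and train the co-attention model at a
# higher learning rate, with the PLM encoder frozen via .detach() in the model code.
python main.py --token_level word-level \
    --model_type roberta \
    --model_dir misca_frozen \
    --task mixatis \
    --data_dir data \
    --attention_mode label \
    --do_train --do_eval \
    --num_train_epochs 100 \
    --learning_rate 3e-5 \
    --num_intent_detection \
    --use_crf \
    --base_model base_dir \
    --intent_slot_attn_type coattention

# Stage 3 (sketch): remove the .detach() and fine-tune the full model at a
# smaller learning rate, starting from the stage-2 output (assumed to be
# loadable via --base_model).
python main.py --token_level word-level \
    --model_type roberta \
    --model_dir misca_full \
    --task mixatis \
    --data_dir data \
    --attention_mode label \
    --do_train --do_eval \
    --num_train_epochs 100 \
    --learning_rate 3e-6 \
    --num_intent_detection \
    --use_crf \
    --base_model misca_frozen \
    --intent_slot_attn_type coattention
```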

@leevisin For the LSTM model, we have to set the --only_intent argument to 0.1, so that the first 10% of epochs only optimize intent detection. We should also set --max_freq to 5 or 10 for the MixATIS dataset, which filters rare tokens to UNK. Please remember that the default parameters we set are for MixATIS with a PLM, so you will need to change them as we suggest in the paper or perform a grid search.
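For reference, a hypothetical invocation for the LSTM variant on MixATIS might look like the following. Only --only_intent and --max_freq come from the comment above; the --model_type value and the remaining flags are assumptions borrowed from the other commands in this thread, so check the repository's argument parser for the exact names:

```bash
# Hypothetical sketch for the non-PLM (LSTM) variant. --model_type lstm is an
# assumption; only --only_intent and --max_freq are confirmed in this thread.
python main.py --token_level word-level \
    --model_type lstm \
    --model_dir lstm_mixatis \
    --task mixatis \
    --data_dir data \
    --do_train --do_eval \
    --num_train_epochs 100 \
    --only_intent 0.1 \
    --max_freq 5 \
    --num_intent_detection \
    --use_crf
```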

Hope this helps. Should you have any further questions, do not hesitate to contact me at thinhphp.nlp@gmail.com, where I check the inbox more often.