Closed: BillKiller closed this issue 1 month ago
The performance when training from the given base checkpoint (downloaded from the provided Google Drive link):

```
03/31/2024 04:46:48 - INFO - trainer - Eval results
03/31/2024 04:46:48 - INFO - trainer - epoch = -1
03/31/2024 04:46:48 - INFO - trainer - intent_acc = 0.7729468599033816
03/31/2024 04:46:48 - INFO - trainer - intent_f1 = 0.8870732419025387
03/31/2024 04:46:48 - INFO - trainer - loss = 7.091539016136756
03/31/2024 04:46:48 - INFO - trainer - mean_intent_slot = 0.8245242334768335
03/31/2024 04:46:48 - INFO - trainer - num_acc = 0.9963768115942029
03/31/2024 04:46:48 - INFO - trainer - semantic_frame_acc = 0.49033816425120774
03/31/2024 04:46:48 - INFO - trainer - slot_acc = 0.5289855072463768
03/31/2024 04:46:48 - INFO - trainer - slot_f1 = 0.8761016070502852
03/31/2024 04:46:48 - INFO - trainer - slot_precision = 0.8596134282807731
03/31/2024 04:46:48 - INFO - trainer - slot_recall = 0.8932346723044398
```
Run from scratch, step 1:

```
python main.py --token_level word-level \
    --model_type roberta \
    --model_dir dir_base \
    --task mixatis \
    --data_dir data \
    --attention_mode label \
    --do_train \
    --do_eval \
    --num_train_epochs 100 \
    --intent_loss_coef 0.5 \
    --learning_rate 1e-5 \
    --train_batch_size 32 \
    --num_intent_detection \
    --use_crf
```
Step 2:

```
python main.py --token_level word-level \
    --model_type roberta \
    --model_dir misca \
    --task mixatis \
    --data_dir data \
    --attention_mode label \
    --do_train \
    --do_eval \
    --model_dir best_model \
    --num_train_epochs 100 \
    --learning_rate 1e-5 \
    --num_intent_detection \
    --use_crf \
    --base_model base_dir \
    --intent_slot_attn_type coattention
```
The results when training from scratch:

```
03/31/2024 03:18:59 - INFO - trainer - Eval results
03/31/2024 03:18:59 - INFO - trainer - epoch = -1
03/31/2024 03:18:59 - INFO - trainer - intent_acc = 0.7693236714975845
03/31/2024 03:18:59 - INFO - trainer - intent_f1 = 0.884155237817333
03/31/2024 03:18:59 - INFO - trainer - loss = 7.854695576887864
03/31/2024 03:18:59 - INFO - trainer - mean_intent_slot = 0.821632243779036
03/31/2024 03:18:59 - INFO - trainer - num_acc = 0.9963768115942029
03/31/2024 03:18:59 - INFO - trainer - semantic_frame_acc = 0.48188405797101447
03/31/2024 03:18:59 - INFO - trainer - slot_acc = 0.538647342995169
03/31/2024 03:18:59 - INFO - trainer - slot_f1 = 0.8739408160604876
03/31/2024 03:18:59 - INFO - trainer - slot_precision = 0.8623617185490096
03/31/2024 03:18:59 - INFO - trainer - slot_recall = 0.8858350951374208
```
We ran the code exactly as described in the README, but we did not get the results reported in the paper.
Thanks for your interest in our work. Due to some stochastic factors, it is necessary to slightly tune the hyper-parameters with a grid search. In our experiments, we carefully tuned the hyper-parameters, including mixture_weight/learning_rate/... Hope this helps.
Could you provide a set of hyper-parameters, together with a specific random seed, that reproduces the results in the paper?
You can perform a grid search over these hyper-parameter ranges: learning rate in [1e-6, 5e-6], mixture weight in [0.1, 0.2, ..., 0.9]. Hope it helps.
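For anyone automating the search above, a minimal driver could look like this. The value ranges come from the reply; the `--mixture_weight` flag name is my assumption, so check `main.py`'s argument list before relying on it.

```python
# Hypothetical grid-search driver over the suggested ranges:
# learning rate 1e-6..5e-6, mixture weight 0.1..0.9.
import itertools

learning_rates = [1e-6, 2e-6, 3e-6, 4e-6, 5e-6]
mixture_weights = [round(0.1 * i, 1) for i in range(1, 10)]  # 0.1 .. 0.9

commands = [
    "python main.py --token_level word-level --model_type roberta "
    "--task mixatis --data_dir data --do_train --do_eval "
    f"--learning_rate {lr} --mixture_weight {mw}"
    for lr, mw in itertools.product(learning_rates, mixture_weights)
]

print(len(commands))  # 5 learning rates x 9 mixture weights = 45 runs
```

Each command can then be launched with `subprocess.run`, logging the eval metrics per configuration to pick the best one.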
I tried the range of hyper-parameters you mentioned, but there is still a large gap between the scores I got and those claimed in the paper. Could you provide the exact training setup you used, such as the GPU model, the full combination of hyper-parameters, and the random seed value?
I cannot reproduce the performance either. I hope the authors can provide more detailed information.
You can try searching with a smaller step and/or training via an intermediate stage (i.e., freeze the RoBERTa part first, then fine-tune the whole model).
How many seeds did you run your experiments with? Is this result from just a single run? What is the p-value for your improvements, and at what confidence level? Does it hold for every metric? I have tried many times but cannot reproduce the results in your paper. Please do not close this issue before we can reproduce the scores it claims. Sincerely seeking your guidance.
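As context for the multi-seed question: a common way to report such results is the mean and standard deviation over several seeds rather than a single run. A minimal sketch (the scores below are made up for illustration, not taken from the paper):

```python
# Multi-seed reporting sketch: run once per seed, collect a metric,
# then report mean +/- sample standard deviation.
import statistics

# e.g. semantic_frame_acc from five runs with different seeds (made-up values)
scores_per_seed = {0: 0.482, 1: 0.490, 2: 0.478, 3: 0.495, 4: 0.486}

mean = statistics.mean(scores_per_seed.values())
std = statistics.stdev(scores_per_seed.values())
print(f"semantic_frame_acc = {mean:.3f} +/- {std:.3f} "
      f"over {len(scores_per_seed)} seeds")
```

A paired significance test between two models would additionally need the per-seed scores of both systems.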
Also, anyone who has successfully reproduced the results is welcome to provide me with some guidance. I would be very grateful!
I can't reproduce the results either. Especially with the jointlstm, the overall accuracy is too low on both datasets.
We have checked the code and updated the instructions with more detail. In general, for the model with a PLM, after obtaining the "base" model, we load it and freeze the PLM encoder (simply add `.detach()` after the encoder output). The final stage is fine-tuning the full model; remember to perform a grid search to make sure it achieves the best performance. In our experiments, we used this checkpoint for MixATIS and this checkpoint for MixSNIPS as the base model. For MixATIS, you could try a learning rate of 3e-5 (while freezing) and 3e-6 (after unfreezing).
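The `.detach()` trick described above can be sketched as follows. This is a toy PyTorch model, not the repository's actual code: `nn.Linear` stands in for the RoBERTa encoder and the task heads, and the class name is illustrative.

```python
# Sketch of freezing a PLM encoder via .detach(): gradients from the
# task heads stop at the detach point, so the encoder is not updated.
import torch
import torch.nn as nn

class TinyJointModel(nn.Module):
    def __init__(self, freeze_encoder=False):
        super().__init__()
        self.encoder = nn.Linear(8, 8)   # stand-in for the RoBERTa encoder
        self.head = nn.Linear(8, 2)      # stand-in for the intent/slot heads
        self.freeze_encoder = freeze_encoder

    def forward(self, x):
        h = self.encoder(x)
        if self.freeze_encoder:
            h = h.detach()               # block gradients into the encoder
        return self.head(h)

model = TinyJointModel(freeze_encoder=True)
model(torch.randn(4, 8)).sum().backward()
assert model.encoder.weight.grad is None      # encoder stays frozen
assert model.head.weight.grad is not None     # heads still train
```

For the final fine-tuning stage, constructing the model with `freeze_encoder=False` restores gradient flow through the encoder.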
@leevisin For the LSTM, we have to set the `--only_intent` argument to 0.1, so that the first 10% of epochs only optimize intent detection. We should also set `--max_freq` to 5 or 10 for the MixATIS dataset, which filters rare tokens to UNK. Please remember that the default parameters we set are for MixATIS with a PLM, so you will need to change them as we suggest in the paper or perform a grid search.
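The rare-token filtering can be pictured roughly like this. This is a sketch of the *assumed* semantics of `--max_freq` (tokens occurring fewer than the threshold are mapped to UNK), not the repository's actual implementation; the function name is made up.

```python
# Assumed semantics of --max_freq: tokens that occur fewer than
# `max_freq` times in the training data are replaced with UNK.
from collections import Counter

def filter_rare_tokens(sentences, max_freq=5, unk="UNK"):
    counts = Counter(tok for sent in sentences for tok in sent)
    return [
        [tok if counts[tok] >= max_freq else unk for tok in sent]
        for sent in sentences
    ]

train = [["show", "me", "flights"], ["show", "me", "fares"]]
print(filter_rare_tokens(train, max_freq=2))
# [['show', 'me', 'UNK'], ['show', 'me', 'UNK']]
```

Mapping rare tokens to UNK shrinks the LSTM's vocabulary and reduces overfitting to words seen only a handful of times.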
Hope it helps. Should you have any further questions, do not hesitate to contact me at thinhphp.nlp@gmail.com, where I check the inbox more often.
Hello, I tried to reproduce the model's performance both by training from scratch and from the given base-model checkpoint. However, neither approach reproduces the performance claimed in the paper. Could you give me more details on how to reproduce it?
I downloaded the given checkpoint into the best_model folder and ran the evaluation script:

```
python main.py --token_level word-level \
    --model_type roberta \
    --model_dir misca \
    --task mixatis \
    --data_dir data \
    --attention_mode label \
    --do_train \
    --do_eval \
    --model_dir best_model \
    --num_train_epochs 100 \
    --learning_rate 1e-5 \
    --num_intent_detection \
    --use_crf \
    --intent_slot_attn_type coattention
```
However, I got much lower performance.