Hi @datarpit, thank you for your comment. The baseline reported in the README does use num_train_epochs=1, but please feel free to do a further hyperparameter search.
The GPT2-based model does show some train/inference variability, but it's odd that the trained model is achieving accuracy below 1%. Which metric (act/slot F1/precision/recall) are you referring to? Also, please note that the reported numbers are fractions (not percentages), so the maximum value would be 1.0 (=100%) for all metrics.
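For a quick sanity check, you can load the report file and print each metric both as a fraction and as a percentage; a minimal sketch, assuming the report path follows the repo's results layout:

```python
# Minimal sketch: print each reported metric as a fraction and as a
# percentage. The path below is an assumption based on the repo layout.
import json

with open("mm_dst/gpt2_dst/results/fashion/fashion_devtest_dials_report.json") as f:
    report = json.load(f)

for metric, value in report.items():
    print(f"{metric}: {value:.4f} ({100 * value:.2f}%)")
```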
The numbers I see are consistently low, around 0.0012. During inference, for most of devtest the model doesn't predict anything for the belief state, only the system response. When I increased the epochs to 100, the results changed to the below, but they are still much lower than what the README describes.
fashion

```json
{
  "joint_accuracy": 0.14777937901218394,
  "act_rec": 0.2267784619415695,
  "act_prec": 0.7445161290322581,
  "act_f1": 0.34766017272544686,
  "slot_rec": 0.24814931485273273,
  "slot_prec": 0.7452696310312205,
  "slot_f1": 0.37232659813304975
}
```

fashion_to

```json
{
  "joint_accuracy": 0.052142014935150006,
  "act_rec": 0.09236211188261496,
  "act_prec": 0.7654723127035831,
  "act_f1": 0.16483516483516483,
  "slot_rec": 0.09867695700110253,
  "slot_prec": 0.6976614699331849,
  "slot_f1": 0.17289913067476195
}
```
Another thing I wanted to mention: the evaluation script seems to pick targets from a folder `gpt2_dst/data/v2`, but there is no such folder, so I had to change the script to pick from `gpt2_dst/data/` instead. Can you please check that everything is in order and there isn't a silly bug?
Hi @datarpit, thank you for sharing the results. Would you mind sharing the training configuration you used as well (e.g. `--n_gpu`, `--nocuda`, batch size, fp16 training, etc.), and perhaps the version of the GPT2 model? Alternatively, if you have a log file of the training process, I'll take a look. We'll share the baseline checkpoints soon, which should mitigate the issue for now.
Also, thank you for catching `gpt2_dst/data/v2`; it should read `gpt2_dst/data`, and the fix is now pushed.
I didn't change anything in the script except that I ran it with CUDA_VISIBLE_DEVICES=0. That would be great.
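For what it's worth, that only one device is visible under CUDA_VISIBLE_DEVICES=0 can be confirmed with a trivial check; nothing repo-specific here:

```python
# With CUDA_VISIBLE_DEVICES=0, PyTorch should report exactly one
# visible GPU, so DataParallel is effectively disabled.
import torch

print(torch.cuda.device_count())  # expected: 1
```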
Same issue. I ran the baseline using the code in the repo (without any changes), and the F1 numbers I see are lower than what's reported.
Here's what I got for "Fashion (multimodal)".
Obtained results:
```
~/simmc/mm_dst/gpt2_dst/results/fashion
❯ jq . fashion_devtest_dials_report.json
{
  "joint_accuracy": 0.06052666055286257,
  "act_rec": 0.09603039434036421,
  "act_prec": 0.5154711673699015,
  "act_f1": 0.16189950303699616,
  "slot_rec": 0.08459525885990644,
  "slot_prec": 0.5049692380501657,
  "slot_f1": 0.1449137579790846
}
```
Expected/Reported baseline results:

| Baseline | Dialog Act F1 | Slot F1 |
|---|---|---|
| GPT2 - Fashion (multimodal) | 44.3 | 46.6 |
Note: I ran this on just 1 GPU; multi-GPU training was throwing the following error during the eval step.
```
07/17/2020 00:15:53 - INFO - __main__ - ***** Running evaluation *****
07/17/2020 00:15:53 - INFO - __main__ -   Num examples = 3513
07/17/2020 00:15:53 - INFO - __main__ -   Batch size = 32
Evaluating: 99%|█████████▉| 109/110 [00:26<00:00, 4.18it/s]
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ec2-user/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/chetnaik/simmc/mm_dst/gpt2_dst/scripts/run_language_modeling.py", line 821, in <module>
    main()
  File "/home/chetnaik/simmc/mm_dst/gpt2_dst/scripts/run_language_modeling.py", line 813, in main
    result = evaluate(args, model, tokenizer, prefix=prefix)
  File "/home/chetnaik/simmc/mm_dst/gpt2_dst/scripts/run_language_modeling.py", line 459, in evaluate
    outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
    return self.gather(outputs, self.output_device)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/cuda/comm.py", line 165, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [2, 1, 12, 168, 64], but expected [2, 4, 12, 168, 64]
```
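For context, this gather error typically fires when the final evaluation batch (here 3513 % 32 = 25 examples) is smaller than the rest and splits unevenly across GPUs, so the per-device GPT2 cache tensors disagree in their batch dimension. A minimal workaround sketch, which is only a guess at the eventual patch mentioned below, is to drop the last partial batch in the eval DataLoader:

```python
# Hypothetical workaround (not necessarily the official patch): drop the
# uneven final eval batch so DataParallel can gather the per-device GPT2
# past key/value tensors, at the cost of skipping a few examples.
# `eval_dataset` and `args` are assumed to come from evaluate() in
# run_language_modeling.py.
from torch.utils.data import DataLoader, SequentialSampler

eval_dataloader = DataLoader(
    eval_dataset,
    sampler=SequentialSampler(eval_dataset),
    batch_size=args.eval_batch_size,
    drop_last=True,  # skip the final partial batch
)
```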
Hi @chetannaik, re. the multi-GPU crashes: I've seen the same and have a patch which I'll try to push this week.
Hi @chetannaik, the patch for the issue above has just been pushed.
@datarpit @chetannaik - please take a look at the model snapshots for the MM-DST baselines (link), which should give you a good starting point; please feel free to fine-tune them further, etc. You can download them and put them under `/simmc/mm_dst/save/`.
The README file has been updated with the results obtained with these snapshots (trained with 2 GPUs; you can load `training_args.bin` for more details). Since the `n_gpu` of the machine effectively changes the batch size for training (to which the GPT2 model is very sensitive), it is recommended that you find the epoch count & batch size that work best (among other hyperparameters) to avoid overfitting & underfitting. Please feel free to re-open this if the issue persists after a hyperparameter sweep. Thank you!
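As a rough illustration of that advice, the snapshot's configuration can be inspected and the effective batch size computed; the path and attribute names below are assumptions based on the HF-style example scripts, which save the argparse namespace to training_args.bin:

```python
# Sketch: inspect the released snapshot's training configuration.
# The path and attribute names are assumptions based on the HF-style
# scripts this baseline derives from.
import torch

train_args = torch.load("mm_dst/save/fashion/training_args.bin")
print(train_args)

# The effective train batch size scales with the number of GPUs, so on
# a single-GPU machine you may need to raise the per-GPU batch size (or
# use gradient accumulation) to match the released 2-GPU setting.
effective = (train_args.per_gpu_train_batch_size
             * max(1, train_args.n_gpu)
             * train_args.gradient_accumulation_steps)
print(f"effective train batch size: {effective}")
```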
I encountered the same issue, and I was using Transformers v3.0.2. Switching to v2.8.0 seems to solve the problem. @shanemoon Can you share your exact version? Thanks. 😃
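For anyone else hitting this, the installed version is easy to check and pin; the version numbers are only the ones reported in this thread, not an official requirement:

```python
# Check the installed Transformers version. Per this thread, v2.8.0
# reportedly reproduces the baseline while v3.0.2 does not.
# To pin:  pip install transformers==2.8.0
import transformers

print(transformers.__version__)
```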
Are the hyperparameters referenced here correct? The accuracy after training with them is less than 1%.