Hi @datarpit, thank you for your comment. The baseline reported in the README does use num_train_epochs=1, but please feel free to do a further hyperparameter search.
The GPT2-based model does show some train/inference variability, but it's odd that the trained model is achieving accuracy below 1%. Which metric (act/slot F1/precision/recall) are you referring to? Also, please note that the reported numbers are fractions (not percentages), so the maximum value would be 1.0 (=100%) for all metrics.
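For a quick sanity check, you can load the report file and print each metric both as a fraction and as a percentage; a minimal sketch, assuming the report path follows the repo's results layout:

```python
# Minimal sketch: print each reported metric as a fraction and as a
# percentage. The path below is an assumption based on the repo layout.
import json

with open("mm_dst/gpt2_dst/results/fashion/fashion_devtest_dials_report.json") as f:
    report = json.load(f)

for metric, value in report.items():
    print(f"{metric}: {value:.4f} ({100 * value:.2f}%)")
```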
The numbers I see are consistently low, around 0.0012. During inference, for most of devtest the model doesn't predict anything for the belief state, only the system response. When I increased the epochs to 100, the results changed to the below, but they are still much lower than what the README describes.
fashion

```json
{
  "joint_accuracy": 0.14777937901218394,
  "act_rec": 0.2267784619415695,
  "act_prec": 0.7445161290322581,
  "act_f1": 0.34766017272544686,
  "slot_rec": 0.24814931485273273,
  "slot_prec": 0.7452696310312205,
  "slot_f1": 0.37232659813304975
}
```

fashion_to

```json
{
  "joint_accuracy": 0.052142014935150006,
  "act_rec": 0.09236211188261496,
  "act_prec": 0.7654723127035831,
  "act_f1": 0.16483516483516483,
  "slot_rec": 0.09867695700110253,
  "slot_prec": 0.6976614699331849,
  "slot_f1": 0.17289913067476195
}
```
Another thing I wanted to mention: the evaluation script seems to pick targets from a folder `gpt2_dst/data/v2`, but there is no such folder, so I had to change the script to pick from `gpt2_dst/data/` instead. Can you please check that everything is in order and there isn't a silly bug?
Hi @datarpit, thank you for sharing the results. Would you mind sharing the training configuration you used as well (e.g. `--n_gpu`, `--nocuda`, batch size, fp16 training, etc.), and perhaps the version of the GPT2 model? Alternatively, if you have a log file of the training process, I'll take a look. We'll share the baseline checkpoints soon, which should mitigate the issue for now.
Also, thank you for catching `gpt2_dst/data/v2`; it should read `gpt2_dst/data`, and the fix is now pushed.
I didn't change anything in the script except that I ran it with CUDA_VISIBLE_DEVICES=0. That would be great.
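For what it's worth, that only one device is visible under CUDA_VISIBLE_DEVICES=0 can be confirmed with a trivial check; nothing repo-specific here:

```python
# With CUDA_VISIBLE_DEVICES=0, PyTorch should report exactly one
# visible GPU, so DataParallel is effectively disabled.
import torch

print(torch.cuda.device_count())  # expected: 1
```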
Same issue. I ran the baseline using the code in the repo (without any changes), and the F1 numbers I see are lower than what's reported.
Here's what I got for "Fashion (multimodal)".
Obtained results:
```
~/simmc/mm_dst/gpt2_dst/results/fashion
❯ jq . fashion_devtest_dials_report.json
{
  "joint_accuracy": 0.06052666055286257,
  "act_rec": 0.09603039434036421,
  "act_prec": 0.5154711673699015,
  "act_f1": 0.16189950303699616,
  "slot_rec": 0.08459525885990644,
  "slot_prec": 0.5049692380501657,
  "slot_f1": 0.1449137579790846
}
```
Expected/Reported baseline results:

| Baseline | Dialog Act F1 | Slot F1 |
|---|---|---|
| GPT2 - Fashion (multimodal) | 44.3 | 46.6 |
Note: I ran this on just 1 GPU; multi-GPU training was throwing the following error during the eval step.
```
07/17/2020 00:15:53 - INFO - __main__ - ***** Running evaluation *****
07/17/2020 00:15:53 - INFO - __main__ -   Num examples = 3513
07/17/2020 00:15:53 - INFO - __main__ -   Batch size = 32
Evaluating: 99%|█████████▉| 109/110 [00:26<00:00, 4.18it/s]
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ec2-user/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/chetnaik/simmc/mm_dst/gpt2_dst/scripts/run_language_modeling.py", line 821, in <module>
    main()
  File "/home/chetnaik/simmc/mm_dst/gpt2_dst/scripts/run_language_modeling.py", line 813, in main
    result = evaluate(args, model, tokenizer, prefix=prefix)
  File "/home/chetnaik/simmc/mm_dst/gpt2_dst/scripts/run_language_modeling.py", line 459, in evaluate
    outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
    return self.gather(outputs, self.output_device)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/cuda/comm.py", line 165, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [2, 1, 12, 168, 64], but expected [2, 4, 12, 168, 64]
```
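For context, this gather error typically fires when the final evaluation batch (here 3513 % 32 = 25 examples) is smaller than the rest and splits unevenly across GPUs, so the per-device GPT2 cache tensors disagree in their batch dimension. A minimal workaround sketch, which is only a guess at the eventual patch mentioned below, is to drop the last partial batch in the eval DataLoader:

```python
# Hypothetical workaround (not necessarily the official patch): drop the
# uneven final eval batch so DataParallel can gather the per-device GPT2
# past key/value tensors, at the cost of skipping a few examples.
# `eval_dataset` and `args` are assumed to come from evaluate() in
# run_language_modeling.py.
from torch.utils.data import DataLoader, SequentialSampler

eval_dataloader = DataLoader(
    eval_dataset,
    sampler=SequentialSampler(eval_dataset),
    batch_size=args.eval_batch_size,
    drop_last=True,  # skip the final partial batch
)
```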
Hi @chetannaik, re. the multi-GPU crashes: I've seen the same and have a patch which I'll try to push this week.
Hi @chetannaik, the patch for the issue above has just been pushed.
@datarpit @chetannaik - please take a look at the model snapshots for the MM-DST baselines (link), which should give you a good starting point; please feel free to fine-tune them further, etc. You can download them and put them under `/simmc/mm_dst/save/`.
The README file has been updated with the results obtained with these snapshots (trained with 2 GPUs; you can load `training_args.bin` for more details). Since the `n_gpu` of the machine effectively changes the batch size for training (to which the GPT2 model is very sensitive), it is recommended that you find the epoch count & batch size that work best (among other hyperparameters) to avoid overfitting & underfitting. Please feel free to re-open this if the issue persists after a hyperparameter sweep. Thank you!
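As a rough illustration of that advice, the snapshot's configuration can be inspected and the effective batch size computed; the path and attribute names below are assumptions based on the HF-style example scripts, which save the argparse namespace to training_args.bin:

```python
# Sketch: inspect the released snapshot's training configuration.
# The path and attribute names are assumptions based on the HF-style
# scripts this baseline derives from.
import torch

train_args = torch.load("mm_dst/save/fashion/training_args.bin")
print(train_args)

# The effective train batch size scales with the number of GPUs, so on
# a single-GPU machine you may need to raise the per-GPU batch size (or
# use gradient accumulation) to match the released 2-GPU setting.
effective = (train_args.per_gpu_train_batch_size
             * max(1, train_args.n_gpu)
             * train_args.gradient_accumulation_steps)
print(f"effective train batch size: {effective}")
```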
I encountered the same issue, and I was using Transformers v3.0.2. Switching to v2.8.0 seems to solve the problem. @shanemoon Can you share your exact version? Thanks. 😃
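For anyone else hitting this, the installed version is easy to check and pin; the version numbers are only the ones reported in this thread, not an official requirement:

```python
# Check the installed Transformers version. Per this thread, v2.8.0
# reportedly reproduces the baseline while v3.0.2 does not.
# To pin:  pip install transformers==2.8.0
import transformers

print(transformers.__version__)
```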
Are the hyperparameters referenced here correct? The accuracy after training with them is less than 1%.