facebookresearch / simmc

With the aim of building next-generation virtual assistants that can handle multimodal inputs and perform multimodal actions, we introduce two new datasets (both in the virtual shopping domain), the annotation schema, the core technical tasks, and the baseline models. The code for the baselines and the datasets will be open-sourced.

Incorrect Hyperparameters ? #13

Closed datarpit closed 4 years ago

datarpit commented 4 years ago

Are the hyperparameters referenced here correct? The accuracy after training with them is less than 1%.

shanemoon commented 4 years ago

Hi @datarpit, thank you for your comment. The baseline reported in README does use num_train_epochs=1, but please feel free to do further hyperparameter search.

The GPT2-based model does seem to have a bit of train/inference variability, but it's odd that the trained model is achieving accuracy less than 1%. Which metric (act/slot F1/prec/recall) are you referring to? Also please note that the reported numbers are fractions (not percentages), so the maximum value is 1.0 (=100%) for all metrics.

datarpit commented 4 years ago

The numbers I see are consistently low, around 0.0012. During inference, for most of devtest the model doesn't predict anything for the belief state, only the system response. When I increased the epochs to 100, the results changed to those below, but they are still much lower than what the README describes.

fashion

{
  "joint_accuracy": 0.14777937901218394,
  "act_rec": 0.2267784619415695,
  "act_prec": 0.7445161290322581,
  "act_f1": 0.34766017272544686,
  "slot_rec": 0.24814931485273273,
  "slot_prec": 0.7452696310312205,
  "slot_f1": 0.37232659813304975
}

fashion_to

{
  "joint_accuracy": 0.052142014935150006,
  "act_rec": 0.09236211188261496,
  "act_prec": 0.7654723127035831,
  "act_f1": 0.16483516483516483,
  "slot_rec": 0.09867695700110253,
  "slot_prec": 0.6976614699331849,
  "slot_f1": 0.17289913067476195
}

Another thing I wanted to mention: the evaluation script seems to pick targets from a folder gpt2_dst/data/v2, but there is no such folder, so I had to change the script to pick them from gpt2_dst/data/. Can you please check that everything is in order and there isn't a silly bug?

shanemoon commented 4 years ago

Hi @datarpit, thank you for sharing the results. Would you mind sharing the train configuration you used as well (e.g. --n_gpu, --nocuda, batchsize, fp16 training, etc.), and perhaps the version of the gpt2 model please? Alternatively, if you have a log file of the training process, I'll take a look. We'll soon share the baseline checkpoints to mitigate the issue for now.

Also thank you for catching gpt2_dst/data/v2, it should read gpt2_dst/data and the fix is now pushed.

datarpit commented 4 years ago

I didn't change anything in the script except that I ran it with CUDA_VISIBLE_DEVICES=0. Sharing the baseline checkpoints would be great.

chetannaik commented 4 years ago

Same issue. I ran the baseline using the code in the repo (without any changes), and the F1 numbers I see are lower than what's reported.

Here's what I got for "Fashion (multimodal)",

Obtained results:

~/simmc/mm_dst/gpt2_dst/results/fashion
❯ jq . fashion_devtest_dials_report.json
{
  "joint_accuracy": 0.06052666055286257,
  "act_rec": 0.09603039434036421,
  "act_prec": 0.5154711673699015,
  "act_f1": 0.16189950303699616,
  "slot_rec": 0.08459525885990644,
  "slot_prec": 0.5049692380501657,
  "slot_f1": 0.1449137579790846
}

Expected/Reported baseline results:

Baseline | Dialog Act F1 | Slot F1
GPT2 - Fashion (multimodal) | 44.3 | 46.6
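
For anyone comparing the two sets of numbers: the report file uses fractions, while the table above is quoted in percent. Below is a minimal sketch of how the comparison could be done, assuming the report's F1 is the usual harmonic mean of precision and recall (the values above are consistent with that). The helper name `f1_score` is only for illustration and is not taken from the repo.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Values copied from the fashion_devtest_dials_report.json output above (fractions).
report = {
    "act_rec": 0.09603039434036421,
    "act_prec": 0.5154711673699015,
    "act_f1": 0.16189950303699616,
    "slot_rec": 0.08459525885990644,
    "slot_prec": 0.5049692380501657,
    "slot_f1": 0.1449137579790846,
}

# Baseline numbers quoted from the table above (percent).
reported_pct = {"act": 44.3, "slot": 46.6}

for name in ("act", "slot"):
    # Recompute F1 from precision/recall as a sanity check on the report.
    recomputed = f1_score(report[f"{name}_prec"], report[f"{name}_rec"])
    # Convert the fraction to percent before comparing with the table.
    obtained_pct = 100.0 * report[f"{name}_f1"]
    gap = reported_pct[name] - obtained_pct
    print(
        f"{name}: recomputed F1 = {recomputed:.4f}, "
        f"obtained = {obtained_pct:.1f}%, gap = {gap:.1f} points"
    )
```

In both cases precision is moderate and recall is very low, which is what pulls F1 down.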

Note: I ran this on just 1 GPU; multi-GPU training was throwing the following error during the eval step.

07/17/2020 00:15:53 - INFO - __main__ -   ***** Running evaluation  *****
07/17/2020 00:15:53 - INFO - __main__ -     Num examples = 3513
07/17/2020 00:15:53 - INFO - __main__ -     Batch size = 32
Evaluating:  99%|█████████▉| 109/110 [00:26<00:00,  4.18it/s]
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ec2-user/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/chetnaik/simmc/mm_dst/gpt2_dst/scripts/run_language_modeling.py", line 821, in <module>
    main()
  File "/home/chetnaik/simmc/mm_dst/gpt2_dst/scripts/run_language_modeling.py", line 813, in main
    result = evaluate(args, model, tokenizer, prefix=prefix)
  File "/home/chetnaik/simmc/mm_dst/gpt2_dst/scripts/run_language_modeling.py", line 459, in evaluate
    outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
    return self.gather(outputs, self.output_device)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/chetnaik/simmc_env/lib/python3.7/site-packages/torch/cuda/comm.py", line 165, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [2, 1, 12, 168, 64], but expected [2, 4, 12, 168, 64]
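
The last line of the traceback looks consistent with an uneven final batch. The log shows 3513 examples at batch size 32, so the 110th batch holds only 25 examples, and splitting 25 examples across GPUs can leave one replica with a smaller chunk than the rest. The sketch below is a rough illustration, assuming the batch is split the way `torch.chunk` splits a tensor and assuming 8 GPUs (the GPU count is not stated in the thread). The helper names are hypothetical.

```python
import math
from typing import List


def last_batch_size(num_examples: int, batch_size: int) -> int:
    """Size of the final batch when the dataset is not dropped to a multiple."""
    remainder = num_examples % batch_size
    return remainder if remainder else batch_size


def chunk_sizes(n: int, num_chunks: int) -> List[int]:
    """Chunk sizes in the style of torch.chunk: every chunk has
    ceil(n / num_chunks) items except possibly the last, and fewer
    chunks are returned if the items run out."""
    size = math.ceil(n / num_chunks)
    sizes = []
    remaining = n
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


num_examples, batch_size = 3513, 32  # from the log above
n_gpu = 8                            # assumption, not stated in the thread

print("batches:", math.ceil(num_examples / batch_size))
final = last_batch_size(num_examples, batch_size)
print("examples in final batch:", final)
print("per-GPU chunk sizes for the final batch:", chunk_sizes(final, n_gpu))
```

Under these assumptions one replica receives a single example while the others receive four. If the model also returns tensors whose batch axis is not the first one (the shapes in the error have the batch size at index 1), gathering along dimension 0 would fail exactly as shown.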

skiingpacman commented 4 years ago

Hi @chetannaik, regarding the multi-GPU crashes: I've seen the same and have a patch, which I'll try to push this week.

shanemoon commented 4 years ago

Hi @chetannaik, the patch for the issue above has just been pushed.

@datarpit @chetannaik - please take a look at the model snapshots for the MM-DST baselines (link), which should give a good starting point - please feel free to fine-tune them further, etc. You can download them and put them under /simmc/mm_dst/save/.

The README file has been updated with the results obtained with these snapshots (trained with 2 GPUs - you can load training_args.bin for more details). Since the n_gpu of the machine effectively changes the batch size for training (to which the GPT2 model is very sensitive), it is recommended that you find the epoch count and batch size that work best (among other hyperparameters) to avoid overfitting and underfitting. Please feel free to re-open this if the issue persists after a hyperparameter sweep. Thank you!
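
To make the n_gpu point concrete, here is a small sketch of the arithmetic, assuming the training script multiplies a per-GPU batch size by the number of GPUs. The function names, dataset size, and per-GPU batch size below are hypothetical and chosen only for illustration.

```python
import math


def effective_batch_size(per_gpu_batch_size: int, n_gpu: int) -> int:
    """Batch size per optimizer step when each GPU gets its own mini-batch."""
    return per_gpu_batch_size * max(1, n_gpu)


def optimizer_steps(num_examples: int, per_gpu_batch_size: int,
                    n_gpu: int, epochs: int = 1) -> int:
    """Number of parameter updates over the whole run."""
    batch = effective_batch_size(per_gpu_batch_size, n_gpu)
    return math.ceil(num_examples / batch) * epochs


# Hypothetical numbers, not taken from the repo or its README.
num_examples = 20000
per_gpu_batch_size = 4

for n_gpu in (1, 2):
    batch = effective_batch_size(per_gpu_batch_size, n_gpu)
    steps = optimizer_steps(num_examples, per_gpu_batch_size, n_gpu)
    print(f"n_gpu={n_gpu}: effective batch size={batch}, updates in 1 epoch={steps}")
```

With the same command line, a 1-GPU run and a 2-GPU run therefore train with different batch sizes and a different number of updates per epoch, so a setting tuned on one machine may not transfer directly to the other.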

cccntu commented 4 years ago

I encountered the same issue, and I was using Transformers v3.0.2. Switching to v2.8.0 seems to solve the problem. @shanemoon Can you share your exact version? Thanks. 😃
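
For anyone who wants to check their own environment, a quick sketch is below. It only reads the installed package version; the target version 2.8.0 comes from my observation above and is not confirmed by the maintainers. The helper name `installed_version` is just for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Version that worked in the report above (an observation, not an official pin).
KNOWN_GOOD = "2.8.0"

found = installed_version("transformers")
if found is None:
    print("transformers is not installed")
elif found != KNOWN_GOOD:
    print(f"transformers {found} installed; {KNOWN_GOOD} was reported to work")
else:
    print(f"transformers {found} matches the version reported to work")
```

If the versions differ, `pip install transformers==2.8.0` in a fresh virtual environment is one way to try the older release.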