facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

dialog_babi task's performance on MemNNs is bad. #499

Closed: jojonki closed this issue 6 years ago

jojonki commented 6 years ago

I am trying to reproduce the results of the following paper on ParlAI, but the performance looks poor compared to the original results reported in Learning End-to-End Goal-Oriented Dialog, https://arxiv.org/abs/1605.07683

[screenshot: results table from the paper]

I tested the following command. According to the paper, task 1's accuracy should be almost 100%, but it comes out quite low.

python examples/train_model.py -t dialog_babi:task:1 -m memnn
...
[ time:262s parleys:26042 ] {'total': 194, 'accuracy': 0.02577, 'f1': 0.08801, 'hits@k': {1: 0.0258, 5: 0.253, 10: 0.459, 100: 1.0}}

I am afraid the current MemNN implementation only supports the bAbI-20 tasks. For example, the number of dialog_babi candidates is 4212, but I found the following code that truncates the candidate list. Even after commenting this part out, the performance did not improve. https://github.com/facebookresearch/ParlAI/blob/e0f16e9168839be12f72d3431b9819cf3d51fe10/parlai/agents/memnn/memnn.py#L148


alexholdenmiller commented 6 years ago

Hi @jojonki, I ran the following and immediately got solid training accuracy:

python examples/train_model.py -t dialog_babi:task:1 -m memnn --dict-file /tmp/db_task1k.dict -vtim 180
...
[ time:2s parleys:97 ] {'total': 97, 'accuracy': 0.7526, 'f1': 0.8021, 'hits@k': {1: 0.753, 5: 0.907, 10: 0.938, 100: 1.0}}
...
valid:{'total': 6015, 'accuracy': 0.2991, 'f1': 0.3913, 'hits@k': {1: 0.299, 5: 0.669, 10: 0.672, 100: 0.722}}
[ new best accuracy: 0.2991 ]
...

Also note that the truncation of the candidate set happens only during training; it is there to speed training up, since otherwise the model would only take a gradient step after ranking all of the candidates.
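To make that concrete, the branch @jojonki pointed at only affects the training path. Here is a minimal sketch of the idea; the cap of 100 candidates and the function name are assumptions for illustration, not ParlAI's actual code (the real logic lives in parlai/agents/memnn/memnn.py):

import random

def select_candidates(all_cands, label, training, max_cands=100):
    # Evaluation ranks the full candidate set (all 4212 for dialog_babi),
    # so reported metrics are computed over every candidate.
    if not training or len(all_cands) <= max_cands:
        return list(all_cands)
    # Training keeps the gold label and samples random negatives, so a
    # gradient step does not have to wait on scoring every candidate.
    negatives = [c for c in all_cands if c != label]
    return [label] + random.sample(negatives, max_cands - 1)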

Note that you may have been missing a dictionary (by omitting the --dict-file or --model-file args), so the model would have been ranking sentences that were just varying numbers of "UNK" tokens.
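To see why the missing dictionary matters: without a loaded vocabulary, every word maps to the unknown token, so candidates differ only in length. A minimal sketch of the effect (the token name and lookup are illustrative, not ParlAI's actual DictionaryAgent):

vocab = {}  # empty because no --dict-file was loaded

def encode(sentence, vocab, unk='__UNK__'):
    # Any word missing from the vocabulary becomes the unknown token, so
    # with an empty vocab every candidate encodes to a run of UNKs.
    return [vocab.get(w, unk) for w in sentence.split()]

print(encode('what price range are you looking for', vocab))
# ['__UNK__', '__UNK__', '__UNK__', '__UNK__', '__UNK__', '__UNK__', '__UNK__']
print(encode('rome', vocab))
# ['__UNK__']  (indistinguishable from any other one-word candidate)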