Mem2Seq model accuracy stuck at 0

se4u commented 6 years ago

Hi,

I am trying to reproduce your experiments by running the first command in the README. My PyTorch version is 0.3, as shown below. I am evaluating after every epoch instead of only after the first epoch. As the bottom of the log shows, the model accuracy (F1) is still close to 0 after epoch 8, and the BLEU score is ~4.5.

Is this expected behavior?

$ python -c 'import torch; print(torch.__version__)'
0.3.0.post4
$ python main_train.py -lr=0.001 -layer=1 -hdd=12 -dr=0.0 -dec=Mem2Seq -bsz=2 -ds=kvr -t= -evalp=1
{'dataset': 'kvr', 'task': '', 'decoder': 'Mem2Seq', 'hidden': '12', 'batch': '2', 'learn': '0.001', 'drop': '0.0', 'unk_mask': 1, 'layer': '1', 'limit': -10000, 'path': None, 'test': None, 'sample': None, 'useKB': 1, 'entPtr': 0, 'evalp': '1', 'addName': ''}
08-10 12:47 Reading lines from data/KVR/train.txt
08-10 12:47 Pointer percentace= 0.4208753595747005 
08-10 12:47 Max responce Len: 80
08-10 12:47 Max Input Len: 249
08-10 12:47 Avg. User Utterances: 2.593814432989691
08-10 12:47 Avg. Bot Utterances: 2.593814432989691
08-10 12:47 Avg. KB results: 64.69896907216494
08-10 12:47 Avg. responce Len: 8.732273449920509
Sample:  [['dish_parking', 'poi', 'parking_garage', 'road_block_nearby', '2_miles'], ['2_miles', 'distance', 'dish_parking', 'PAD', 'PAD'], ['road_block_nearby', 'traffic_info', 'dish_parking', 'PAD', 'PAD'], ['parking_garage', 'poi_type', 'dish_parking', 'PAD', 'PAD'], ['550_alester_ave', 'address', 'dish_parking', 'PAD', 'PAD'], ['stanford_oval_parking', 'poi', 'parking_garage', 'no_traffic', '6_miles'], ['6_miles', 'distance', 'stanford_oval_parking', 'PAD', 'PAD'], ['no_traffic', 'traffic_info', 'stanford_oval_parking', 'PAD', 'PAD'], ['parking_garage', 'poi_type', 'stanford_oval_parking', 'PAD', 'PAD'], ['610_amarillo_ave', 'address', 'stanford_oval_parking', 'PAD', 'PAD'], ['willows_market', 'poi', 'grocery_store', 'car_collision_nearby', '4_miles'], ['4_miles', 'distance', 'willows_market', 'PAD', 'PAD'], ['car_collision_nearby', 'traffic_info', 'willows_market', 'PAD', 'PAD'], ['grocery_store', 'poi_type', 'willows_market', 'PAD', 'PAD'], ['409_bollard_st', 'address', 'willows_market', 'PAD', 'PAD'], ['the_westin', 'poi', 'rest_stop', 'moderate_traffic', '2_miles'], ['2_miles', 'distance', 'the_westin', 'PAD', 'PAD'], ['moderate_traffic', 'traffic_info', 'the_westin', 'PAD', 'PAD'], ['rest_stop', 'poi_type', 'the_westin', 'PAD', 'PAD'], ['329_el_camino_real', 'address', 'the_westin', 'PAD', 'PAD'], ['toms_house', 'poi', 'friends_house', 'heavy_traffic', '1_miles'], ['1_miles', 'distance', 'toms_house', 'PAD', 'PAD'], ['heavy_traffic', 'traffic_info', 'toms_house', 'PAD', 'PAD'], ['friends_house', 'poi_type', 'toms_house', 'PAD', 'PAD'], ['580_van_ness_ave', 'address', 'toms_house', 'PAD', 'PAD'], ['pizza_chicago', 'poi', 'pizza_restaurant', 'heavy_traffic', '4_miles'], ['4_miles', 'distance', 'pizza_chicago', 'PAD', 'PAD'], ['heavy_traffic', 'traffic_info', 'pizza_chicago', 'PAD', 'PAD'], ['pizza_restaurant', 'poi_type', 'pizza_chicago', 'PAD', 'PAD'], ['915_arbol_dr', 'address', 'pizza_chicago', 'PAD', 'PAD'], ['valero', 'poi', 'gas_station', 
'car_collision_nearby', '6_miles'], ['6_miles', 'distance', 'valero', 'PAD', 'PAD'], ['car_collision_nearby', 'traffic_info', 'valero', 'PAD', 'PAD'], ['gas_station', 'poi_type', 'valero', 'PAD', 'PAD'], ['200_alester_ave', 'address', 'valero', 'PAD', 'PAD'], ['mandarin_roots', 'poi', 'chinese_restaurant', 'no_traffic', '2_miles'], ['2_miles', 'distance', 'mandarin_roots', 'PAD', 'PAD'], ['no_traffic', 'traffic_info', 'mandarin_roots', 'PAD', 'PAD'], ['chinese_restaurant', 'poi_type', 'mandarin_roots', 'PAD', 'PAD'], ['271_springer_street', 'address', 'mandarin_roots', 'PAD', 'PAD'], ['where', '$u', 't1', 'PAD', 'PAD'], ['s', '$u', 't1', 'PAD', 'PAD'], ['the', '$u', 't1', 'PAD', 'PAD'], ['nearest', '$u', 't1', 'PAD', 'PAD'], ['parking_garage', '$u', 't1', 'PAD', 'PAD'], ['the', '$s', 't1', 'PAD', 'PAD'], ['nearest', '$s', 't1', 'PAD', 'PAD'], ['parking_garage', '$s', 't1', 'PAD', 'PAD'], ['is', '$s', 't1', 'PAD', 'PAD'], ['dish_parking', '$s', 't1', 'PAD', 'PAD'], ['at', '$s', 't1', 'PAD', 'PAD'], ['550_alester_ave', '$s', 't1', 'PAD', 'PAD'], ['would', '$s', 't1', 'PAD', 'PAD'], ['you', '$s', 't1', 'PAD', 'PAD'], ['like', '$s', 't1', 'PAD', 'PAD'], ['directions', '$s', 't1', 'PAD', 'PAD'], ['there', '$s', 't1', 'PAD', 'PAD'], ['yes', '$u', 't2', 'PAD', 'PAD'], ['please', '$u', 't2', 'PAD', 'PAD'], ['set', '$u', 't2', 'PAD', 'PAD'], ['directions', '$u', 't2', 'PAD', 'PAD'], ['via', '$u', 't2', 'PAD', 'PAD'], ['a', '$u', 't2', 'PAD', 'PAD'], ['route', '$u', 't2', 'PAD', 'PAD'], ['that', '$u', 't2', 'PAD', 'PAD'], ['avoids', '$u', 't2', 'PAD', 'PAD'], ['all', '$u', 't2', 'PAD', 'PAD'], ['heavy_traffic', '$u', 't2', 'PAD', 'PAD'], ['if', '$u', 't2', 'PAD', 'PAD'], ['possible', '$u', 't2', 'PAD', 'PAD'], ['$$$$', '$$$$', '$$$$', '$$$$', '$$$$']] it looks like there is a road block being reported on the route but i will still find the quickest route to 550_alester_ave [70, 70, 54, 56, 48, 62, 70, 70, 70, 70, 70, 45, 63, 70, 70, 70, 70, 70, 45, 70, 63, 70, 51] [0, 0, 1, 
1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1] ['550_alester_ave']
08-10 12:47 Reading lines from data/KVR/dev.txt
08-10 12:47 Pointer percentace= 0.4167286798630749 
08-10 12:47 Max responce Len: 87
08-10 12:47 Max Input Len: 264
08-10 12:47 Avg. User Utterances: 2.5728476821192054
08-10 12:47 Avg. Bot Utterances: 2.5728476821192054
08-10 12:47 Avg. KB results: 63.847682119205295
08-10 12:47 Avg. responce Len: 8.647361647361647
Sample:  [['make', '$u', 't1', 'PAD', 'PAD'], ['an', '$u', 't1', 'PAD', 'PAD'], ['appointment', '$u', 't1', 'PAD', 'PAD'], ['to', '$u', 't1', 'PAD', 'PAD'], ['reserve', '$u', 't1', 'PAD', 'PAD'], ['conference_room_100', '$u', 't1', 'PAD', 'PAD'], ['later', '$u', 't1', 'PAD', 'PAD'], ['this', '$u', 't1', 'PAD', 'PAD'], ['week', '$u', 't1', 'PAD', 'PAD'], ['for', '$u', 't1', 'PAD', 'PAD'], ['a', '$u', 't1', 'PAD', 'PAD'], ['meeting', '$u', 't1', 'PAD', 'PAD'], ['what', '$s', 't1', 'PAD', 'PAD'], ['day', '$s', 't1', 'PAD', 'PAD'], ['and', '$s', 't1', 'PAD', 'PAD'], ['time', '$s', 't1', 'PAD', 'PAD'], ['should', '$s', 't1', 'PAD', 'PAD'], ['i', '$s', 't1', 'PAD', 'PAD'], ['set', '$s', 't1', 'PAD', 'PAD'], ['an', '$s', 't1', 'PAD', 'PAD'], ['appointment', '$s', 't1', 'PAD', 'PAD'], ['to', '$s', 't1', 'PAD', 'PAD'], ['reserve', '$s', 't1', 'PAD', 'PAD'], ['the', '$s', 't1', 'PAD', 'PAD'], ['conference', '$s', 't1', 'PAD', 'PAD'], ['room', '$s', 't1', 'PAD', 'PAD'], ['monday', '$u', 't2', 'PAD', 'PAD'], ['at', '$u', 't2', 'PAD', 'PAD'], ['3pm', '$u', 't2', 'PAD', 'PAD'], ['$$$$', '$$$$', '$$$$', '$$$$', '$$$$']] i have made an appointment for monday at 3pm for the meeting [17, 29, 29, 19, 20, 9, 26, 27, 28, 9, 23, 11] [1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1] ['meeting', 'monday', '3pm']
08-10 12:47 Reading lines from data/KVR/test.txt
08-10 12:47 Pointer percentace= 0.4224432239869378 
08-10 12:47 Max responce Len: 36
08-10 12:47 Max Input Len: 228
08-10 12:47 Avg. User Utterances: 2.6546052631578947
08-10 12:47 Avg. Bot Utterances: 2.6546052631578947
08-10 12:47 Avg. KB results: 64.84539473684211
08-10 12:47 Avg. responce Len: 8.34820322180917
Sample:  [['remind', '$u', 't1', 'PAD', 'PAD'], ['me', '$u', 't1', 'PAD', 'PAD'], ['to', '$u', 't1', 'PAD', 'PAD'], ['take', '$u', 't1', 'PAD', 'PAD'], ['my', '$u', 't1', 'PAD', 'PAD'], ['pills', '$u', 't1', 'PAD', 'PAD'], ['what', '$s', 't1', 'PAD', 'PAD'], ['time', '$s', 't1', 'PAD', 'PAD'], ['do', '$s', 't1', 'PAD', 'PAD'], ['you', '$s', 't1', 'PAD', 'PAD'], ['need', '$s', 't1', 'PAD', 'PAD'], ['to', '$s', 't1', 'PAD', 'PAD'], ['take', '$s', 't1', 'PAD', 'PAD'], ['your', '$s', 't1', 'PAD', 'PAD'], ['pills', '$s', 't1', 'PAD', 'PAD'], ['i', '$u', 't2', 'PAD', 'PAD'], ['need', '$u', 't2', 'PAD', 'PAD'], ['to', '$u', 't2', 'PAD', 'PAD'], ['take', '$u', 't2', 'PAD', 'PAD'], ['my', '$u', 't2', 'PAD', 'PAD'], ['pills', '$u', 't2', 'PAD', 'PAD'], ['at', '$u', 't2', 'PAD', 'PAD'], ['7pm', '$u', 't2', 'PAD', 'PAD'], ['$$$$', '$$$$', '$$$$', '$$$$', '$$$$']] ok setting your medicine appointment for 7pm [23, 23, 13, 23, 23, 23, 22] [0, 0, 1, 0, 0, 0, 1] ['7pm']
08-10 12:47 Read 6290 sentence pairs train
08-10 12:47 Read 777 sentence pairs dev
08-10 12:47 Read 807 sentence pairs test
08-10 12:47 Max len Input 265 
08-10 12:47 Vocab_size 1554 
08-10 12:47 USE_CUDA=False
08-10 12:47 Epoch:0
L:6.63, VL:4.80, PL:1.83: 100%|███████████████████████████| 3145/3145 [00:43<00:00, 72.70it/s]
08-10 12:48 STARTING EVALUATION
R:0.0746,W:77.2260: 100%|███████████████████████████████████| 389/389 [00:17<00:00, 21.75it/s]
08-10 12:48 F1 SCORE:   0.0
08-10 12:48 F1 CAL: 0.0
08-10 12:48 F1 WET: 0.0
08-10 12:48 F1 NAV: 0.0
08-10 12:48 BLEU SCORE:0.0
08-10 12:48 MODEL SAVED
08-10 12:48 Epoch:1
L:5.81, VL:4.15, PL:1.66: 100%|███████████████████████████| 3145/3145 [00:45<00:00, 68.87it/s]
08-10 12:49 STARTING EVALUATION
R:0.0874,W:76.1113: 100%|███████████████████████████████████| 389/389 [00:20<00:00, 19.21it/s]
08-10 12:49 F1 SCORE:   0.00974817221770918
08-10 12:49 F1 CAL: 0.0
08-10 12:49 F1 WET: 0.017167381974248927
08-10 12:49 F1 NAV: 0.009111617312072893
08-10 12:49 BLEU SCORE:0.0
08-10 12:49 MODEL SAVED
08-10 12:49 Epoch:2
L:5.47, VL:3.85, PL:1.62: 100%|███████████████████████████| 3145/3145 [00:47<00:00, 66.57it/s]
08-10 12:50 STARTING EVALUATION
R:0.0900,W:72.6340: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 389/389 [00:23<00:00, 16.66it/s]
08-10 12:51 F1 SCORE:   0.01380991064175467
08-10 12:51 F1 CAL: 0.02147239263803681
08-10 12:51 F1 WET: 0.02145922746781116
08-10 12:51 F1 NAV: 0.0
08-10 12:51 BLEU SCORE:0.0
08-10 12:51 MODEL SAVED
Epoch     2: reducing learning rate of group 0 to 5.0000e-04.
08-10 12:51 Epoch:3
L:5.26, VL:3.68, PL:1.58: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3145/3145 [01:44<00:00, 30.20it/s]
08-10 12:52 STARTING EVALUATION
R:0.0797,W:72.2230: 100%|████████████████████████████████████| 389/389 [00:20<00:00, 18.97it/s]
08-10 12:53 F1 SCORE:   0.01299756295694557
08-10 12:53 F1 CAL: 0.015337423312883437
08-10 12:53 F1 WET: 0.023605150214592276
08-10 12:53 F1 NAV: 0.0
08-10 12:53 BLEU SCORE:1.7
08-10 12:53 MODEL SAVED
08-10 12:53 Epoch:4
L:5.16, VL:3.60, PL:1.56: 100%|████████████████████████████| 3145/3145 [00:54<00:00, 58.01it/s]
08-10 12:54 STARTING EVALUATION
R:0.0874,W:72.3398: 100%|████████████████████████████████████| 389/389 [00:21<00:00, 17.83it/s]
08-10 12:54 F1 SCORE:   0.014622258326563772
08-10 12:54 F1 CAL: 0.05214723926380368
08-10 12:54 F1 WET: 0.002145922746781116
08-10 12:54 F1 NAV: 0.0
08-10 12:54 BLEU SCORE:2.31
08-10 12:54 MODEL SAVED
08-10 12:54 Epoch:5
L:5.09, VL:3.54, PL:1.54: 100%|████████████████████████████| 3145/3145 [00:45<00:00, 69.83it/s]
08-10 12:55 STARTING EVALUATION
R:0.0925,W:70.8605: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 389/389 [00:20<00:00, 19.22it/s]
08-10 12:55 F1 SCORE:   0.01949634443541836
08-10 12:55 F1 CAL: 0.049079754601227
08-10 12:55 F1 WET: 0.015021459227467811
08-10 12:55 F1 NAV: 0.002277904328018223
08-10 12:55 BLEU SCORE:0.0
08-10 12:55 Epoch:6
L:5.02, VL:3.49, PL:1.53: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3145/3145 [01:33<00:00, 33.74it/s]
08-10 12:57 STARTING EVALUATION
R:0.0771,W:72.1792: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 389/389 [00:20<00:00, 19.14it/s]
08-10 12:57 F1 SCORE:   0.008123476848090982
08-10 12:57 F1 CAL: 0.027607361963190184
08-10 12:57 F1 WET: 0.002145922746781116
08-10 12:57 F1 NAV: 0.0
08-10 12:57 BLEU SCORE:2.62
08-10 12:57 MODEL SAVED
08-10 12:57 Epoch:7
L:4.98, VL:3.46, PL:1.52: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3145/3145 [01:15<00:00, 41.40it/s]
08-10 12:58 STARTING EVALUATION
R:0.0900,W:72.8743: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 389/389 [00:24<00:00, 15.91it/s]
08-10 12:59 F1 SCORE:   0.006498781478472785
08-10 12:59 F1 CAL: 0.018404907975460124
08-10 12:59 F1 WET: 0.0
08-10 12:59 F1 NAV: 0.004555808656036446
08-10 12:59 BLEU SCORE:2.22
08-10 12:59 Epoch:8
L:4.94, VL:3.42, PL:1.52: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3145/3145 [01:27<00:00, 35.80it/s]
08-10 13:00 STARTING EVALUATION
R:0.0887,W:71.7623: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 389/389 [00:23<00:00, 16.73it/s]
08-10 13:01 F1 SCORE:   0.016246953696181964
08-10 13:01 F1 CAL: 0.05214723926380368
08-10 13:01 F1 WET: 0.006437768240343348
08-10 13:01 F1 NAV: 0.0
08-10 13:01 BLEU SCORE:4.49
08-10 13:01 MODEL SAVED
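
To make the trend in a log like the one above easier to read, the per-epoch F1 and BLEU values can be pulled out with a few regular expressions. This is a minimal sketch, not part of the Mem2Seq repository: the function name `summarize_log` is hypothetical, and the patterns are inferred only from the log lines pasted here, so they may need adjusting for other versions of the training script.

```python
import re

# Patterns inferred from the pasted log (assumption: other runs use the same format).
EPOCH_RE = re.compile(r"Epoch:(\d+)")           # e.g. "08-10 12:53 Epoch:4"
F1_RE = re.compile(r"F1 SCORE:\s*([0-9.]+)")    # overall F1, not F1 CAL/WET/NAV
BLEU_RE = re.compile(r"BLEU SCORE:\s*([0-9.]+)")


def summarize_log(lines):
    """Return a list of (epoch, f1, bleu) tuples, one per evaluated epoch."""
    rows = []
    epoch = f1 = None
    for line in lines:
        m = EPOCH_RE.search(line)
        if m:
            # A new epoch starts; forget the previous epoch's F1.
            epoch, f1 = int(m.group(1)), None
            continue
        m = F1_RE.search(line)
        if m:
            f1 = float(m.group(1))
            continue
        m = BLEU_RE.search(line)
        if m and epoch is not None and f1 is not None:
            rows.append((epoch, f1, float(m.group(1))))
    return rows


# Four lines copied from the log above, used as a small example input.
sample = [
    "08-10 12:53 Epoch:4",
    "08-10 12:54 F1 SCORE:   0.014622258326563772",
    "08-10 12:54 F1 CAL: 0.05214723926380368",
    "08-10 12:54 BLEU SCORE:2.31",
]
for epoch, f1, bleu in summarize_log(sample):
    print(f"epoch {epoch}: F1={f1:.4f} BLEU={bleu:.2f}")
```

The scheduler message ("Epoch     2: reducing learning rate ...") is skipped on purpose, because it has no `Epoch:` immediately followed by a number.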

jasonwu0731 commented 6 years ago

Hello @se4u

You may want to increase the hidden size: 12 is too small for the model to generalize well (it underfits). Slightly increasing the dropout rate should also help. Please give it a try.
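
One way to try this suggestion is to rebuild the training command from this thread with a different hidden size and dropout rate. The sketch below only assembles the argument list; the helper name `build_train_command` is hypothetical, and the default values (hidden size 50, dropout 0.2) are illustrative placeholders, not the settings used in the paper. The flags themselves are the ones that appear in the command quoted in this issue.

```python
def build_train_command(hidden=50, dropout=0.2, eval_period=1):
    """Return the argv list for main_train.py with the given hidden size and dropout.

    The defaults are placeholder values chosen for illustration only.
    """
    return [
        "python", "main_train.py",
        "-lr=0.001", "-layer=1",
        f"-hdd={hidden}", f"-dr={dropout}",
        "-dec=Mem2Seq", "-bsz=2", "-ds=kvr", "-t=",
        f"-evalp={eval_period}",
    ]


# Print the command so it can be pasted into a shell.
print(" ".join(build_train_command(hidden=50, dropout=0.2)))
```

The list can also be passed to `subprocess.run` from the repository root to launch training directly.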

se4u commented 6 years ago

@jasonwu0731 Thank you for your reply. I reran the experiment with hdd set to 50, and training works much better than before on the Calendar and Weather domains, but not on Navigation.

08-10 13:15 Epoch:0
L:6.00, VL:4.30, PL:1.70: 100%|███████████████████████████| 3145/3145 [01:32<00:00, 34.04it/s]
08-10 13:16 STARTING EVALUATION
R:0.0925,W:74.8909: 100%|███████████████████████████████████| 389/389 [00:27<00:00, 13.91it/s]
08-10 13:17 F1 SCORE:   0.0008123476848090981
08-10 13:17 F1 CAL: 0.0
08-10 13:17 F1 WET: 0.002145922746781116
08-10 13:17 F1 NAV: 0.0
08-10 13:17 BLEU SCORE:2.57
08-10 13:17 MODEL SAVED
08-10 13:17 Epoch:1
L:4.97, VL:3.49, PL:1.48: 100%|███████████████████████████| 3145/3145 [01:29<00:00, 35.24it/s]
08-10 13:18 STARTING EVALUATION
R:0.0964,W:68.3815: 100%|███████████████████████████████████| 389/389 [00:29<00:00, 13.41it/s]
08-10 13:19 F1 SCORE:   0.06580016246953696
08-10 13:19 F1 CAL: 0.10122699386503067
08-10 13:19 F1 WET: 0.09871244635193133
08-10 13:19 F1 NAV: 0.004555808656036446
08-10 13:19 BLEU SCORE:4.76
08-10 13:19 MODEL SAVED
08-10 13:19 Epoch:2
L:4.63, VL:3.24, PL:1.40: 100%|███████████████████████████| 3145/3145 [01:28<00:00, 35.51it/s]
08-10 13:20 STARTING EVALUATION
R:0.0977,W:68.5270: 100%|███████████████████████████████████| 389/389 [00:29<00:00, 13.01it/s]
08-10 13:20 F1 SCORE:   0.19740048740861085
08-10 13:20 F1 CAL: 0.3128834355828221
08-10 13:20 F1 WET: 0.30257510729613735
08-10 13:20 F1 NAV: 0.0
08-10 13:20 BLEU SCORE:7.6
08-10 13:20 MODEL SAVED
08-10 13:20 Epoch:3

The command that I used with hdd=12 came directly from the README.

❱❱❱ python3 main_train.py -lr=0.001 -layer=1 -hdd=12 -dr=0.0 -dec=Mem2Seq -bsz=2 -ds=kvr -t=

@andreamad8 Can you please update the README with the parameters used to obtain the results in the paper?

jasonwu0731 commented 6 years ago

Good to hear that. Let me modify the readme and close the issue.

se4u commented 6 years ago

@jasonwu0731 Thanks for updating the README. Are these the same hyperparameters that were used to generate the results in the paper?

jasonwu0731 commented 6 years ago

@se4u

You can check the appendix in our paper :)

https://arxiv.org/pdf/1804.08217.pdf

se4u commented 6 years ago

@jasonwu0731 Ah, got it, thank you. I was looking at your ACL paper, which is linked in the README: http://aclweb.org/anthology/P18-1136

I have one last question, regarding the train.txt file. How was it generated from kvret_train_public.json? Which file contains the preprocessing code? And can you briefly summarize the format of train.txt? I want to use your model on a different dataset.

jasonwu0731 commented 6 years ago

@se4u

Ah, we didn't upload the preprocessing code that parses the original .json file. Please check the .json file, since it includes a lot of other information that we didn't use in our paper, such as requests and slots.

train.txt contains the domain (lines starting with #), KB information (lines starting with 0), and dialog turns (lines starting with 1, 2, 3, ...).
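
The line types described above can be told apart by their prefix. The following is a minimal sketch under stated assumptions, not the authors' preprocessing code: it assumes the leading number is separated from the rest of the line by whitespace, the function name `classify_line` is hypothetical, and the example lines are invented for illustration rather than copied from train.txt.

```python
def classify_line(line):
    """Label one train.txt line as 'domain', 'kb', 'turn', 'blank', or 'unknown'."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    if stripped.startswith("#"):
        return "domain"        # domain marker, starts with '#'
    first = stripped.split(maxsplit=1)[0]
    if first == "0":
        return "kb"            # KB information, starts with 0
    if first.isdigit():
        return "turn"          # dialog turn, starts with 1, 2, 3, ...
    return "unknown"


# Invented example lines (assumption: real lines may use different fields).
examples = [
    "#navigate#",
    "0 dish_parking distance 2_miles",
    "1 where is the nearest parking_garage",
    "",
]
for text in examples:
    print(repr(text), "->", classify_line(text))
```

When adapting a different dataset, a check like this can confirm that every generated line falls into one of the three expected types before training.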
