asyml / texar

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
https://asyml.io
Apache License 2.0

Unstable VAE train example #259

Open swapnull7 opened 4 years ago

swapnull7 commented 4 years ago

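Training the VAE example with the default parameters is unstable: every metric becomes NaN between steps 800 and 1000 of epoch 0 and stays NaN afterwards. Logs: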

train: epoch 0, step 200, nll 150.3577, klw: 0.1153, KL 0.2740,  rc 150.3278, log_ppl 6.7723, ppl 873.2936, time elapsed: 19.7s
train: epoch 0, step 400, nll 148.7724, klw: 0.1305, KL 1.0032,  rc 148.6502, log_ppl 6.6977, ppl 810.5463, time elapsed: 36.8s
train: epoch 0, step 600, nll 148.2995, klw: 0.1457, KL 1.5545,  rc 148.0989, log_ppl 6.6933, ppl 806.9564, time elapsed: 53.9s
train: epoch 0, step 800, nll 147.0893, klw: 0.1609, KL 1.3469,  rc 146.9111, log_ppl 6.6390, ppl 764.3558, time elapsed: 70.9s
train: epoch 0, step 1000, nll nan, klw: 0.1761, KL nan,  rc nan, log_ppl nan, ppl nan, time elapsed: 87.8s
train: epoch 0, step 1200, nll nan, klw: 0.1914, KL nan,  rc nan, log_ppl nan, ppl nan, time elapsed: 104.0s

train: epoch 0, nll nan, KL nan, rc nan, log_ppl nan, ppl nan

valid: epoch 0, nll nan, KL nan, rc nan, log_ppl nan, ppl nan

test: epoch 0, nll nan, KL nan, rc nan, log_ppl nan, ppl nan

train: epoch 1, step 0, nll nan, klw: 0.2002, KL nan,  rc nan, log_ppl nan, ppl nan, time elapsed: 0.2s
train: epoch 1, step 200, nll nan, klw: 0.2154, KL nan,  rc nan, log_ppl nan, ppl nan, time elapsed: 15.5s
train: epoch 1, step 400, nll nan, klw: 0.2306, KL nan,  rc nan, log_ppl nan, ppl nan, time elapsed: 32.6s
train: epoch 1, step 600, nll nan, klw: 0.2458, KL nan,  rc nan, log_ppl nan, ppl nan, time elapsed: 49.4s
train: epoch 1, step 800, nll nan, klw: 0.2610, KL nan,  rc nan, log_ppl nan, ppl nan, time elapsed: 64.2s
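
To pin down where a run like this diverges without reading the log by eye, the lines can be parsed mechanically. The following is a minimal sketch that assumes only the log format shown above; `parse_line` and `first_nan_step` are hypothetical helper names for illustration and are not part of texar.

```python
import math
import re

# Matches "key value" or "key: value" pairs, e.g. "nll 150.3577" or "klw: 0.1153".
PAIR = re.compile(r"(\w+):?\s+(nan|-?\d+(?:\.\d+)?)")


def parse_line(line):
    """Split one log line into its prefix ("train", "valid", "test") and metrics."""
    split, _, rest = line.partition(":")
    metrics = {key: float(value) for key, value in PAIR.findall(rest)}
    return split.strip(), metrics


def first_nan_step(lines):
    """Return (epoch, step) of the first per-step train line whose nll is NaN."""
    for line in lines:
        split, metrics = parse_line(line)
        is_step_line = split == "train" and "step" in metrics
        if is_step_line and math.isnan(metrics.get("nll", 0.0)):
            return int(metrics["epoch"]), int(metrics["step"])
    return None


# Two lines copied from the log above.
log = [
    "train: epoch 0, step 800, nll 147.0893, klw: 0.1609, KL 1.3469,  rc 146.9111, log_ppl 6.6390, ppl 764.3558, time elapsed: 70.9s",
    "train: epoch 0, step 1000, nll nan, klw: 0.1761, KL nan,  rc nan, log_ppl nan, ppl nan, time elapsed: 87.8s",
]
print(first_nan_step(log))  # (0, 1000)
```

Because metrics are only printed every 200 steps, the result is an upper bound: the first non-finite value appeared somewhere after step 800 and no later than step 1000.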

For reference, logs from the vae-train example in texar-pytorch:


train: epoch 0, step 0, nll 202.0137, klw 0.1002, KL 0.0218, rc 202.0115, log_ppl 9.2349, ppl 10248.7397, time_cost 0.6
train: epoch 0, step 200, nll 145.4623, klw 0.1154, KL 1.2449, rc 145.3253, log_ppl 6.5687, ppl 712.4463, time_cost 23.2
train: epoch 0, step 400, nll 139.3877, klw 0.1306, KL 1.8096, rc 139.1726, log_ppl 6.3007, ppl 544.9752, time_cost 45.7
train: epoch 0, step 600, nll 135.5956, klw 0.1458, KL 2.1700, rc 135.3190, log_ppl 6.1319, ppl 460.2903, time_cost 68.1
train: epoch 0, step 800, nll 133.1483, klw 0.1610, KL 2.3624, rc 132.8281, log_ppl 6.0154, ppl 409.6862, time_cost 90.7
train: epoch 0, step 1000, nll 130.8279, klw 0.1762, KL 2.4912, rc 130.4704, log_ppl 5.9178, ppl 371.5942, time_cost 112.9
train: epoch 0, step 1200, nll 129.0042, klw 0.1914, KL 2.5828, rc 128.6131, log_ppl 5.8383, ppl 343.1805, time_cost 134.9

train: epoch 0, nll 128.1585, KL 2.6265, rc 127.7491, log_ppl 5.7997, ppl 330.2135

valid: epoch 0, nll 119.1858, KL 2.9482, rc 116.2376, log_ppl 5.4454, ppl 231.7005

test: epoch 0, nll 118.1185, KL 2.8654, rc 115.2531, log_ppl 5.3893, ppl 219.0593
train: epoch 1, step 0, nll 117.8860, klw 0.2003, KL 3.2506, rc 117.2353, log_ppl 5.1465, ppl 171.8215, time_cost 0.1
train: epoch 1, step 200, nll 117.5880, klw 0.2155, KL 3.3343, rc 116.8944, log_ppl 5.2860, ppl 197.5439, time_cost 22.2
train: epoch 1, step 400, nll 115.9266, klw 0.2307, KL 3.5084, rc 115.1690, log_ppl 5.2572, ppl 191.9527, time_cost 44.4
train: epoch 1, step 600, nll 115.5976, klw 0.2459, KL 3.6699, rc 114.7753, log_ppl 5.2438, ppl 189.3889, time_cost 66.2
train: epoch 1, step 800, nll 115.2580, klw 0.2611, KL 3.8097, rc 114.3733, log_ppl 5.2201, ppl 184.9485, time_cost 88.1
train: epoch 1, step 1000, nll 114.8173, klw 0.2763, KL 3.9137, rc 113.8769, log_ppl 5.1968, ppl 180.7011, time_cost 109.8
train: epoch 1, step 1200, nll 114.5588, klw 0.2915, KL 3.9752, rc 113.5725, log_ppl 5.1819, ppl 178.0216, time_cost 130.7

train: epoch 1, nll 114.3053, KL 4.0000, rc 113.2953, log_ppl 5.1728, ppl 176.4117

valid: epoch 1, nll 113.7273, KL 4.2768, rc 109.4505, log_ppl 5.1961, ppl 180.5585

test: epoch 1, nll 112.7315, KL 4.1861, rc 108.5454, log_ppl 5.1436, ppl 171.3240
train: epoch 2, step 0, nll 107.0595, klw 0.3004, KL 4.1894, rc 105.8015, log_ppl 4.9940, ppl 147.5299, time_cost 0.1
train: epoch 2, step 200, nll 108.8126, klw 0.3156, KL 4.3093, rc 107.4859, log_ppl 4.9412, ppl 139.9346, time_cost 20.1
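
When comparing the two runs it helps to know how the logged fields relate to each other. The sketch below checks two relationships that can be read off the valid/test lines of the texar-pytorch log: `nll` equals `rc + KL`, and `ppl` equals `exp(log_ppl)` up to rounding. These identities are inferred from the printed numbers, not taken from the example's source code, and the name `consistent` is a hypothetical helper.

```python
import math

# (split, epoch, nll, KL, rc, log_ppl, ppl), copied from the valid/test lines above.
rows = [
    ("valid", 0, 119.1858, 2.9482, 116.2376, 5.4454, 231.7005),
    ("test", 0, 118.1185, 2.8654, 115.2531, 5.3893, 219.0593),
    ("valid", 1, 113.7273, 4.2768, 109.4505, 5.1961, 180.5585),
    ("test", 1, 112.7315, 4.1861, 108.5454, 5.1436, 171.3240),
]


def consistent(nll, kl, rc, log_ppl, ppl):
    """Check nll == rc + KL and ppl == exp(log_ppl), allowing for 4-decimal rounding."""
    nll_ok = math.isclose(nll, rc + kl, abs_tol=1e-3)
    ppl_ok = math.isclose(ppl, math.exp(log_ppl), rel_tol=1e-3)
    return nll_ok and ppl_ok


for split, epoch, *values in rows:
    print(split, epoch, consistent(*values))
```

The per-step train lines do not satisfy `nll == rc + KL` exactly, which suggests the KL term is weighted by `klw` there; treat that as an observation from the logs rather than a confirmed detail of the implementation.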