cdyangbo / end2endASR

Implements end-to-end ASR algorithms with TensorFlow

NaN problem when training #7

Open huangjian2015 opened 5 years ago

huangjian2015 commented 5 years ago

Hello, thank you for your contribution. I have run into a problem: toward the end of the first epoch the loss becomes NaN and stays NaN, like this:

Epoch 1: 28%|#################7 | 47/167 [10:03<25:40, 12.84s/it, acc=24.5, loss=260, step=47]
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 69102 get requests, put_count=76956 evicted_count=7000 eviction_rate=0.0909611 and unsatisfied allocation rate=0
Epoch 1: 29%|##################4 | 49/167 [10:24<25:04, 12.75s/it, acc=25.5, loss=237, step=49]
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 16280 get requests, put_count=18313 evicted_count=1000 eviction_rate=0.054606 and unsatisfied allocation rate=0
Epoch 1: 30%|##################8 | 50/167 [10:35<24:47, 12.71s/it, acc=24.3, loss=262, step=50]
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 43740 get requests, put_count=48876 evicted_count=4000 eviction_rate=0.0818398 and unsatisfied allocation rate=0
Epoch 1: 99%|##############################################################6| 166/167 [32:06<00:11, 11.61s/it, acc=32, loss=nan, step=166]
wait!
Epoch 1: 100%|###############################################################| 167/167 [32:17<00:00, 11.60s/it, acc=32, loss=nan, step=167]
Epoch 2: 13%|########4 | 22/167 [03:51<25:23, 10.51s/it, acc=32, loss=nan, step=189]
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 28810067 get requests, put_count=28811000 evicted_count=2000 eviction_rate=6.94179e-05 and unsatisfied allocation rate=9.47585e-05
Epoch 2: 99%|##############################################################6| 166/167 [29:29<00:10, 10.66s/it, acc=32, loss=nan, step=333]
wait!
Epoch 2: 100%|###############################################################| 167/167 [29:40<00:00, 10.66s/it, acc=32, loss=nan, step=334

Have you encountered this problem?

cdyangbo commented 5 years ago

Reduce the learning rate. Then fine-tune the batch size and the learning rate together.
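As a rough illustration of why a smaller learning rate can stop the loss from turning into NaN, here is a toy sketch in plain Python. It is not code from this repository: the loss `f(w) = w**2`, the function name `train`, and the two learning rates are all assumptions chosen only to make the effect visible. With a step size that is too large, each update overshoots, the weight grows until it overflows to infinity, and `inf - inf` then yields NaN. With a small step size the same loop converges.

```python
import math


def train(learning_rate, steps=2000, w=1.0):
    """Run plain gradient descent on the toy loss f(w) = w**2.

    Hypothetical helper for illustration only; returns the loss
    recorded at every step.
    """
    losses = []
    for _ in range(steps):
        loss = w * w              # toy loss, stands in for the real ASR loss
        losses.append(loss)
        grad = 2.0 * w            # d/dw of w**2
        w = w - learning_rate * grad
    return losses


def first_non_finite_step(losses):
    """Return the first step whose loss is inf or NaN, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


if __name__ == "__main__":
    for lr in (1.5, 0.1):         # assumed values: one too large, one small
        losses = train(lr)
        print("lr =", lr,
              "| first non-finite step:", first_non_finite_step(losses),
              "| final loss:", losses[-1])
```

In a real training run the same kind of check (stop or log as soon as the loss is no longer finite) makes it easier to see at which step the divergence starts, which in turn helps when trying smaller learning rates or different batch sizes.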