Sierkinhane / CRNN_Chinese_Characters_Rec

(CRNN) Chinese Characters Recognition.

Loss starts out very large and turns to nan within a few steps; adjusting the learning rate doesn't help. Please take a look, it's urgent (reward offered) #134

Closed isyanan1024 closed 4 years ago

isyanan1024 commented 5 years ago

I'm using the val_set from the data_set bundled with the code (the training set and test set are the same), but the loss turns to nan after only a few steps:

```
[0/300][0/32] Loss: 329.707031
[0/300][1/32] Loss: 320.976501
[0/300][2/32] Loss: 290.012207
[0/300][3/32] Loss: 224.290970
[0/300][4/32] Loss: 172.834579
[0/300][5/32] Loss: nan
... (Loss: nan for every remaining batch through [0/300][31/32])
Start val
... (all 63 validation batches run; every prediction is empty, e.g.)
=> , gt: 杨有限责任公司(自然
=> , gt: 龙源汽车租赁有限公司
=> , gt: 责任公司(自然人投资
=> , gt: 板、装饰木片及各类木
=> , gt: 08-282000-
=> , gt: 询;销售家用电器、建
=> , gt: 目:纸箱的加工、销售
=> , gt: -05-21有限责任
```

What could be causing this? Do I need to re-adjust alphabets.py?

isyanan1024 commented 5 years ago

This is urgent; I'd really appreciate some help. WeChat: an10246115

isyanan1024 commented 5 years ago

The problem is solved; it really was that alphabets.py hadn't been adjusted. Now I get a real loss value, but the loss stays very large:

```
=> , gt: 4150020011
=> , gt: 市海淀区海淀路甲31
=> , gt: 泰米业有限责任公司陈
=> , gt: 田品牌汽车、进口本田
=> , gt: 330安徽省合肥市长
=> , gt: 用电器,厨房设备,通
=> , gt: i2003-12-3
=> , gt: 制品,药物日用化工制
1 16000 Test loss: 111.667801, accuracy: 0.000063, is best accuracy: False
[96/300][0/32] Loss: 109.684082
[96/300][1/32] Loss: 111.541206
... (loss stays around 110-117 for the rest of epoch 96, ending at [96/300][31/32] Loss: 114.328117)
```

What is causing this?

JisongXie commented 5 years ago

@isyanan1024 Hello, which adjustments need to be made in alphabets.py? I've encountered this problem too. Can you explain it in detail, please?

isyanan1024 commented 5 years ago

@JisongXie if your train.txt is 'curry' 'james' 'kobe' 'yanan', your alphabets.py should be alphabet = """curyjamesan""".

JisongXie commented 5 years ago

@isyanan1024 Okay, I'll give it a try. Thanks!

JisongXie commented 5 years ago

@isyanan1024 I downloaded the 3.6-million Chinese characters dataset, and downloaded the train.txt and test.txt from Baidu Yunpan (password: w877), which are provided by the author @Sierkinhane. I modified the image path and txt path in crnn_main_v2.py. When I run it, the loss becomes nan.

As you said, alphabets.py needs to be adjusted, but train.txt looks like this: [screenshot of train.txt]. So do you mean I should extract the characters after the .jpg filename on each line and merge them together to form the alphabet?
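
For what it's worth, here is a minimal sketch of that idea (a hypothetical helper, not the repo's own tooling; it assumes each line of train.txt/test.txt is "<image>.jpg <label>" with a single space separating the file name from the label):

```python
# Hypothetical helper: regenerate alphabets.py from the label files.
# Assumes the "<image>.jpg <label>" annotation format described above.
def build_alphabet(txt_paths):
    chars = set()
    for path in txt_paths:
        with open(path, encoding='utf-8') as f:
            for line in f:
                parts = line.rstrip('\n').split(' ', 1)
                if len(parts) == 2:
                    chars.update(parts[1])  # collect every character in the label
    return ''.join(sorted(chars))

if __name__ == '__main__':
    # Cover both train and val labels (see the later comments in this thread).
    alphabet = build_alphabet(['train.txt', 'test.txt'])
    with open('alphabets.py', 'w', encoding='utf-8') as f:
        f.write('alphabet = """%s"""\n' % alphabet)
```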

JisongXie commented 5 years ago

@isyanan1024 I tried it and it still outputs nan. I understand that the alphabet is the set of all possible characters. I also adjusted the learning rate: when the lr is set high, it outputs nan more quickly. It's confusing. It always outputs nan at the 1019th batch, even when I make the lr smaller. [screenshot]

isyanan1024 commented 5 years ago

@JisongXie you can change batchSize from 32 to 2. I did this and it works.

JisongXie commented 5 years ago

@isyanan1024 but then it will train slowly. Well, for now I've modified the params as below, and it hasn't output nan so far. [screenshot of modified params]

isyanan1024 commented 5 years ago

Maybe you can try 4, 8 or 16, and you should try this: optimizer = optim.Adam(crnn.parameters(), lr=params.lr, betas=(params.beta1, 0.999))
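
For context, a minimal self-contained sketch of that optimizer change (the toy model and the lr/beta1 values below are placeholders standing in for the repo's CRNN model and its params.py, not its actual defaults):

```python
import torch.nn as nn
import torch.optim as optim

# Placeholder model standing in for the repo's CRNN.
crnn = nn.Sequential(nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU())

# The suggested change: Adam with an explicit betas tuple.
# lr=0.0001 and beta1=0.5 are assumed values, not the repo's defaults.
optimizer = optim.Adam(crnn.parameters(), lr=0.0001, betas=(0.5, 0.999))
```

Adam adapts per-parameter step sizes, which is often more forgiving when CTC loss gradients spike, so it may be why this helps against nan here.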

JisongXie commented 5 years ago

@isyanan1024 yes, I'm using the Adam optimizer now. I have plenty of GPUs, but when I raise the batchSize up to e.g. 128, it outputs nan too. So I'm lowering the batchSize for now. Thanks for your advice!

wi162yyxq commented 5 years ago

I've been trying for several days, and I found that it breaks as soon as I move to the GPU, while it runs perfectly fine on the CPU. Very strange.

JisongXie commented 5 years ago

@wi162yyxq yeah, when I run on the CPU it works fine. However, with the GPU I find it eventually gets nan, and the loss hovers around 150. On the CPU the loss drops below 150. [screenshot]

Cocoalate commented 5 years ago

@JisongXie if your train.txt is 'curry' 'james' 'kobe' 'yanan', your alphabets.py should be alphabet = """curyjamesan""".

Do you mean deduplication? How exactly should alphabets.py be adjusted? And in your example, you forgot to include 'kobe', didn't you?

JisongXie commented 5 years ago

@Cocoalate alphabets.py should contain all the characters in your training data. It's a character set, so of course it doesn't contain duplicates. I've posted the solution to the 'nan' problem here; please see my reference.

gdxytim commented 5 years ago

@Cocoalate alphabets.py should contain all the characters in your training data. It's a character set, so of course it doesn't contain duplicates. I've posted the solution to the 'nan' problem here; please see my reference.

But when I train, the alphabet must include the characters of both the train and val sets; otherwise it reports an error on the character '字' during validation.

JisongXie commented 5 years ago

@gdxytim Yes, of course: both the train and val characters. It's obvious.

gdxytim commented 5 years ago

@gdxytim Yes, of course: both the train and val characters. It's obvious.

Training my dataset with crnn_main.py works fine, but crnn_main_v2.py prints "_src.empty() in function 'cv::cvtColor'". Why, with the same dataset? Could it be a 0-byte image? Thank you.

Cocoalate commented 5 years ago

@gdxytim I think crnn_main.py uses the lmdb file generated from the images, while crnn_main_v2.py uses the images themselves. You should download the 3.6-million dataset and change the directory paths in crnn_main_v2.py.
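
As a quick way to test the 0-byte-image theory above, here is a minimal sketch (a hypothetical helper, again assuming the "<image>.jpg <label>" annotation format) that flags files cv2 cannot read, which is what triggers that cvtColor error:

```python
import os
import cv2

# cv2.imread returns None for missing, 0-byte, or corrupted files,
# and a subsequent cv2.cvtColor call on None raises
# "_src.empty() in function 'cv::cvtColor'".
def find_bad_images(txt_path, image_root):
    bad = []
    with open(txt_path, encoding='utf-8') as f:
        for line in f:
            name = line.split(' ', 1)[0].strip()
            if not name:
                continue
            path = os.path.join(image_root, name)
            # Short-circuit: existence and size checks run before imread.
            if (not os.path.isfile(path)
                    or os.path.getsize(path) == 0
                    or cv2.imread(path) is None):
                bad.append(path)
    return bad

if __name__ == '__main__':
    for p in find_bad_images('train.txt', 'images'):
        print('unreadable:', p)
```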

gdxytim commented 5 years ago

Thank you, but I need to use my own data.