Jittor / jittor

Jittor is a high-performance deep learning framework based on JIT compiling and meta-operators.
https://cg.cs.tsinghua.edu.cn/jittor/
Apache License 2.0
3.08k stars 311 forks

Error when training reaches epoch 50 #231

Open xiashuo opened 3 years ago

xiashuo commented 3 years ago

Testing result of epoch 49 miou = 0.8124253827636501 Acc = 0.9737544722860297 Acc_class = 0.8936238910523011 FWIoU = 0.9513300801464021 Best Miou = 0.8124253827636501
epoch =50 iteration = 0 new_lr = 0.0
Training in epoch 50 iteration 0 loss = 0.0976564958691597
epoch =50 iteration = 1 new_lr = (-2.6397669900364505e-07+8.577122885282971e-08j)
Traceback (most recent call last):
  File "/disk_sda/xs/projects/xuexian-jittor/train.py", line 88, in <module>
    main()
  File "/disk_sda/xs/projects/xuexian-jittor/train.py", line 83, in main
    train(model, train_loader, optimizer, epoch, learning_rate, writer)
  File "/disk_sda/xs/projects/xuexian-jittor/train.py", line 30, in train
    optimizer.step(loss)
  File "/home/kqgis/anaconda3/envs/xuexian-jittor/lib/python3.8/site-packages/jittor/optim.py", line 323, in step
    step_size = lr * jt.sqrt(1-b1**n) / (1-b0 ** n)
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.mul)).

Types of your inputs are:
 self = complex,
 b = Var,

The function declarations are:
 VarHolder* multiply(VarHolder* x, VarHolder* y)

Failed reason:[f 0622 22:21:17.156459 00 pyjt_jit_op_maker.cc:12084] Not a valid call.

Jittor commented 3 years ago

Thank you for the feedback. The error message shows that the learning rate seems to have become a complex number. Which lr scheduler are you using? Please check it.


xiashuo commented 3 years ago

def poly_lr_scheduler(opt, init_lr, iter, epoch, max_iter, max_epoch):
    new_lr = init_lr * (1 - float(epoch * max_iter + iter) / (max_epoch * max_iter)) ** 0.9
    opt.lr = new_lr
    print("epoch ={} iteration = {} new_lr = {}".format(epoch, iter, new_lr))

This is the scheduler from Jittor's official deeplabv3 code. Please take a look.


xiashuo commented 3 years ago

Training in epoch 49 iteration 170 loss = 0.07775118947029114
epoch =49 iteration = 171 new_lr = 1.8036001240646119e-06
Training in epoch 49 iteration 171 loss = 0.08170900493860245
epoch =49 iteration = 172 new_lr = 1.599364692153999e-06
Training in epoch 49 iteration 172 loss = 0.1264967918395996
epoch =49 iteration = 173 new_lr = 1.3921800100907519e-06
Training in epoch 49 iteration 173 loss = 0.12147299945354462
epoch =49 iteration = 174 new_lr = 1.1814960448007202e-06
Training in epoch 49 iteration 174 loss = 0.07476350665092468
epoch =49 iteration = 175 new_lr = 9.665253749997004e-07
Training in epoch 49 iteration 175 loss = 0.10374470055103302
epoch =49 iteration = 176 new_lr = 7.460507949443883e-07
Training in epoch 49 iteration 176 loss = 0.08783818036317825
epoch =49 iteration = 177 new_lr = 5.179481238964248e-07
Training in epoch 49 iteration 177 loss = 0.059234023094177246
epoch =49 iteration = 178 new_lr = 2.7756152708120006e-07
Training in epoch 49 iteration 178 loss = 0.08562503010034561
Test in epoch 49 iteration 0
Test in epoch 49 iteration 1
Test in epoch 49 iteration 2

For the first 49 epochs the lr was normal. At epoch 50 the lr suddenly became a complex number. This lr scheduler is from Jittor's official code.
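The jump to a complex value can be reproduced in plain Python, without Jittor: once epoch * max_iter + iter exceeds max_epoch * max_iter, the base of the power becomes negative, and Python 3 returns a complex number for a negative float raised to a fractional power. The sketch below reuses the formula from the scheduler quoted in this thread. The values max_iter = 179 and init_lr = 1e-3 are assumptions inferred from the logs (iterations 0 to 178, and the last printed lr of epoch 49), not confirmed by the author; poly_lr is a hypothetical helper name.

```python
def poly_lr(init_lr, it, epoch, max_iter, max_epoch, power=0.9):
    # Same formula as the poly_lr_scheduler quoted in this thread.
    progress = float(epoch * max_iter + it) / (max_epoch * max_iter)
    return init_lr * (1 - progress) ** power

# Assumed values, inferred from the logs rather than confirmed.
INIT_LR, MAX_ITER, MAX_EPOCH = 1e-3, 179, 50

last_ok = poly_lr(INIT_LR, 178, 49, MAX_ITER, MAX_EPOCH)   # last step of epoch 49
at_end = poly_lr(INIT_LR, 0, 50, MAX_ITER, MAX_EPOCH)      # base is exactly 0.0
past_end = poly_lr(INIT_LR, 1, 50, MAX_ITER, MAX_EPOCH)    # base is negative

# Prints: float float complex
print(type(last_ok).__name__, type(at_end).__name__, type(past_end).__name__)
```

Under these assumptions the schedule is already exhausted at epoch 50, iteration 0 (new_lr = 0.0 in the log), so any further step yields a complex lr, which jt.mul then rejects.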


MenghaoGuo commented 3 years ago

May I ask what the actual arguments for epoch and max_epoch are in the call poly_lr_scheduler(opt, init_lr, iter, epoch, max_iter, max_epoch) at epoch 50?

xiashuo commented 3 years ago

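It should be the problem you pointed out, and I have found it. In the official deeplab code, the train function calls poly_lr_scheduler with the max_epoch argument hard-coded to 50, and I did not change it. Sorry.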
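Besides passing the real number of training epochs as max_epoch, one possible safeguard is to clamp the base of the power at zero so the schedule can never return a complex value. This is a minimal sketch, not the official Jittor or deeplab code; poly_lr_clamped is a hypothetical name, and the numeric values in the usage lines are assumptions inferred from the logs in this thread.

```python
def poly_lr_clamped(init_lr, it, epoch, max_iter, max_epoch, power=0.9):
    # Fraction of the planned schedule that has already been consumed.
    progress = float(epoch * max_iter + it) / (max_epoch * max_iter)
    # Clamp at 0.0 so the base is never negative, even past max_epoch.
    base = max(0.0, 1.0 - progress)
    return init_lr * base ** power

# Assumed values (init_lr=1e-3, max_iter=179), inferred from the logs.
start_lr = poly_lr_clamped(1e-3, 0, 0, 179, 50)
overrun_lr = poly_lr_clamped(1e-3, 1, 50, 179, 50)
print(start_lr, overrun_lr)
```

The caller would assign the result to the optimizer, as the original scheduler does with opt.lr = new_lr. Clamping only hides the overrun (the lr stays at 0.0 and training stops making progress), so max_epoch should still match the actual training length.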


xiashuo commented 3 years ago

"May I ask what the actual arguments for epoch and max_epoch are in the call poly_lr_scheduler(opt, init_lr, iter, epoch, max_iter, max_epoch) at epoch 50?"

Thanks, that should be the problem.