Open Lipanw opened 2 years ago
I am running this on Windows 10.
How long did it run for? Do you have any other information?
Traceback (most recent call last):
File "train.py", line 204, in
2022-04-27 17:47:02 [INFO] [TRAIN] epoch: 0, iter: 100/15000, loss: 2.4868, DSC: 4.1360, lr: 0.009941, batch_cost: 0.7021, reader_cost: 0.00082, ips: 1.4244 samples/sec | ETA 02:54:20
2022-04-27 17:48:13 [INFO] [TRAIN] epoch: 1, iter: 200/15000, loss: 1.1843, DSC: 4.3465, lr: 0.009881, batch_cost: 0.7081, reader_cost: 0.00062, ips: 1.4123 samples/sec | ETA 02:54:39
2022-04-27 17:49:24 [INFO] [TRAIN] epoch: 2, iter: 300/15000, loss: 1.1282, DSC: 4.3768, lr: 0.009820, batch_cost: 0.7096, reader_cost: 0.00016, ips: 1.4092 samples/sec | ETA 02:53:51
2022-04-27 17:50:35 [INFO] [TRAIN] epoch: 2, iter: 400/15000, loss: 1.1043, DSC: 4.3364, lr: 0.009760, batch_cost: 0.7107, reader_cost: 0.00047, ips: 1.4071 samples/sec | ETA 02:52:56
2022-04-27 17:51:46 [INFO] [TRAIN] epoch: 3, iter: 500/15000, loss: 1.0901, DSC: 4.3506, lr: 0.009700, batch_cost: 0.7109, reader_cost: 0.00047, ips: 1.4066 samples/sec | ETA 02:51:48
2022-04-27 17:51:46 [INFO] Start evaluating (total_samples: 5, total_iters: 5)...
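Every time, training runs to iteration 500 and then stops as soon as model evaluation is about to start.
It sounds like something is going wrong during validation. We have updated the code since this issue was opened, so you could pull the latest version and try again with a smaller save_interval.
This is a problem with saving results during evaluation. For now you can comment out the save_array part and start training, then attach here either a link to complete, reproducible code or a description of what you modified.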
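2022-05-06 14:56:46 [INFO] [TRAIN] epoch: 4, iter: 100/15000, loss: 4.4847, DSC: 3.7124, lr: 0.000994, batch_cost: 6.5770, reader_cost: 2.26782, ips: 0.9123 samples/sec | ETA 27:13:17
Hello, the previous problem has been solved. However, compared with the lr=0.001 example you give on the home page, why is the DSC so low here, and why is the loss so high?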
2022-05-06 14:45:45 [INFO]
---------------Config Information---------------
batch_size: 6
data_root: tools/data
iters: 15000
loss:
coef:
2022-05-06 14:56:46 [INFO] [TRAIN] epoch: 4, iter: 100/15000, loss: 4.4847, DSC: 3.7124, lr: 0.000994, batch_cost: 6.5770, reader_cost: 2.26782, ips: 0.9123 samples/sec | ETA 27:13:17
2022-05-06 15:07:41 [INFO] [TRAIN] epoch: 8, iter: 200/15000, loss: 3.5398, DSC: 3.8685, lr: 0.000988, batch_cost: 6.5488, reader_cost: 2.25564, ips: 0.9162 samples/sec | ETA 26:55:22
2022-05-06 15:18:36 [INFO] [TRAIN] epoch: 12, iter: 300/15000, loss: 2.8668, DSC: 3.9746, lr: 0.000982, batch_cost: 6.5445, reader_cost: 2.25206, ips: 0.9168 samples/sec | ETA 26:43:24
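To compare two runs such as the lr=0.01 and lr=0.001 logs in this thread, it may help to pull the numbers out of the [TRAIN] lines rather than read them by eye. The following is a minimal sketch, not part of the repository: `parse_train_log` is a hypothetical helper, and the regular expression only assumes the line layout visible in the logs pasted above.

```python
import re

# Matches the "[TRAIN]" entries as they appear in the logs in this thread.
# The layout (epoch, iter, loss, DSC, lr) is taken from those lines only.
TRAIN_LINE = re.compile(
    r"\[TRAIN\] epoch: (?P<epoch>\d+), iter: (?P<iter>\d+)/(?P<total>\d+), "
    r"loss: (?P<loss>[\d.]+), DSC: (?P<dsc>[\d.]+), lr: (?P<lr>[\d.]+)"
)


def parse_train_log(text):
    """Return one dict per [TRAIN] entry found in text (hypothetical helper)."""
    records = []
    # finditer also works when several entries share one physical line.
    for m in TRAIN_LINE.finditer(text):
        records.append({
            "epoch": int(m["epoch"]),
            "iter": int(m["iter"]),
            "total_iters": int(m["total"]),
            "loss": float(m["loss"]),
            "dsc": float(m["dsc"]),
            "lr": float(m["lr"]),
        })
    return records


if __name__ == "__main__":
    sample = (
        "2022-05-06 14:56:46 [INFO] [TRAIN] epoch: 4, iter: 100/15000, "
        "loss: 4.4847, DSC: 3.7124, lr: 0.000994, batch_cost: 6.5770\n"
        "2022-05-06 15:07:41 [INFO] [TRAIN] epoch: 8, iter: 200/15000, "
        "loss: 3.5398, DSC: 3.8685, lr: 0.000988, batch_cost: 6.5488\n"
    )
    for rec in parse_train_log(sample):
        print(rec["iter"], rec["loss"], rec["dsc"], rec["lr"])
```

Running the same function over both logs gives loss and DSC at matching iterations, which makes it easier to tell whether a run is actually worse or simply at an earlier stage of training.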
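The lr could probably be somewhat larger.
One issue per problem is enough. Also, this looks like a data problem. Did you modify the data-processing code? Or could you list all the changes you have made?
Why is it that after running run-vnet.sh, no trained model is saved to best_model, and train.log is also empty?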