Open 188080501 opened 3 days ago
OK, let us confirm this.
Currently this write is mandatory. It exists so that if the program is interrupted, training can resume from the checkpoint. Your suggestion is a good one, though, and we may add this capability later.
Does this write have to happen every epoch? Can an interval be set, or can it be turned off entirely?
You can adjust it by changing the model save interval, but it cannot be turned off. Add the parameter: -o Train.save_interval=5
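As a rough illustration of what an interval-based save policy looks like (a hypothetical helper, not the actual PaddleClas implementation; as the rest of the thread shows, the latest write is not gated by this interval):

```python
# Hypothetical sketch of an epoch-interval save guard: a checkpoint is
# written only when the epoch index hits the interval (plus the final
# epoch). Names here are illustrative, not the PaddleClas API.
def should_save(epoch: int, save_interval: int, total_epochs: int) -> bool:
    return epoch % save_interval == 0 or epoch == total_epochs

# With Train.save_interval=5 over 50 epochs, only every 5th epoch is saved:
saved = [e for e in range(1, 51) if should_save(e, 5, 50)]
```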
I have already modified eval_interval and save_interval in paddlex\configs\image_classification\PP-HGNetV2-B6.yaml. But neither of those two parameters, nor the one you posted, affects the saving of latest: after setting them, the latest model data is still written once every epoch.
Is this expected?
Currently this write is mandatory. It exists so that if the program is interrupted, training can resume from the checkpoint. Your suggestion is a good one, though, and we may add this capability later.
The current save_interval already covers that: for example, set it to save every 20 epochs, and if training is interrupted it can restart from epoch 20, or epoch 40, and so on.
Personally, if the goal is just to guard against interruptions, I don't think it's necessary to save every epoch.
I just confirmed: it cannot be changed at the moment. We will evaluate the request you've raised here and optimize this in a future release.
Thanks for the reply; I'll look for a temporary workaround myself.
My temporary workaround requires modifying 2 files.
The first is paddlex\repo_manager\repos\PaddleClas\ppcls\utils\save_load.py: add the following code at the beginning of the save_model function:
if prefix == "latest":
    return
The second is E:\PaddleX3b2\paddlex\repo_manager\repos\PaddleClas\ppcls\engine\engine.py: add the following code at the beginning of the export function:
if 'latest' in os.path.abspath(save_path).split(os.sep):
    logger.info(f"Skipping export to 'latest' path: {save_path}")
    return
After these changes, the latest model is no longer saved, and the trained model still works for inference, so there should be no side effects. I haven't verified this further. The gap between epochs has also shrunk noticeably, to about 1 second, with a clear increase in GPU utilization; before the change it was over 10 seconds, all of it spent waiting for the model save.
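The two guards can be exercised in isolation with a small sketch, with the actual file writing replaced by a list append; everything outside the two `if` checks is a stand-in, not PaddleClas code:

```python
import os

def save_model(prefix, sink):
    # Guard added at the top of save_model in save_load.py:
    # skip any save whose prefix is "latest".
    if prefix == "latest":
        return
    sink.append(prefix)  # stand-in for writing checkpoint files

def export(save_path, sink):
    # Guard added at the top of export in engine.py: skip any export
    # whose absolute path contains a "latest" component.
    # (The real patch also logs a skip message via logger.info.)
    if 'latest' in os.path.abspath(save_path).split(os.sep):
        return
    sink.append(save_path)  # stand-in for exporting inference files

writes = []
save_model("latest", writes)    # skipped by the first guard
save_model("epoch_20", writes)  # kept
export(os.path.join("output", "latest", "inference"), writes)      # skipped
export(os.path.join("output", "best_model", "inference"), writes)  # kept
```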
Part of the training log after the change:
[2024/11/20 17:33:15] ppcls INFO: [Train][Epoch 15/50][Iter: 0/64]lr(LinearWarmup): 0.02803125, top1: 0.40278, top2: 0.65278, CELoss: 1.70701, loss: 1.70701, batch_cost: 0.27744s, reader_cost: 0.04297, ips: 259.51662 samples/s, eta: 0:10:39
[2024/11/20 17:33:18] ppcls INFO: [Train][Epoch 15/50][Iter: 10/64]lr(LinearWarmup): 0.02834375, top1: 0.34217, top2: 0.61237, CELoss: 1.76351, loss: 1.76351, batch_cost: 0.27889s, reader_cost: 0.04166, ips: 258.16577 samples/s, eta: 0:10:39
[2024/11/20 17:33:21] ppcls INFO: [Train][Epoch 15/50][Iter: 20/64]lr(LinearWarmup): 0.02865625, top1: 0.31195, top2: 0.58225, CELoss: 1.79635, loss: 1.79635, batch_cost: 0.30269s, reader_cost: 0.07979, ips: 237.86815 samples/s, eta: 0:11:31
[2024/11/20 17:33:24] ppcls INFO: [Train][Epoch 15/50][Iter: 30/64]lr(LinearWarmup): 0.02896875, top1: 0.30847, top2: 0.58993, CELoss: 1.77478, loss: 1.77478, batch_cost: 0.29790s, reader_cost: 0.05787, ips: 241.68835 samples/s, eta: 0:11:17
[2024/11/20 17:33:27] ppcls INFO: [Train][Epoch 15/50][Iter: 40/64]lr(LinearWarmup): 0.02928125, top1: 0.31256, top2: 0.59002, CELoss: 1.76980, loss: 1.76980, batch_cost: 0.29471s, reader_cost: 0.05649, ips: 244.30908 samples/s, eta: 0:11:07
[2024/11/20 17:33:30] ppcls INFO: [Train][Epoch 15/50][Iter: 50/64]lr(LinearWarmup): 0.02959375, top1: 0.31779, top2: 0.59807, CELoss: 1.76329, loss: 1.76329, batch_cost: 0.30174s, reader_cost: 0.06176, ips: 238.61734 samples/s, eta: 0:11:20
[2024/11/20 17:33:33] ppcls INFO: [Train][Epoch 15/50][Iter: 60/64]lr(LinearWarmup): 0.02990625, top1: 0.32336, top2: 0.59724, CELoss: 1.76220, loss: 1.76220, batch_cost: 0.30819s, reader_cost: 0.06626, ips: 233.62294 samples/s, eta: 0:11:31
[2024/11/20 17:33:34] ppcls INFO: [Train][Epoch 15/50][Avg]top1: 0.32405, top2: 0.59680, CELoss: 1.76033, loss: 1.76033
[2024/11/20 17:33:34] ppcls INFO: Skipping export to 'latest' path: E:\PaddleX3b2\output\latest\inference
[2024/11/20 17:33:34] ppcls INFO: Already save model info in E:\PaddleX3b2\output\latest
[2024/11/20 17:33:35] ppcls INFO: [Train][Epoch 16/50][Iter: 0/64]lr(LinearWarmup): 0.03003125, top1: 0.40278, top2: 0.55556, CELoss: 1.71605, loss: 1.71605, batch_cost: 0.29943s, reader_cost: 0.05862, ips: 240.45346 samples/s, eta: 0:11:10
[2024/11/20 17:33:38] ppcls INFO: [Train][Epoch 16/50][Iter: 10/64]lr(LinearWarmup): 0.03034375, top1: 0.34091, top2: 0.55682, CELoss: 1.75828, loss: 1.75828, batch_cost: 0.29676s, reader_cost: 0.05785, ips: 242.62124 samples/s, eta: 0:11:01
[2024/11/20 17:33:41] ppcls INFO: [Train][Epoch 16/50][Iter: 20/64]lr(LinearWarmup): 0.03065625, top1: 0.34812, top2: 0.59044, CELoss: 1.73452, loss: 1.73452, batch_cost: 0.30205s, reader_cost: 0.07734, ips: 238.37132 samples/s, eta: 0:11:10
[2024/11/20 17:33:44] ppcls INFO: [Train][Epoch 16/50][Iter: 30/64]lr(LinearWarmup): 0.03096875, top1: 0.33593, top2: 0.58307, CELoss: 1.74470, loss: 1.74470, batch_cost: 0.28816s, reader_cost: 0.05122, ips: 249.86037 samples/s, eta: 0:10:36
[2024/11/20 17:33:46] ppcls INFO: [Train][Epoch 16/50][Iter: 40/64]lr(LinearWarmup): 0.03128125, top1: 0.33873, top2: 0.58589, CELoss: 1.74587, loss: 1.74587, batch_cost: 0.28745s, reader_cost: 0.05068, ips: 250.47822 samples/s, eta: 0:10:32
[2024/11/20 17:33:49] ppcls INFO: [Train][Epoch 16/50][Iter: 50/64]lr(LinearWarmup): 0.03159375, top1: 0.34455, top2: 0.58566, CELoss: 1.74945, loss: 1.74945, batch_cost: 0.29490s, reader_cost: 0.06005, ips: 244.15189 samples/s, eta: 0:10:45
[2024/11/20 17:33:52] ppcls INFO: [Train][Epoch 16/50][Iter: 60/64]lr(LinearWarmup): 0.03190625, top1: 0.34085, top2: 0.58803, CELoss: 1.74796, loss: 1.74796, batch_cost: 0.29823s, reader_cost: 0.05858, ips: 241.42764 samples/s, eta: 0:10:50
[2024/11/20 17:33:53] ppcls INFO: [Train][Epoch 16/50][Avg]top1: 0.34313, top2: 0.59088, CELoss: 1.74439, loss: 1.74439
[2024/11/20 17:33:53] ppcls INFO: Skipping export to 'latest' path: E:\PaddleX3b2\output\latest\inference
[2024/11/20 17:33:53] ppcls INFO: Already save model info in E:\PaddleX3b2\output\latest
[2024/11/20 17:33:54] ppcls INFO: [Train][Epoch 17/50][Iter: 0/64]lr(LinearWarmup): 0.03203125, top1: 0.30556, top2: 0.61111, CELoss: 1.67692, loss: 1.67692, batch_cost: 0.29779s, reader_cost: 0.06592, ips: 241.78229 samples/s, eta: 0:10:47
[2024/11/20 17:33:57] ppcls INFO: [Train][Epoch 17/50][Iter: 10/64]lr(LinearWarmup): 0.03234375, top1: 0.34975, top2: 0.62247, CELoss: 1.74289, loss: 1.74289, batch_cost: 0.29540s, reader_cost: 0.05781, ips: 243.74050 samples/s, eta: 0:10:39
[2024/11/20 17:34:00] ppcls INFO: [Train][Epoch 17/50][Iter: 20/64]lr(LinearWarmup): 0.03265625, top1: 0.35017, top2: 0.62253, CELoss: 1.73232, loss: 1.73232, batch_cost: 0.30095s, reader_cost: 0.07678, ips: 239.24604 samples/s, eta: 0:10:48
[2024/11/20 17:34:03] ppcls INFO: [Train][Epoch 17/50][Iter: 30/64]lr(LinearWarmup): 0.03296875, top1: 0.34874, top2: 0.61876, CELoss: 1.73713, loss: 1.73713, batch_cost: 0.29552s, reader_cost: 0.05969, ips: 243.63916 samples/s, eta: 0:10:34
[2024/11/20 17:34:05] ppcls INFO: [Train][Epoch 17/50][Iter: 40/64]lr(LinearWarmup): 0.03328125, top1: 0.33769, top2: 0.60964, CELoss: 1.74524, loss: 1.74524, batch_cost: 0.28793s, reader_cost: 0.05502, ips: 250.06187 samples/s, eta: 0:10:15
[2024/11/20 17:34:08] ppcls INFO: [Train][Epoch 17/50][Iter: 50/64]lr(LinearWarmup): 0.03359375, top1: 0.33214, top2: 0.60717, CELoss: 1.74894, loss: 1.74894, batch_cost: 0.29025s, reader_cost: 0.05369, ips: 248.06092 samples/s, eta: 0:10:17
[2024/11/20 17:34:11] ppcls INFO: [Train][Epoch 17/50][Iter: 60/64]lr(LinearWarmup): 0.03390625, top1: 0.33072, top2: 0.60322, CELoss: 1.75104, loss: 1.75104, batch_cost: 0.29551s, reader_cost: 0.05236, ips: 243.64423 samples/s, eta: 0:10:25
[2024/11/20 17:34:12] ppcls INFO: [Train][Epoch 17/50][Avg]top1: 0.33173, top2: 0.60294, CELoss: 1.75067, loss: 1.75067
[2024/11/20 17:34:12] ppcls INFO: Skipping export to 'latest' path: E:\PaddleX3b2\output\latest\inference
OK, thanks for the suggestion.
My environment: PaddleX 3.0-beta2, paddlepaddle-gpu 3.0.0b2, Windows 10 LTSC 2019, Python 3.10, CUDA 11.8
My training command is: python main.py -c paddlex/configs/image_classification/PP-HGNetV2-B6.yaml -o Global.mode=train -o Global.dataset_dir=./dataset/Examples/cls_flowers_examples -o Global.device=gpu:0
The model being trained is PP-HGNetV2-B6. Training itself and using the resulting model both work correctly; there are no issues there.
But during training I noticed that at the end of every epoch, a latest model file is written into the output folder.
With the model I'm currently using, each save writes 839 MB of data, so training for 500 epochs would produce roughly 420 GB of disk writes.
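The arithmetic behind the ~420 GB figure, and what honoring a save interval for latest would change (a rough back-of-the-envelope sketch):

```python
# Write volume of the "latest" checkpoint: 839 MB per save, once per
# epoch, over 500 epochs (decimal GB, matching the figure in the issue).
checkpoint_mb = 839
epochs = 500
total_gb = checkpoint_mb * epochs / 1000  # ~419.5 GB

# If "latest" respected save_interval=20, the same run would write
# only 25 checkpoints instead of 500: a 20x reduction.
total_gb_interval = checkpoint_mb * (epochs // 20) / 1000  # ~21 GB
```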
Will long-term use wear out the SSD prematurely? Write endurance on today's QLC drives is not large.
I'm not sure whether writing latest is strictly necessary, or whether there is a switch to adjust its frequency or interval. I tried adjusting eval_interval and save_interval, but neither affects latest; it is still saved every epoch.
Even leaving aside disk write endurance, this write causes unnecessary waiting. Is there room to optimize this logic?
Here is part of the training log; you can see a write at the end of every epoch: