PaddlePaddle / PaddleX

All-in-One Development Tool based on PaddlePaddle (PaddlePaddle low-code development tool)
Apache License 2.0

PaddleX 3.0beta2: when training an image classification model, how can I avoid the frequent per-epoch disk writes of the "latest" model? They may cause excessive disk I/O and unnecessary waiting #2513

Open 188080501 opened 3 days ago

188080501 commented 3 days ago

My environment: PaddleX 3.0-beta2, paddlepaddle-gpu 3.0.0b2, Windows 10 LTSC 2019, Python 3.10, CUDA 11.8

My training command: python main.py -c paddlex/configs/image_classification/PP-HGNetV2-B6.yaml -o Global.mode=train -o Global.dataset_dir=./dataset/Examples/cls_flowers_examples -o Global.device=gpu:0

The model is PP-HGNetV2-B6. Training itself and inference with the resulting model both work fine; no problems there.

However, I noticed something during training: at the end of every epoch, a "latest" model file is written into the output folder.

With this model, every save writes 839 MB of data. If I train for 500 epochs, that adds up to roughly 420 GB of disk writes.
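The back-of-the-envelope arithmetic is straightforward (a sketch; the per-save size and epoch count are the figures reported above):

```python
# Rough estimate of cumulative disk writes from the per-epoch "latest" save.
checkpoint_mb = 839          # size of one "latest" save, as reported above
epochs = 500                 # planned training length
total_gb = checkpoint_mb * epochs / 1000  # decimal GB
print(f"~{total_gb:.1f} GB written over {epochs} epochs")
# prints: ~419.5 GB written over 500 epochs
```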

Over long-term use, won't this wear out the SSD prematurely? Write endurance on today's QLC drives is not exactly generous.

I am not sure whether writing "latest" is strictly necessary, or whether there is a switch to control the frequency or interval of these writes. I tried adjusting eval_interval and save_interval, but neither has any effect on "latest": it is still saved every epoch.

Even setting SSD endurance aside, the write causes unnecessary waiting between epochs. Is there room to optimize this logic?

Here is part of the training log; you can see a write at the end of every epoch:

[2024/11/19 15:41:16] ppcls INFO: [Train][Epoch 5/200][Iter: 0/64]lr(LinearWarmup): 0.00803125, top1: 0.30556, top2: 0.52778, CELoss: 1.82607, loss: 1.82607, batch_cost: 0.36312s, reader_cost: 0.03580, ips: 198.27934 samples/s, eta: 1:15:55
[2024/11/19 15:41:19] ppcls INFO: [Train][Epoch 5/200][Iter: 10/64]lr(LinearWarmup): 0.00834375, top1: 0.33289, top2: 0.54765, CELoss: 1.80689, loss: 1.80689, batch_cost: 0.29032s, reader_cost: 0.05674, ips: 248.00506 samples/s, eta: 1:00:38
[2024/11/19 15:41:22] ppcls INFO: [Train][Epoch 5/200][Iter: 20/64]lr(LinearWarmup): 0.00865625, top1: 0.32082, top2: 0.55768, CELoss: 1.80371, loss: 1.80371, batch_cost: 0.28872s, reader_cost: 0.05230, ips: 249.38087 samples/s, eta: 1:00:15
[2024/11/19 15:41:25] ppcls INFO: [Train][Epoch 5/200][Iter: 30/64]lr(LinearWarmup): 0.00896875, top1: 0.31899, top2: 0.56247, CELoss: 1.79829, loss: 1.79829, batch_cost: 0.30721s, reader_cost: 0.07179, ips: 234.36565 samples/s, eta: 1:04:04
[2024/11/19 15:41:28] ppcls INFO: [Train][Epoch 5/200][Iter: 40/64]lr(LinearWarmup): 0.00928125, top1: 0.31429, top2: 0.55663, CELoss: 1.80130, loss: 1.80130, batch_cost: 0.30344s, reader_cost: 0.06693, ips: 237.28127 samples/s, eta: 1:03:14
[2024/11/19 15:41:31] ppcls INFO: [Train][Epoch 5/200][Iter: 50/64]lr(LinearWarmup): 0.00959375, top1: 0.30952, top2: 0.55945, CELoss: 1.80375, loss: 1.80375, batch_cost: 0.30091s, reader_cost: 0.06299, ips: 239.27748 samples/s, eta: 1:02:39
[2024/11/19 15:41:34] ppcls INFO: [Train][Epoch 5/200][Iter: 60/64]lr(LinearWarmup): 0.00990625, top1: 0.31162, top2: 0.56157, CELoss: 1.80387, loss: 1.80387, batch_cost: 0.29935s, reader_cost: 0.06248, ips: 240.52504 samples/s, eta: 1:02:17
[2024/11/19 15:41:35] ppcls INFO: [Train][Epoch 5/200][Avg]top1: 0.31375, top2: 0.56391, CELoss: 1.80203, loss: 1.80203
[2024/11/19 15:41:37] ppcls INFO: Already save model in E:\PaddleX3b2\output\latest\latest
[2024/11/19 15:41:46] ppcls INFO: Export inference config file to E:\PaddleX3b2\output\latest\inference\inference.yml
[2024/11/19 15:41:46] ppcls INFO: Export succeeded! The inference model exported has been saved in "E:\PaddleX3b2\output\latest\inference\inference".
[2024/11/19 15:41:47] ppcls INFO: Already save model info in E:\PaddleX3b2\output\latest
[2024/11/19 15:41:47] ppcls INFO: [Train][Epoch 6/200][Iter: 0/64]lr(LinearWarmup): 0.01003125, top1: 0.30556, top2: 0.50000, CELoss: 1.84407, loss: 1.84407, batch_cost: 0.35583s, reader_cost: 0.05355, ips: 202.34455 samples/s, eta: 1:14:00
[2024/11/19 15:41:50] ppcls INFO: [Train][Epoch 6/200][Iter: 10/64]lr(LinearWarmup): 0.01034375, top1: 0.27383, top2: 0.54094, CELoss: 1.82967, loss: 1.82967, batch_cost: 0.34692s, reader_cost: 0.10457, ips: 207.53880 samples/s, eta: 1:12:06
[2024/11/19 15:41:53] ppcls INFO: [Train][Epoch 6/200][Iter: 20/64]lr(LinearWarmup): 0.01065625, top1: 0.30171, top2: 0.55904, CELoss: 1.81121, loss: 1.81121, batch_cost: 0.30611s, reader_cost: 0.06687, ips: 235.21259 samples/s, eta: 1:03:34
[2024/11/19 15:41:56] ppcls INFO: [Train][Epoch 6/200][Iter: 30/64]lr(LinearWarmup): 0.01096875, top1: 0.31350, top2: 0.55927, CELoss: 1.79922, loss: 1.79922, batch_cost: 0.30571s, reader_cost: 0.06574, ips: 235.51368 samples/s, eta: 1:03:26
[2024/11/19 15:41:59] ppcls INFO: [Train][Epoch 6/200][Iter: 40/64]lr(LinearWarmup): 0.01128125, top1: 0.31842, top2: 0.56110, CELoss: 1.79976, loss: 1.79976, batch_cost: 0.30363s, reader_cost: 0.06679, ips: 237.13441 samples/s, eta: 1:02:57
[2024/11/19 15:42:02] ppcls INFO: [Train][Epoch 6/200][Iter: 50/64]lr(LinearWarmup): 0.01159375, top1: 0.31834, top2: 0.55945, CELoss: 1.79855, loss: 1.79855, batch_cost: 0.30660s, reader_cost: 0.06831, ips: 234.83096 samples/s, eta: 1:03:31
[2024/11/19 15:42:05] ppcls INFO: [Train][Epoch 6/200][Iter: 60/64]lr(LinearWarmup): 0.01190625, top1: 0.31922, top2: 0.56064, CELoss: 1.79549, loss: 1.79549, batch_cost: 0.30915s, reader_cost: 0.07311, ips: 232.89494 samples/s, eta: 1:03:59
[2024/11/19 15:42:06] ppcls INFO: [Train][Epoch 6/200][Avg]top1: 0.31901, top2: 0.56216, CELoss: 1.79747, loss: 1.79747
[2024/11/19 15:42:08] ppcls INFO: Already save model in E:\PaddleX3b2\output\latest\latest
[2024/11/19 15:42:18] ppcls INFO: Export inference config file to E:\PaddleX3b2\output\latest\inference\inference.yml
[2024/11/19 15:42:18] ppcls INFO: Export succeeded! The inference model exported has been saved in "E:\PaddleX3b2\output\latest\inference\inference".
[2024/11/19 15:42:18] ppcls INFO: Already save model info in E:\PaddleX3b2\output\latest
[2024/11/19 15:42:18] ppcls INFO: [Train][Epoch 7/200][Iter: 0/64]lr(LinearWarmup): 0.01203125, top1: 0.29167, top2: 0.51389, CELoss: 1.86679, loss: 1.86679, batch_cost: 0.34317s, reader_cost: 0.03906, ips: 209.81038 samples/s, eta: 1:11:00
[2024/11/19 15:42:21] ppcls INFO: [Train][Epoch 7/200][Iter: 10/64]lr(LinearWarmup): 0.01234375, top1: 0.31409, top2: 0.59329, CELoss: 1.79527, loss: 1.79527, batch_cost: 0.30779s, reader_cost: 0.09260, ips: 233.92854 samples/s, eta: 1:03:38
[2024/11/19 15:42:24] ppcls INFO: [Train][Epoch 7/200][Iter: 20/64]lr(LinearWarmup): 0.01265625, top1: 0.31945, top2: 0.57065, CELoss: 1.80427, loss: 1.80427, batch_cost: 0.29439s, reader_cost: 0.05821, ips: 244.57298 samples/s, eta: 1:00:49
[2024/11/19 15:42:27] ppcls INFO: [Train][Epoch 7/200][Iter: 30/64]lr(LinearWarmup): 0.01296875, top1: 0.31945, top2: 0.57208, CELoss: 1.79910, loss: 1.79910, batch_cost: 0.30486s, reader_cost: 0.07002, ips: 236.17567 samples/s, eta: 1:02:55
[2024/11/19 15:42:30] ppcls INFO: [Train][Epoch 7/200][Iter: 40/64]lr(LinearWarmup): 0.01328125, top1: 0.32392, top2: 0.56833, CELoss: 1.79408, loss: 1.79408, batch_cost: 0.30292s, reader_cost: 0.06790, ips: 237.68713 samples/s, eta: 1:02:28
[2024/11/19 15:42:33] ppcls INFO: [Train][Epoch 7/200][Iter: 50/64]lr(LinearWarmup): 0.01359375, top1: 0.32276, top2: 0.57021, CELoss: 1.79169, loss: 1.79169, batch_cost: 0.29811s, reader_cost: 0.06093, ips: 241.52322 samples/s, eta: 1:01:26
[2024/11/19 15:42:36] ppcls INFO: [Train][Epoch 7/200][Iter: 60/64]lr(LinearWarmup): 0.01390625, top1: 0.32497, top2: 0.57514, CELoss: 1.79087, loss: 1.79087, batch_cost: 0.30157s, reader_cost: 0.06483, ips: 238.75042 samples/s, eta: 1:02:06
[2024/11/19 15:42:37] ppcls INFO: [Train][Epoch 7/200][Avg]top1: 0.32559, top2: 0.57268, CELoss: 1.79186, loss: 1.79186
[2024/11/19 15:42:39] ppcls INFO: Already save model in E:\PaddleX3b2\output\latest\latest
[2024/11/19 15:42:49] ppcls INFO: Export inference config file to E:\PaddleX3b2\output\latest\inference\inference.yml
[2024/11/19 15:42:49] ppcls INFO: Export succeeded! The inference model exported has been saved in "E:\PaddleX3b2\output\latest\inference\inference".
changdazhou commented 2 days ago

OK, we'll confirm this and get back to you.

cuicheng01 commented 2 days ago

At the moment the write is mandatory: it exists so that if the program is interrupted, training can resume from the checkpoint. Your suggestion is a good one, though, and we may add this capability in a future release.

188080501 commented 2 days ago

> At the moment the write is mandatory: it exists so that if the program is interrupted, training can resume from the checkpoint. Your suggestion is a good one, though, and we may add this capability in a future release.

May I ask: does this write currently have to happen every epoch? Can an interval be set, or can it be turned off entirely?

changdazhou commented 2 days ago

You can adjust it via the model save interval, but it cannot be turned off. Add the parameter: -o Train.save_interval=5

188080501 commented 2 days ago

> You can adjust it via the model save interval, but it cannot be turned off. Add the parameter: -o Train.save_interval=5

I have already changed eval_interval and save_interval in paddlex\configs\image_classification\PP-HGNetV2-B6.yaml. Neither of those two parameters, nor the one you posted, has any effect on the "latest" save: with them set, the "latest" model data is still written once per epoch.

Is that expected behaviour?

188080501 commented 2 days ago

> At the moment the write is mandatory: it exists so that if the program is interrupted, training can resume from the checkpoint. Your suggestion is a good one, though, and we may add this capability in a future release.

The existing save_interval already covers that use case: with a save every 20 epochs, say, a crashed run can resume from epoch 20, or epoch 40, and so on.

Personally, if the goal is only crash recovery, I don't see why a save every single epoch is necessary.
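The interval-gated saving described above is a common checkpointing pattern; a minimal sketch of the decision logic (a hypothetical helper for illustration, not PaddleX's actual code):

```python
def should_save(epoch: int, save_interval: int, final_epoch: int) -> bool:
    """Hypothetical interval gate for checkpoint writes (not PaddleClas code).

    Saves only on interval boundaries and at the final epoch, so a crash
    loses at most save_interval - 1 epochs of progress.
    """
    return epoch % save_interval == 0 or epoch == final_epoch

# With save_interval=20, epochs 20, 40, 60, ... trigger a save; every
# other epoch skips the multi-second checkpoint write entirely.
```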

changdazhou commented 2 days ago

I just double-checked: this cannot currently be changed. We will evaluate the requirement you have raised and optimize it in a future release.

188080501 commented 2 days ago

> I just double-checked: this cannot currently be changed. We will evaluate the requirement you have raised and optimize it in a future release.

Thanks for the reply; I'll put together a temporary workaround myself.

188080501 commented 2 days ago

My temporary workaround modifies two files.

First: paddlex\repo_manager\repos\PaddleClas\ppcls\utils\save_load.py. At the top of the save_model function, add:

    if prefix == "latest":
        return

Second: E:\PaddleX3b2\paddlex\repo_manager\repos\PaddleClas\ppcls\engine\engine.py. At the top of the export function, add:

        if 'latest' in os.path.abspath(save_path).split(os.sep):
            logger.info(f"Skipping export to 'latest' path: {save_path}")
            return

With these changes, the "latest" model is no longer saved, and the trained model still runs inference normally, so there should be no side effects (though I have not verified this further). The gap between epochs is also much shorter, about 1 second instead of the 10+ seconds previously spent waiting on the save, and GPU utilization is noticeably higher.
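For anyone who wants the skip to be switchable rather than hard-coded, the same check could be gated behind an environment variable (a sketch; PADDLEX_SKIP_LATEST_SAVE is a made-up name for illustration, not a real PaddleX option):

```python
import os

def skip_latest_save(prefix: str) -> bool:
    """Decide whether to skip a checkpoint write for the 'latest' prefix.

    Hypothetical gate for the patch above: the write is only skipped when
    the (made-up) PADDLEX_SKIP_LATEST_SAVE environment variable is "1".
    """
    return prefix == "latest" and os.environ.get("PADDLEX_SKIP_LATEST_SAVE") == "1"

# At the top of save_model() in save_load.py one would then write:
#     if skip_latest_save(prefix):
#         return
```

This keeps the default behaviour (and crash recovery) intact unless the variable is explicitly set before launching training.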

Part of the training log after the change:

[2024/11/20 17:33:15] ppcls INFO: [Train][Epoch 15/50][Iter: 0/64]lr(LinearWarmup): 0.02803125, top1: 0.40278, top2: 0.65278, CELoss: 1.70701, loss: 1.70701, batch_cost: 0.27744s, reader_cost: 0.04297, ips: 259.51662 samples/s, eta: 0:10:39
[2024/11/20 17:33:18] ppcls INFO: [Train][Epoch 15/50][Iter: 10/64]lr(LinearWarmup): 0.02834375, top1: 0.34217, top2: 0.61237, CELoss: 1.76351, loss: 1.76351, batch_cost: 0.27889s, reader_cost: 0.04166, ips: 258.16577 samples/s, eta: 0:10:39
[2024/11/20 17:33:21] ppcls INFO: [Train][Epoch 15/50][Iter: 20/64]lr(LinearWarmup): 0.02865625, top1: 0.31195, top2: 0.58225, CELoss: 1.79635, loss: 1.79635, batch_cost: 0.30269s, reader_cost: 0.07979, ips: 237.86815 samples/s, eta: 0:11:31
[2024/11/20 17:33:24] ppcls INFO: [Train][Epoch 15/50][Iter: 30/64]lr(LinearWarmup): 0.02896875, top1: 0.30847, top2: 0.58993, CELoss: 1.77478, loss: 1.77478, batch_cost: 0.29790s, reader_cost: 0.05787, ips: 241.68835 samples/s, eta: 0:11:17
[2024/11/20 17:33:27] ppcls INFO: [Train][Epoch 15/50][Iter: 40/64]lr(LinearWarmup): 0.02928125, top1: 0.31256, top2: 0.59002, CELoss: 1.76980, loss: 1.76980, batch_cost: 0.29471s, reader_cost: 0.05649, ips: 244.30908 samples/s, eta: 0:11:07
[2024/11/20 17:33:30] ppcls INFO: [Train][Epoch 15/50][Iter: 50/64]lr(LinearWarmup): 0.02959375, top1: 0.31779, top2: 0.59807, CELoss: 1.76329, loss: 1.76329, batch_cost: 0.30174s, reader_cost: 0.06176, ips: 238.61734 samples/s, eta: 0:11:20
[2024/11/20 17:33:33] ppcls INFO: [Train][Epoch 15/50][Iter: 60/64]lr(LinearWarmup): 0.02990625, top1: 0.32336, top2: 0.59724, CELoss: 1.76220, loss: 1.76220, batch_cost: 0.30819s, reader_cost: 0.06626, ips: 233.62294 samples/s, eta: 0:11:31
[2024/11/20 17:33:34] ppcls INFO: [Train][Epoch 15/50][Avg]top1: 0.32405, top2: 0.59680, CELoss: 1.76033, loss: 1.76033
[2024/11/20 17:33:34] ppcls INFO: Skipping export to 'latest' path: E:\PaddleX3b2\output\latest\inference
[2024/11/20 17:33:34] ppcls INFO: Already save model info in E:\PaddleX3b2\output\latest
[2024/11/20 17:33:35] ppcls INFO: [Train][Epoch 16/50][Iter: 0/64]lr(LinearWarmup): 0.03003125, top1: 0.40278, top2: 0.55556, CELoss: 1.71605, loss: 1.71605, batch_cost: 0.29943s, reader_cost: 0.05862, ips: 240.45346 samples/s, eta: 0:11:10
[2024/11/20 17:33:38] ppcls INFO: [Train][Epoch 16/50][Iter: 10/64]lr(LinearWarmup): 0.03034375, top1: 0.34091, top2: 0.55682, CELoss: 1.75828, loss: 1.75828, batch_cost: 0.29676s, reader_cost: 0.05785, ips: 242.62124 samples/s, eta: 0:11:01
[2024/11/20 17:33:41] ppcls INFO: [Train][Epoch 16/50][Iter: 20/64]lr(LinearWarmup): 0.03065625, top1: 0.34812, top2: 0.59044, CELoss: 1.73452, loss: 1.73452, batch_cost: 0.30205s, reader_cost: 0.07734, ips: 238.37132 samples/s, eta: 0:11:10
[2024/11/20 17:33:44] ppcls INFO: [Train][Epoch 16/50][Iter: 30/64]lr(LinearWarmup): 0.03096875, top1: 0.33593, top2: 0.58307, CELoss: 1.74470, loss: 1.74470, batch_cost: 0.28816s, reader_cost: 0.05122, ips: 249.86037 samples/s, eta: 0:10:36
[2024/11/20 17:33:46] ppcls INFO: [Train][Epoch 16/50][Iter: 40/64]lr(LinearWarmup): 0.03128125, top1: 0.33873, top2: 0.58589, CELoss: 1.74587, loss: 1.74587, batch_cost: 0.28745s, reader_cost: 0.05068, ips: 250.47822 samples/s, eta: 0:10:32
[2024/11/20 17:33:49] ppcls INFO: [Train][Epoch 16/50][Iter: 50/64]lr(LinearWarmup): 0.03159375, top1: 0.34455, top2: 0.58566, CELoss: 1.74945, loss: 1.74945, batch_cost: 0.29490s, reader_cost: 0.06005, ips: 244.15189 samples/s, eta: 0:10:45
[2024/11/20 17:33:52] ppcls INFO: [Train][Epoch 16/50][Iter: 60/64]lr(LinearWarmup): 0.03190625, top1: 0.34085, top2: 0.58803, CELoss: 1.74796, loss: 1.74796, batch_cost: 0.29823s, reader_cost: 0.05858, ips: 241.42764 samples/s, eta: 0:10:50
[2024/11/20 17:33:53] ppcls INFO: [Train][Epoch 16/50][Avg]top1: 0.34313, top2: 0.59088, CELoss: 1.74439, loss: 1.74439
[2024/11/20 17:33:53] ppcls INFO: Skipping export to 'latest' path: E:\PaddleX3b2\output\latest\inference
[2024/11/20 17:33:53] ppcls INFO: Already save model info in E:\PaddleX3b2\output\latest
[2024/11/20 17:33:54] ppcls INFO: [Train][Epoch 17/50][Iter: 0/64]lr(LinearWarmup): 0.03203125, top1: 0.30556, top2: 0.61111, CELoss: 1.67692, loss: 1.67692, batch_cost: 0.29779s, reader_cost: 0.06592, ips: 241.78229 samples/s, eta: 0:10:47
[2024/11/20 17:33:57] ppcls INFO: [Train][Epoch 17/50][Iter: 10/64]lr(LinearWarmup): 0.03234375, top1: 0.34975, top2: 0.62247, CELoss: 1.74289, loss: 1.74289, batch_cost: 0.29540s, reader_cost: 0.05781, ips: 243.74050 samples/s, eta: 0:10:39
[2024/11/20 17:34:00] ppcls INFO: [Train][Epoch 17/50][Iter: 20/64]lr(LinearWarmup): 0.03265625, top1: 0.35017, top2: 0.62253, CELoss: 1.73232, loss: 1.73232, batch_cost: 0.30095s, reader_cost: 0.07678, ips: 239.24604 samples/s, eta: 0:10:48
[2024/11/20 17:34:03] ppcls INFO: [Train][Epoch 17/50][Iter: 30/64]lr(LinearWarmup): 0.03296875, top1: 0.34874, top2: 0.61876, CELoss: 1.73713, loss: 1.73713, batch_cost: 0.29552s, reader_cost: 0.05969, ips: 243.63916 samples/s, eta: 0:10:34
[2024/11/20 17:34:05] ppcls INFO: [Train][Epoch 17/50][Iter: 40/64]lr(LinearWarmup): 0.03328125, top1: 0.33769, top2: 0.60964, CELoss: 1.74524, loss: 1.74524, batch_cost: 0.28793s, reader_cost: 0.05502, ips: 250.06187 samples/s, eta: 0:10:15
[2024/11/20 17:34:08] ppcls INFO: [Train][Epoch 17/50][Iter: 50/64]lr(LinearWarmup): 0.03359375, top1: 0.33214, top2: 0.60717, CELoss: 1.74894, loss: 1.74894, batch_cost: 0.29025s, reader_cost: 0.05369, ips: 248.06092 samples/s, eta: 0:10:17
[2024/11/20 17:34:11] ppcls INFO: [Train][Epoch 17/50][Iter: 60/64]lr(LinearWarmup): 0.03390625, top1: 0.33072, top2: 0.60322, CELoss: 1.75104, loss: 1.75104, batch_cost: 0.29551s, reader_cost: 0.05236, ips: 243.64423 samples/s, eta: 0:10:25
[2024/11/20 17:34:12] ppcls INFO: [Train][Epoch 17/50][Avg]top1: 0.33173, top2: 0.60294, CELoss: 1.75067, loss: 1.75067
[2024/11/20 17:34:12] ppcls INFO: Skipping export to 'latest' path: E:\PaddleX3b2\output\latest\inference
changdazhou commented 1 day ago

OK, thank you for the suggestion.