Open Chuyaoyuan opened 1 year ago
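After copying mean_std.json from train_l into the data directory, training runs normally. During training I hit "Out of memory error on GPU 0"; after changing batch_size from the default 32 to 16, training started. Log: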
2023-07-24 08:49:00.413 | INFO | paddlespeech.s2t.exps.u2.model:do_train:214 - Train: Rank: 0, epoch: 0, step: 62, lr: 0.00001260, loss: 261.90643311, att_loss: 256.81335449, ctc_loss: 273.79022217, batch_size: 16, accum: 32, step_cost: 0.66807246, iter: 2000, reader_cost: 0.00027585, batch_cost: 0.66834831, samples: 16, ips: 23.93961308 samples/s
2023-07-24 08:49:40.230 | INFO | paddlespeech.s2t.exps.u2.model:do_train:214 - Train: Rank: 0, epoch: 0, step: 65, lr: 0.00001320, loss: 51.20448685, att_loss: 51.88084030, ctc_loss: 49.62632751, batch_size: 16, accum: 32, step_cost: 0.24358082, iter: 2100, reader_cost: 0.00029826, batch_cost: 0.24387908, samples: 16, ips: 65.60628329 samples/s
2023-07-24 08:50:05.563 | INFO | paddlespeech.s2t.exps.u2.model:do_train:214 - Train: Rank: 0, epoch: 0, step: 68, lr: 0.00001380, loss: 60.56448364, att_loss: 60.88214111, ctc_loss: 59.82329559, batch_size: 16, accum: 32, step_cost: 0.25470424, iter: 2200, reader_cost: 0.00027466, batch_cost: 0.25497890, samples: 16, ips: 62.75029150 samples/s
2023-07-24 08:50:32.380 | INFO | paddlespeech.s2t.exps.u2.model:do_train:214 - Train: Rank: 0, epoch: 0, step: 71, lr: 0.00001440, loss: 80.20925903, att_loss: 79.79570770, ctc_loss: 81.17420197, batch_size: 16, accum: 32, step_cost: 0.26200438, iter: 2300, reader_cost: 0.00037360, batch_cost: 0.26237798, samples: 16, ips: 60.98072773 samples/s
2023-07-24 08:51:01.431 | INFO | paddlespeech.s2t.exps.u2.model:do_train:214 - Train: Rank: 0, epoch: 0, step: 74, lr: 0.00001500, loss: 85.94498444, att_loss: 85.52644348, ctc_loss: 86.92157745, batch_size: 16, accum: 32, step_cost: 0.42933178, iter: 2400, reader_cost: 0.00029373, batch_cost: 0.42962551, samples: 16, ips: 37.24173631 samples/s
2023-07-24 08:52:04.883 | INFO | paddlespeech.s2t.exps.u2.model:do_train:214 - Train: Rank: 0, epoch: 0, step: 78, lr: 0.00001580, loss: 157.27227783, att_loss: 155.45193481, ctc_loss: 161.51976013, batch_size: 16, accum: 32, step_cost: 0.67493510, iter: 2500, reader_cost: 0.00032640, batch_cost: 0.67526150, samples: 16, ips: 23.69452436 samples/s
2023-07-24 08:52:31.755 | INFO | paddlespeech.s2t.exps.u2.model:do_train:214 - Train: Rank: 0, epoch: 0, step: 81, lr: 0.00001640, loss: 38.29604340, att_loss: 38.94356537, ctc_loss: 36.78514862, batch_size: 16, accum: 32, step_cost: 0.24099350, iter: 2600, reader_cost: 0.00031066, batch_cost: 0.24130416, samples: 16, ips: 66.30635815 samples/s
2023-07-24 08:52:57.162 | INFO | paddlespeech.s2t.exps.u2.model:do_train:214 - Train: Rank: 0, epoch: 0, step: 84, lr: 0.00001700, loss: 60.50916290, att_loss: 60.66139984, ctc_loss: 60.15393829, batch_size: 16, accum: 32, step_cost: 0.26090431, iter: 2700, reader_cost: 0.00029659, batch_cost: 0.26120090, samples: 16, ips: 61.25553053 samples/s
2023-07-24 08:53:23.778 | INFO | paddlespeech.s2t.exps.u2.model:do_train:214 - Train: Rank: 0, epoch: 0, step: 87, lr: 0.00001760, loss: 80.58346558, att_loss: 80.13731384, ctc_loss: 81.62448883, batch_size: 16, accum: 32, step_cost: 0.26380491, iter: 2800, reader_cost: 0.00027609, batch_cost: 0.26408100, samples: 16, ips: 60.58747097 samples/s
2023-07-24 08:53:52.784 | INFO | paddlespeech.s2t.exps.u2.model:do_train:214 - Train: Rank: 0, epoch: 0, step: 90, lr: 0.00001820, loss: 101.23880768, att_loss: 100.54857635, ctc_loss: 102.84934998, batch_size: 16, accum: 32, step_cost: 0.30528116, iter: 2900, reader_cost: 0.00027537, batch_cost: 0.30555654, samples: 16, ips: 52.36346839 samples
Could you check whether anything looks wrong, and roughly how long this training should take? The GPUs are 2 × Tesla T4 16G. Thanks.
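On the "how long" question, a rough way to get a number is to measure wall-clock seconds per iteration from the timestamps and `iter:` fields in the log above. The sketch below does that; the sample lines are abbreviated copies of the first and last log lines, and `iters_per_epoch` and `epochs` are hypothetical inputs you would have to fill in yourself, since neither value appears in this thread. Treat the result as an estimate, not a guarantee.

```python
import re
from datetime import datetime

# Abbreviated copies of the first and last log lines above; "..." marks
# fields that were trimmed because the parser does not need them.
SAMPLE_LOG = """\
2023-07-24 08:49:00.413 | INFO | ... step: 62, ... batch_size: 16, accum: 32, ... iter: 2000, ...
2023-07-24 08:53:52.784 | INFO | ... step: 90, ... batch_size: 16, accum: 32, ... iter: 2900, ...
"""

# Timestamp at the start of the line, then the first "iter: N" field.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*?\biter: (\d+)"
)


def parse_points(log_text):
    """Return a list of (timestamp, iter) pairs found in the log text."""
    points = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            points.append((ts, int(match.group(2))))
    return points


def seconds_per_iter(points):
    """Average wall-clock seconds per iteration between first and last point."""
    (t0, i0), (t1, i1) = points[0], points[-1]
    return (t1 - t0).total_seconds() / (i1 - i0)


def estimate_hours(sec_per_iter, iters_per_epoch, epochs):
    """Rough total time; iters_per_epoch and epochs are user-supplied guesses."""
    return sec_per_iter * iters_per_epoch * epochs / 3600.0


if __name__ == "__main__":
    spi = seconds_per_iter(parse_points(SAMPLE_LOG))
    print(f"about {spi:.3f} s per iteration")
```

For the lines shown, this works out to roughly 0.32 s per iteration (about 292 s for 900 iterations). Note also that `step` advances by about 3 for every 100 `iter`, which appears consistent with `accum: 32`, so one optimizer step seems to cover 32 batches of 16 samples.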
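Hello, how did you run the first step, the data processing? I ran ./run.sh --stage 0 --stop_stage 0 and it reports that the num_workers, train_config, in_scp and out_cmvn arguments are missing. I found that compute-cmvn-stats.py comes from the wenet project, but when it is used here, the paddlespeech config files do not contain the parameters that compute-cmvn-stats.py requires. What should I do?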
Environment
python==3.8
paddlepaddle==0.0.0 (2.5.0 development build)
paddlespeech==0.0.0, built and installed from the develop branch

Problem description
I am training on the wenetspeech dataset following the example at https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/wenetspeech/asr1. Data processing has finished, but running training fails with No such file or directory: 'data/mean_std.json'.

Command: bash run.sh --gpus 0,1 --stage 1 --stop_stage 1
Training script: https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/wenetspeech/asr1/local/train.sh
Log:
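In examples/wenetspeech/asr1/data/ there is indeed no mean_std.json, but I found a mean_std.json in each of the subfolders. Below is the structure of examples/wenetspeech/asr1/data/:
I have previously run the examples/aishell/asr1/ pipeline and found that the wenetspeech and aishell scripts and generated files differ considerably. Could you help check how to run the wenetspeech/asr1 training end to end? Thanks.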
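For anyone hitting the same missing-file error, here is a minimal sketch of the workaround mentioned in this thread (copying the train_l copy of mean_std.json up into data/). The `data/<subset>/mean_std.json` layout and the function names are assumptions for illustration; this is not part of the PaddleSpeech scripts, and whether train_l is the right subset for your run is something to confirm yourself.

```python
import shutil
from pathlib import Path


def find_cmvn_files(data_dir):
    """List mean_std.json files sitting one level below data_dir."""
    return sorted(Path(data_dir).glob("*/mean_std.json"))


def install_cmvn(data_dir, subset="train_l"):
    """Copy <data_dir>/<subset>/mean_std.json to <data_dir>/mean_std.json.

    The subset name is an assumption taken from this thread. An existing
    top-level file is left untouched.
    """
    data_dir = Path(data_dir)
    src = data_dir / subset / "mean_std.json"
    dst = data_dir / "mean_std.json"
    if not src.is_file():
        raise FileNotFoundError(f"no CMVN stats found at {src}")
    if not dst.exists():
        shutil.copyfile(src, dst)
    return dst


if __name__ == "__main__":
    # Run from examples/wenetspeech/asr1 so that "data" resolves correctly.
    for path in find_cmvn_files("data"):
        print("found:", path)
    print("installed:", install_cmvn("data"))
```

Listing the candidates first makes it easier to see which subsets actually produced statistics before picking one to copy.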