Open Bradly-s opened 7 months ago
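Hello. According to the log you provided, the training loss is 0.00000. Taken at face value, this would mean the model fits the training data perfectly, which is very unlikely in practice. Possible causes include: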
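Learning rate too high: an excessive learning rate can cause exploding gradients, which can drive the loss to NaN or 0. Try lowering the learning rate and see whether that resolves the problem.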
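Data problems: if the dataset has issues, such as wrong labels or a faulty preprocessing step, the model may be unable to learn correctly. Check your dataset and make sure both the data and the labels are correct.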
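Model problems: a problem in the model structure or its initialization can also produce a loss of 0. Check that the model structure and initialization are correct.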
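As a starting point for the data check, here is a minimal sketch that validates an annotation list before training. It assumes each line has the form `<video path> <integer label>` and that labels must lie in `[0, num_classes)`; the function name `check_label_lines` and the sample entries are hypothetical, so adapt them to your actual list file and class count.

```python
from collections import Counter


def check_label_lines(lines, num_classes):
    """Validate '<path> <label>' lines (assumed format).

    Returns (problems, counts): problems is a list of (line_number, message),
    counts maps each valid label to how many times it occurs.
    """
    problems = []
    counts = Counter()
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the last whitespace run so paths containing spaces survive.
        parts = line.rsplit(maxsplit=1)
        if len(parts) != 2:
            problems.append((lineno, "expected '<path> <label>'"))
            continue
        _path, label_text = parts
        try:
            label = int(label_text)
        except ValueError:
            problems.append((lineno, f"label {label_text!r} is not an integer"))
            continue
        if not 0 <= label < num_classes:
            problems.append((lineno, f"label {label} outside [0, {num_classes})"))
            continue
        counts[label] += 1
    return problems, counts


# Hypothetical entries, for illustration only.
demo = ["a.mp4 0", "b.mp4 3", "c.mp4 x", "", "d.mp4"]
demo_problems, demo_counts = check_label_lines(demo, num_classes=2)
for lineno, message in demo_problems:
    print(f"line {lineno}: {message}")
```

If the list is clean, `problems` is empty and `counts` shows the class distribution; a file in which every sample carries the same label, or labels outside the configured class range, is worth investigating before you change the model or the learning rate.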
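You also mentioned that GPU utilization changes slowly and stays at 0 for long periods. This is probably because data loading cannot keep up with training: in your log, reader_cost accounts for about 22 of the roughly 23 seconds of batch_cost at step 0 of every epoch. Try optimizing the data loading pipeline, for example by using a more efficient preprocessing library or by loading data in parallel with multiple worker threads.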
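To quantify how much of each logged step is spent waiting on data, the following sketch extracts `batch_cost` and `reader_cost` from training-step lines and reports their ratio. The regular expression assumes the exact log format shown in this thread, and `reader_share` is a hypothetical helper name, not part of any library.

```python
import re

# Matches training-step lines in the format shown in this thread (assumption).
STEP_RE = re.compile(
    r"epoch:\[\s*(\d+)/\s*\d+\]\s+train step:(\d+).*?"
    r"batch_cost:\s*([\d.]+) sec, reader_cost:\s*([\d.]+) sec"
)


def reader_share(log_text):
    """Return [(epoch, step, reader_cost / batch_cost)] for each train step."""
    rows = []
    for match in STEP_RE.finditer(log_text):
        epoch, step = int(match.group(1)), int(match.group(2))
        batch, reader = float(match.group(3)), float(match.group(4))
        if batch > 0:  # avoid division by zero on degenerate entries
            rows.append((epoch, step, reader / batch))
    return rows


# One line copied from the log in this thread.
sample = (
    "[11/08 16:52:25] epoch:[ 1/196] train step:0 loss: 0.00000 lr: 0.010000 "
    "top1: 0.00000 top5: 0.00000 batch_cost: 23.42944 sec, "
    "reader_cost: 22.06920 sec, ips: 0.68290 instance/sec, eta: 2:32:41"
)
for epoch, step, share in reader_share(sample):
    print(f"epoch {epoch} step {step}: reader share {share:.0%}")
```

A share close to 1 means that step was spent almost entirely waiting for the reader, which would be consistent with the GPU sitting idle.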
Also, if you need image or video understanding and generation, you can use our new cross-modal toolkit: https://github.com/PaddlePaddle/PaddleMIX/tree/develop
SlowFast training issue:
Part of the training log is shown below:
[11/08 16:52:02] Training in fp32 mode.
[11/08 16:52:25] epoch:[ 1/196] train step:0 loss: 0.00000 lr: 0.010000 top1: 0.00000 top5: 0.00000 batch_cost: 23.42944 sec, reader_cost: 22.06920 sec, ips: 0.68290 instance/sec, eta: 2:32:41
[11/08 16:52:25] END epoch:1 train loss_avg: 0.00000 top1_avg: 0.00000 top5_avg: 0.00000 avg_batch_cost: 0.26917 sec, avg_reader_cost: 0.00035 sec, batch_cost_sum: 23.69861 sec, avg_ips: 1.35029 instance/sec.
[11/08 16:52:45] Computing precise BN 1 / 2...
[11/08 16:52:46] Computing precise BN 2 / 2...
[11/08 16:53:05] epoch:[ 1/196] val step:0 loss: 0.00000 top1: 0.00000 top5: 0.00000 batch_cost: 18.35165 sec, reader_cost: 0.00000 sec, ips: 0.87186 instance/sec.
[11/08 16:53:05] END epoch:1 val loss_avg: 0.00000 top1_avg: 0.00000 top5_avg: 0.00000 avg_batch_cost: 0.01305 sec, avg_reader_cost: 0.00000 sec, batch_cost_sum: 18.36470 sec, avg_ips: 1.74247 instance/sec.
[11/08 16:53:29] epoch:[ 2/196] train step:0 loss: 0.00000 lr: 0.012434 top1: 0.00000 top5: 0.00000 batch_cost: 24.02960 sec, reader_cost: 23.69901 sec, ips: 0.66585 instance/sec, eta: 0:51:56
[11/08 16:53:29] END epoch:2 train loss_avg: 0.00000 top1_avg: 0.00000 top5_avg: 0.00000 avg_batch_cost: 0.26960 sec, avg_reader_cost: 0.00036 sec, batch_cost_sum: 24.29920 sec, avg_ips: 1.31692 instance/sec.
[11/08 16:53:52] epoch:[ 3/196] train step:0 loss: 0.00000 lr: 0.014868 top1: 0.00000 top5: 0.00000 batch_cost: 22.49086 sec, reader_cost: 22.18557 sec, ips: 0.71140 instance/sec, eta: 0:29:01
[11/08 16:53:52] END epoch:3 train loss_avg: 0.00000 top1_avg: 0.00000 top5_avg: 0.00000 avg_batch_cost: 0.36894 sec, avg_reader_cost: 0.09945 sec, batch_cost_sum: 22.85980 sec, avg_ips: 1.39984 instance/sec.
[11/08 16:54:15] epoch:[ 4/196] train step:0 loss: 0.00000 lr: 0.017302 top1: 0.00000 top5: 0.00000 batch_cost: 22.29009 sec, reader_cost: 21.95662 sec, ips: 0.71781 instance/sec, eta: 0:20:26
[11/08 16:54:15] END epoch:4 train loss_avg: 0.00000 top1_avg: 0.00000 top5_avg: 0.00000 avg_batch_cost: 0.26816 sec, avg_reader_cost: 0.00040 sec, batch_cost_sum: 22.55826 sec, avg_ips: 1.41855 instance/sec.
[11/08 16:54:38] epoch:[ 5/196] train step:0 loss: 0.00000 lr: 0.019736 top1: 0.00000 top5: 0.00000 batch_cost: 23.43676 sec, reader_cost: 23.10259 sec, ips: 0.68269 instance/sec, eta: 0:16:37
[11/08 16:54:39] END epoch:5 train loss_avg: 0.00000 top1_avg: 0.00000 top5_avg: 0.00000 avg_batch_cost: 0.26950 sec, avg_reader_cost: 0.00036 sec, batch_cost_sum: 23.70625 sec, avg_ips: 1.34985 instance/sec.