PaddlePaddle / PaddleVideo

Awesome video understanding toolkits based on PaddlePaddle. It supports video data annotation tools, lightweight RGB and skeleton-based action recognition models, and practical applications for video tagging and sport action detection.
Apache License 2.0
1.46k stars, 374 forks

SlowFast training problem: no validation is run during training, and the training loss is 0.00000 #652

Open Bradly-s opened 7 months ago

Bradly-s commented 7 months ago

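SlowFast training problem:

  1. No validation is run during training, and the training loss is 0.00000.
  2. GPU memory usage is normal, but GPU utilization changes slowly and stays at 0 for long periods.

Part of the training log follows: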

[11/08 16:52:02] Training in fp32 mode.
[11/08 16:52:25] epoch:[ 1/196] train step:0 loss: 0.00000 lr: 0.010000 top1: 0.00000 top5: 0.00000 batch_cost: 23.42944 sec, reader_cost: 22.06920 sec, ips: 0.68290 instance/sec, eta: 2:32:41
[11/08 16:52:25] END epoch:1 train loss_avg: 0.00000 top1_avg: 0.00000 top5_avg: 0.00000 avg_batch_cost: 0.26917 sec, avg_reader_cost: 0.00035 sec, batch_cost_sum: 23.69861 sec, avg_ips: 1.35029 instance/sec.
[11/08 16:52:45] Computing precise BN 1 / 2...
[11/08 16:52:46] Computing precise BN 2 / 2...
[11/08 16:53:05] epoch:[ 1/196] val step:0 loss: 0.00000 top1: 0.00000 top5: 0.00000 batch_cost: 18.35165 sec, reader_cost: 0.00000 sec, ips: 0.87186 instance/sec.
[11/08 16:53:05] END epoch:1 val loss_avg: 0.00000 top1_avg: 0.00000 top5_avg: 0.00000 avg_batch_cost: 0.01305 sec, avg_reader_cost: 0.00000 sec, batch_cost_sum: 18.36470 sec, avg_ips: 1.74247 instance/sec.
[11/08 16:53:29] epoch:[ 2/196] train step:0 loss: 0.00000 lr: 0.012434 top1: 0.00000 top5: 0.00000 batch_cost: 24.02960 sec, reader_cost: 23.69901 sec, ips: 0.66585 instance/sec, eta: 0:51:56
[11/08 16:53:29] END epoch:2 train loss_avg: 0.00000 top1_avg: 0.00000 top5_avg: 0.00000 avg_batch_cost: 0.26960 sec, avg_reader_cost: 0.00036 sec, batch_cost_sum: 24.29920 sec, avg_ips: 1.31692 instance/sec.
[11/08 16:53:52] epoch:[ 3/196] train step:0 loss: 0.00000 lr: 0.014868 top1: 0.00000 top5: 0.00000 batch_cost: 22.49086 sec, reader_cost: 22.18557 sec, ips: 0.71140 instance/sec, eta: 0:29:01
[11/08 16:53:52] END epoch:3 train loss_avg: 0.00000 top1_avg: 0.00000 top5_avg: 0.00000 avg_batch_cost: 0.36894 sec, avg_reader_cost: 0.09945 sec, batch_cost_sum: 22.85980 sec, avg_ips: 1.39984 instance/sec.
[11/08 16:54:15] epoch:[ 4/196] train step:0 loss: 0.00000 lr: 0.017302 top1: 0.00000 top5: 0.00000 batch_cost: 22.29009 sec, reader_cost: 21.95662 sec, ips: 0.71781 instance/sec, eta: 0:20:26
[11/08 16:54:15] END epoch:4 train loss_avg: 0.00000 top1_avg: 0.00000 top5_avg: 0.00000 avg_batch_cost: 0.26816 sec, avg_reader_cost: 0.00040 sec, batch_cost_sum: 22.55826 sec, avg_ips: 1.41855 instance/sec.
[11/08 16:54:38] epoch:[ 5/196] train step:0 loss: 0.00000 lr: 0.019736 top1: 0.00000 top5: 0.00000 batch_cost: 23.43676 sec, reader_cost: 23.10259 sec, ips: 0.68269 instance/sec, eta: 0:16:37
[11/08 16:54:39] END epoch:5 train loss_avg: 0.00000 top1_avg: 0.00000 top5_avg: 0.00000 avg_batch_cost: 0.26950 sec, avg_reader_cost: 0.00036 sec, batch_cost_sum: 23.70625 sec, avg_ips: 1.34985 instance/sec.

westfish commented 5 months ago

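Hello. According to the log you provided, the training loss is 0.00000. Taken at face value, this would mean the model already fits the training data perfectly, which is usually impossible. Possible causes include: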

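Learning rate too high: if the learning rate is set too high, gradients may explode and the loss may become NaN or 0. Try lowering the learning rate to see whether that resolves the problem.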

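Data problems: if the dataset has problems, such as incorrect labels or a faulty preprocessing step, the model may be unable to learn correctly. Check the dataset and make sure the data and labels are correct.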
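One quick way to start checking the labels is to count how often each label appears in the annotation list. The sketch below is only an illustration: it assumes each line of the list is whitespace-separated with the label as the last field, and the helper name `label_histogram` and the sample lines are made up for this example rather than taken from PaddleVideo.

```python
from collections import Counter


def label_histogram(lines, label_column=-1):
    """Count labels in an annotation list.

    Assumption: each non-empty line is whitespace-separated and the
    label sits in `label_column` (last field by default).
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts[fields[label_column]] += 1
    return counts


# Hypothetical annotation lines, for illustration only.
sample = ["videos/a.mp4 0", "videos/b.mp4 1", "", "videos/c.mp4 0"]
hist = label_histogram(sample)
for label, count in sorted(hist.items()):
    print(f"label {label}: {count} samples")
if len(hist) < 2:
    print("warning: only one distinct label found")
```

If the histogram shows a single label, or labels outside the range the model's configured number of classes expects, that is worth fixing before looking at anything else.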

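Model problems: if the model structure or initialization is faulty, the loss may also be 0. Check the model and make sure its structure and initialization are correct.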

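In addition, you mentioned that GPU utilization changes slowly and stays at 0 for long periods. This may be because data loading cannot keep up with the model's training speed. Try optimizing the data loading pipeline, for example by using a more efficient preprocessing library or by loading data in parallel with multiple threads.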
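The log above already reports `batch_cost` and `reader_cost` for each training step, so the share of step time spent waiting on data can be read off directly. A minimal sketch, assuming log lines in the format shown in this issue; the function name `reader_fraction` is made up for this example:

```python
import re

# Matches the per-step fields as they appear in the log in this issue.
STEP_RE = re.compile(r"batch_cost: ([\d.]+) sec, reader_cost: ([\d.]+) sec")


def reader_fraction(log_line):
    """Return reader_cost / batch_cost for a step line, or None if absent."""
    match = STEP_RE.search(log_line)
    if match is None:
        return None
    batch_cost = float(match.group(1))
    reader_cost = float(match.group(2))
    if batch_cost <= 0:
        return None
    return reader_cost / batch_cost


# First training step from the log in this issue.
line = (
    "[11/08 16:52:25] epoch:[ 1/196] train step:0 loss: 0.00000 lr: 0.010000 "
    "top1: 0.00000 top5: 0.00000 batch_cost: 23.42944 sec, "
    "reader_cost: 22.06920 sec, ips: 0.68290 instance/sec, eta: 2:32:41"
)
frac = reader_fraction(line)
print(f"share of step time spent in the reader: {frac:.0%}")
```

For the first step in the log, about 94% of the step time is reader time, which is consistent with the GPU sitting idle while it waits for data.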

Also, if you need image or video understanding and generation, you can use our new cross-modal toolkit: https://github.com/PaddlePaddle/PaddleMIX/tree/develop