val/obj loss and val/box loss keep raising in the training of yolo pose with coco dataset

RustyShackleford73 commented 1 year ago

niuhouxing commented 1 year ago

Hello, I also ran into this problem. Have you solved it?

zay95 commented 1 year ago

Hi, may I have a look at your lr and loss curves in TensorBoard? My curves look bad. I fine-tuned a model from yolov7-w6-pose.pt on the COCO dataset plus a custom dataset (about 1/4 of the data), but the bbox predictions are not good enough (a single person gets multiple bboxes), and the learning rate curve looks wrong.

The train loss and val loss do not look good either.

Hyperparameters:
lr0: 0.01
lrf: 0.1  # final OneCycleLR learning rate (lr0 * lrf)
momentum: 0.937  # SGD momentum/Adam beta1
weight_decay: 0.0005  # optimizer weight decay 5e-4
warmup_epochs: 3.0  # warmup epochs (fractions ok)
warmup_momentum: 0.8  # warmup initial momentum
warmup_bias_lr: 0.1  # warmup initial bias lr

Why is lr0 so large while lr1 is zero? -_-
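For anyone puzzled by the same curves, here is a minimal sketch of how a YOLOv5-style warm-up could produce them. It assumes this train.py follows that pattern: a cosine one-cycle factor going from 1 to lrf, plus a linear warm-up in which the bias parameter group starts at warmup_bias_lr and every other group starts at 0. The names `one_cycle_factor` and `scheduled_lr` are made up for illustration, and which TensorBoard curve belongs to the bias group depends on the order the groups were added to the optimizer.

```python
import math


def one_cycle_factor(epoch, epochs, lrf):
    """Cosine factor that goes from 1.0 at epoch 0 to lrf at the last epoch."""
    return ((1 - math.cos(epoch * math.pi / epochs)) / 2) * (lrf - 1) + 1


def scheduled_lr(ni, nw, epoch, epochs, lr0, lrf, warmup_bias_lr, is_bias_group):
    """Learning rate at global iteration ni (assumed YOLOv5-style logic).

    ni: iterations done so far, nw: number of warm-up iterations.
    """
    target = lr0 * one_cycle_factor(epoch, epochs, lrf)
    if ni >= nw:
        return target  # warm-up finished: plain cosine schedule
    # Linear interpolation from the warm-up start value to the target.
    start = warmup_bias_lr if is_bias_group else 0.0
    return start + (target - start) * ni / nw


# Values taken from the hyperparameters posted above; nw=1000 is an assumption.
hyp = dict(lr0=0.01, lrf=0.1, warmup_bias_lr=0.1)
for ni in (0, 500, 1000):
    bias = scheduled_lr(ni, 1000, 0, 300, is_bias_group=True, **hyp)
    other = scheduled_lr(ni, 1000, 0, 300, is_bias_group=False, **hyp)
    print(f"iter {ni:4d}  bias group lr={bias:.4f}  other groups lr={other:.4f}")
```

Under these assumptions one group starts at warmup_bias_lr and falls toward lr0 while the others start at 0 and rise, so a large value next to a zero value early in training would be expected rather than a bug.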

gadewegit commented 1 year ago

My lr1 looks like that too. Is your val loss rising?

zay95 commented 1 year ago

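@gadewegit Yes, the val loss curve is rising. [image] https://user-images.githubusercontent.com/33301898/226274571-847796b5-6530-4adc-99f5-e80269369646.png

And the lr curve: [image] https://user-images.githubusercontent.com/33301898/226274697-1dfaf4d8-fa9d-4126-aad9-938716908179.png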

gadewegit commented 1 year ago

What can be done about the rising val loss? I don't know why my lr2 looks like this.

zay95 commented 1 year ago

@gadewegit Hmm, I'm trying it out...

gadewegit commented 1 year ago

OK, I hope we can keep in touch.

zay95 commented 1 year ago

@gadewegit Hi, I fine-tuned a model on the COCO dataset plus a custom dataset (25% of the data, but only simple samples whose person backgrounds are not similar to the COCO data). So I trained only the head parameters of the model and changed the training process (removed the warm-up stage, modified the learning strategy, and lowered the initial learning rate). In addition, according to some issues in TexasInstruments/edgeai-yolov5, changing kps_loss and increasing the scale factor weight in the loss function can alleviate the problem. Now the shifted points no longer appear on either the COCO data or the custom data.
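Training only the head usually means freezing everything else. Below is a minimal sketch, assuming the model exposes `named_parameters()` the way a `torch.nn.Module` does and that head parameters can be picked out by a name prefix. The prefix `model.2.` and the toy classes are hypothetical stand-ins so the snippet runs without PyTorch; the real head index for yolov7-w6-pose is not given in this thread.

```python
def freeze_all_but(model, trainable_prefixes):
    """Freeze every parameter whose name does not start with a given prefix.

    Works on any object with named_parameters() yielding (name, param) pairs
    where param has a requires_grad attribute, e.g. a torch.nn.Module.
    Returns the names of the parameters left trainable.
    """
    trainable = []
    for name, param in model.named_parameters():
        keep = any(name.startswith(prefix) for prefix in trainable_prefixes)
        param.requires_grad = keep
        if keep:
            trainable.append(name)
    return trainable


class ToyParam:
    """Hypothetical stand-in for a torch parameter."""

    def __init__(self):
        self.requires_grad = True


class ToyModel:
    """Hypothetical stand-in for a detector: two backbone layers and a head."""

    def __init__(self):
        self.params = {
            "model.0.conv.weight": ToyParam(),
            "model.1.conv.weight": ToyParam(),
            "model.2.m.0.weight": ToyParam(),  # pretend this is the head
        }

    def named_parameters(self):
        return self.params.items()


toy = ToyModel()
left_trainable = freeze_all_but(toy, ["model.2."])
print(left_trainable)
```

With a real model it is common to then build the optimizer only from the parameters that still have `requires_grad` set to True.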

gadewegit commented 1 year ago

Oh, I'm glad to hear that you have solved some of the problems. Do the obj loss, box loss, and learning rate curves look normal? May I have a look at your curves in TensorBoard?

zay95 commented 1 year ago

@gadewegit This is the mAP@0.5:0.95 curve. Because of the pre-training, the mAP score is already close to converged at the start. 25% of the data is custom (with a totally different background), and I think the model learned from it.

This is the train loss curve. The val loss curve starts to converge after some epochs.

There is a problem with the lr curve in TensorBoard: the shape of the curve is right, but the values are not correct.

gadewegit commented 1 year ago

I'm glad to see that your val loss has converged, but my val loss still has problems. Could you give me some specific guidance? In addition, our lr curves are different. I hope you can help. Thanks!

zay95 commented 1 year ago

@gadewegit Could you attach your training command and your train and val curves?

gadewegit commented 1 year ago

This is what I get at the end of 300 epochs.
python train.py --kpt-label
Some of my training parameters are set in train.py. Everything is straight from GitHub, unchanged.

zay95 commented 1 year ago

@gadewegit github text format error

gadewegit commented 1 year ago

Do I need to download them all again?

gadewegit commented 1 year ago

These are the few epochs I just ran; the earlier problem is still there.
python train.py --kpt-label

gadewegit commented 1 year ago

These are the results of the 300 epochs I ran before. The precision curve fluctuates too much, while recall is a flat line, unlike the downward trend shown in TensorBoard. The val loss still trends upward. I can't solve this problem, and it has been bothering me for a long time. Thanks!

zay95 commented 1 year ago

@gadewegit What are your training hyperparameters and command? For example:
python train.py --data data/coco_kpts.yaml --cfg cfg/yolov7-w6-pose.yaml --weights weights/yolov7-w6-pose.pt --batch-size 16 --img 640 --kpt-label --device 0 --name aaaa --hyp data/hyp.pose.yaml --epochs 300 --workers 8

gadewegit commented 1 year ago

python train.py --data data/coco_kpts.yaml --cfg cfg/yolov7-w6-pose.yaml --weights weights/yolov7-w6-pose.pt --batch-size 8 --img 640 --device 0 --hyp data/hyp.pose.yaml --epochs 300 --workers 8 --name exp --kpt-label
Basically unchanged; everything was downloaded from GitHub and used directly.

zay95 commented 1 year ago

I suggest removing the warm-up stage and setting the learning rate to 1e-4 or 1e-5. Training only the head of the model may also work better.
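A rough sketch of what that suggestion amounts to, under two assumptions about train.py that are not confirmed in this thread: the warm-up length is computed YOLOv5-style with a floor of 1000 iterations, and the schedule after warm-up is a cosine decay from lr0 to lr0 * lrf. If the floor is there, setting warmup_epochs to 0 in the hyp file would not remove warm-up on its own, which would explain why the code itself has to be edited. The function names are illustrative only.

```python
import math


def warmup_iterations(warmup_epochs, batches_per_epoch, floor=1000):
    """Warm-up length in iterations (assumed form: max(epochs * batches, floor))."""
    return max(round(warmup_epochs * batches_per_epoch), floor)


def lr_no_warmup(epoch, epochs, lr0, lrf):
    """Cosine decay from lr0 (epoch 0) to lr0 * lrf (last epoch), no warm-up."""
    factor = ((1 - math.cos(epoch * math.pi / epochs)) / 2) * (lrf - 1) + 1
    return lr0 * factor


# With the assumed floor, warmup_epochs=0 still leaves 1000 warm-up iterations.
# batches_per_epoch=500 is an arbitrary illustrative value.
print(warmup_iterations(0.0, batches_per_epoch=500))

# Suggested lower starting rate, decayed over 300 epochs with no warm-up.
for epoch in (0, 150, 300):
    print(epoch, lr_no_warmup(epoch, 300, lr0=1e-4, lrf=0.1))
```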

gadewegit commented 1 year ago

Is this how to change the warm-up stage and learning rate?

zay95 commented 1 year ago

@gadewegit You can set lr0: 0.001 and change train.py as below:

gadewegit commented 1 year ago

OK, I'll try that next. Thanks again!

zhangYQHBAU commented 4 months ago

I have run into the same problem. Have you solved it?

WongKinYiu / yolov7

val/obj loss and val/box loss keep raising in the training of yolo pose with coco dataset #1361