WongKinYiu / yolov7

Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

Validation kpt loss stays constant #1949

Open kurenai0413 opened 8 months ago

kurenai0413 commented 8 months ago

Recently I tried to print out the validation loss while training my customized pose model on the MS COCO dataset, and noticed that the kpt loss stays constant across epochs, while other losses such as box or obj behave normally.

So I went back to the original pose branch and found that the kpt loss on the validation set behaves the same way there.

[attached plot: validation_loss]

Here is my training command:

train.py --epochs 10 --data data/coco_kpts.yaml --cfg cfg/yolov7-w6-pose.yaml --batch-size 8 --img 960 --kpt-label --sync-bn --device 0 --name yolov7-w6-pose --hyp data/hyp.pose.yaml

Code modified in test.py at line 152 to print the validation loss:

loss += compute_loss([x.float() for x in train_out], targets)[1][:6]

And added before the plotting code to print the loss:

print(('\n' + '%10s' * 7) % ('_', 'box', 'obj', 'cls', 'kpt', 'kptv', 'total'))
print(('%10s' * 1 + '%10.4g' * 6) % ('val', *(loss.cpu() / len(dataloader)).tolist()))
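
For reference, a self-contained sketch of this printing logic, assuming compute_loss returns its six loss components in the order (box, obj, cls, kpt, kptv, total); the dummy tensor below just stands in for the per-batch losses accumulated in the loop:

import torch

# Dummy stand-in for the running sum that test.py builds up via
# compute_loss([x.float() for x in train_out], targets)[1][:6].
num_batches = 4
loss = torch.zeros(6)
for _ in range(num_batches):
    loss += torch.tensor([0.05, 0.9, 0.01, 2.6, 0.08, 3.64])

# Print the per-batch average of the six components, as in the
# modification above.
print(('\n' + '%10s' * 7) % ('_', 'box', 'obj', 'cls', 'kpt', 'kptv', 'total'))
print(('%10s' * 1 + '%10.4g' * 6) % ('val', *(loss / num_batches).tolist()))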

Any help is greatly appreciated.

mosama182 commented 2 months ago

I am facing the same issue. Did you figure out what the problem was?

Ethan-Lee-Sunghoon commented 1 month ago

I'm also facing the same issue.

I've printed all of the elements of the validation loss.

In the loss.py file, lkpt is always the same at every validation step.

This is because the distance d in the OKS loss is huge during validation.

The target keypoints have small-scale values compared to the validation prediction values, for example:

Target keypoints (x): tensor(3.91011, device='cuda:0')
Predicted keypoints (x): tensor(718., device='cuda:0')

This mismatch leads to a large d value, which drives the exponential to zero.

Distance (d): tensor(659069.75000, device='cuda:0')

# oks based loss
d = (pkpt_x - tkpt[i][:, 0::2])**2 + (pkpt_y - tkpt[i][:, 1::2])**2
s = torch.prod(tbox[i][:, -2:], dim=1, keepdim=True)
kpt_loss_factor = (torch.sum(kpt_mask != 0) + torch.sum(kpt_mask == 0)) / torch.sum(kpt_mask != 0)
lkpt += kpt_loss_factor * ((1 - torch.exp(-d / (s * (4 * sigmas**2) + 1e-9))) * kpt_mask).mean()
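
Here is a minimal numeric check of that saturation; the sigmas, box area s, and kpt_mask below are made-up stand-ins, and only the validation distance mirrors the printout above:

import torch

sigmas = torch.full((17,), 0.05)   # assumed uniform keypoint sigmas
s = torch.tensor([[12.0]])         # assumed target box area in grid units
kpt_mask = torch.ones(1, 17)       # assume all 17 keypoints are labeled

for name, d in [('val', torch.full((1, 17), 659069.75)),
                ('train', torch.full((1, 17), 0.05))]:
    term = (1 - torch.exp(-d / (s * (4 * sigmas**2) + 1e-9))) * kpt_mask
    print(name, term.mean().item())

# val   -> 1.0 exactly: exp(-d/...) underflows to 0, so lkpt collapses to
#          kpt_loss_factor * kpt_mask.mean(), a constant independent of the
#          predictions
# train -> ~0.34: at a training-like scale the loss still responds to the
#          predictions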

This is what I found so far.