Overfitting issue - Githubissues

johannes-tum commented 3 months ago

Issue Description: I tried to train the network on another, private dataset, starting by overfitting on a single image. I noticed that many optimizer steps were skipped because of invalid gradients. As a consequence, the network did not really converge even after 500 epochs. Once I added this block

        for p in self.model.parameters():
            if p.grad is None:
                # Parameter received no gradient: treat it as all zeros.
                p.grad = torch.zeros_like(p)
            else:
                # Zero out NaN entries so the optimizer step is not skipped.
                is_nan = torch.isnan(p.grad)
                p.grad[is_nan] = 0.0

after self.scaler.scale(loss).backward() it worked better. But I guess there must be a better way than this.
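For anyone hitting the same symptom: `torch.cuda.amp.GradScaler` skips `optimizer.step()` whenever it finds inf or NaN values in the gradients, so counting the bad entries can help confirm where the skipped steps come from. The following is only a minimal sketch of the workaround above as a reusable function. The name `sanitize_gradients` and the toy `nn.Linear` model are assumptions for illustration, not part of the repository. Unlike the block above, it leaves `None` gradients untouched. It hides the symptom rather than fixing the cause.

```python
import torch
from torch import nn


def sanitize_gradients(model: nn.Module) -> int:
    """Zero NaN gradient entries in place and return how many were replaced.

    Hypothetical helper for illustration; not part of the YOLO repository.
    """
    replaced = 0
    for p in model.parameters():
        if p.grad is None:
            # No gradient was computed for this parameter; leave it alone.
            continue
        nan_mask = torch.isnan(p.grad)
        replaced += int(nan_mask.sum())
        p.grad.masked_fill_(nan_mask, 0.0)
    return replaced


# Toy demonstration: inject a single NaN into an otherwise valid gradient.
model = nn.Linear(3, 1)
loss = model(torch.ones(2, 3)).sum()
loss.backward()
model.weight.grad[0, 0] = float("nan")

num_replaced = sanitize_gradients(model)
print(f"replaced {num_replaced} NaN gradient entries")
```

In a mixed-precision loop this would be called right after `self.scaler.scale(loss).backward()`. A count that stays above zero points to a problem in the loss computation itself.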

henrytsui000 commented 3 months ago

Good day!

Thank you for mentioning the issue and providing your solution. I believe this situation is caused by a bug in the loss calculation. I forgot to detach the predicted tensor when the BoxMatcher was finding the corresponding bbox. This occurs in https://github.com/WongKinYiu/YOLO/blob/868c821de803cf5cfbf3e5d7d48571fc3015616e/yolo/tools/loss_functions.py#L91
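To illustrate the kind of fix described here, the sketch below shows a matcher that computes a per-prediction score from a detached copy of the predictions, so that the matching step contributes no gradient of its own. It is not the actual `BoxMatcher` code: the function `match_weights`, the distance-based score, and the toy tensors are all assumptions made for illustration.

```python
import torch


def match_weights(pred_boxes: torch.Tensor, target_boxes: torch.Tensor) -> torch.Tensor:
    """Hypothetical matcher: a similarity score per prediction, used as a loss weight.

    The predictions are detached first, so the returned scores act as
    constants during backpropagation.
    """
    pred = pred_boxes.detach()
    l1_distance = (pred - target_boxes).abs().sum(dim=1)
    return torch.exp(-l1_distance)


# Toy data: two predicted points and their targets (values are made up).
pred_boxes = torch.tensor([[0.1, 0.2], [0.8, 0.9]], requires_grad=True)
target_boxes = torch.tensor([[0.0, 0.0], [1.0, 1.0]])

weights = match_weights(pred_boxes, target_boxes)

# Gradients reach pred_boxes only through the regression term, not the weights.
per_box_loss = (pred_boxes - target_boxes).abs().sum(dim=1)
loss = (weights * per_box_loss).sum()
loss.backward()
```

Without the `detach()` call, `weights` would carry its own gradient path back to the predictions, which is one way a matching step can end up influencing the loss gradients unintentionally.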

I have fixed these bugs in commit 4775b4c6b1040e41ab38fe35a51099dcb9299417, but I'm not entirely sure if everything is resolved. I tried training the model on a small dataset, and it seems to be working correctly now. However, some data augmentations are still under development.

I strongly recommend training with the original YOLOv9 repo to avoid wasting GPU resources. I will release version 1.0 once most of the code is complete.

Kind regards, Henry Tsui

johannes-tum commented 3 months ago

All right! Thanks! I know these things are hard to predict, but do you have a rough time frame in mind when v1 might be ready?

WongKinYiu / YOLO

Overfitting issue #48