Closed: vineetsharma14 closed this issue 1 year ago
Hi @vineetsharma14, I suggest validating your model on a val set before tuning any hyper-parameters.
For example, at a batch size of 1, the starting total loss was 87, which reduced to around 13 within 8,000 iterations. But after that the training loss oscillates between 9 and 28.
This is not uncommon. The range is wider than expected, but it could be due to your dataset. I cannot comment further without knowing the validation results. Any hyper-parameter tuning also depends on the number of classes in your dataset.
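For reference, with a detectron2-based setup like OneFormer you can get periodic validation numbers during training; below is a minimal sketch, where the dataset name and eval period are placeholders and the trainer hook may differ from the repo's own training script.

from detectron2.config import get_cfg
from detectron2.evaluation import COCOEvaluator

# In practice, build cfg the same way the repo's training script does;
# get_cfg() here only keeps the sketch self-contained.
cfg = get_cfg()
cfg.DATASETS.TEST = ("my_custom_val",)   # placeholder: your registered val split
cfg.TEST.EVAL_PERIOD = 2000              # run evaluation every 2000 iterations

def build_evaluator(cfg, dataset_name, output_folder=None):
    # COCO-style instance-segmentation metrics (AP) on the validation split;
    # plug this into the trainer's build_evaluator hook.
    return COCOEvaluator(dataset_name, output_dir=output_folder)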
Thanks @praeclarumjj3 for the guidance. Really appreciate it !
I will check the dataset.
Hi @vineetsharma14 ,
Were you able to successfully reduce the training loss after fine-tuning? I am seeing the same pattern in my fine-tuning experiments. May I know how you are setting the learning rate for the text mapper while fine-tuning?
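In my own runs I have been doing it roughly like this: plain PyTorch parameter groups rather than the repo's optimizer builder, and "text_mapper" is only my guess at the module name, so check the actual model definition first.

import torch

def build_optimizer_with_text_lr(model, base_lr=1e-4, text_lr=1e-5):
    # Split parameters so the text-mapper branch gets its own learning rate.
    text_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (text_params if "text_mapper" in name else other_params).append(param)
    return torch.optim.AdamW(
        [{"params": other_params, "lr": base_lr},
         {"params": text_params, "lr": text_lr}],
        weight_decay=0.05,
    )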
Hello There,
Thanks for sharing the amazing work!
I have been experimenting with the OneFormer repo for the past few days, and I am able to run training (fine-tuning) for instance segmentation on a custom dataset on 1 GPU (Tesla T4) by reducing the image size to 512.
The following are the changes I have made to my configuration.
cfg.INPUT.IMAGE_SIZE = 512
cfg.SOLVER.IMS_PER_BATCH = 1  # even 16 works
cfg.MODEL.ROI_HEADS.NUM_CLASSES = <Number Of Classes In My Dataset>
cfg.MODEL.RETINANET.NUM_CLASSES = <Number Of Classes In My Dataset>
cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES = <Number Of Classes In My Dataset>
cfg.SOLVER.MAX_ITER = 40000
with the default base learning rate of 0.0001
COCO DiNAT configuration file: oneformer_dinat_large_bs16_100ep.yaml
Model weights: 150_16_dinat_l_oneformer_coco_100ep.pth
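Roughly, my setup loads them like this (just a sketch: the yaml path depends on where it lives in your checkout, and OneFormer's extra config keys have to be registered before merging, as the repo's training script does).

from detectron2.config import get_cfg

cfg = get_cfg()
# OneFormer-specific keys must be added to cfg first (the repo's training
# script does this), otherwise merge_from_file will reject them.
cfg.merge_from_file("oneformer_dinat_large_bs16_100ep.yaml")  # path to the yaml in the repo
cfg.MODEL.WEIGHTS = "150_16_dinat_l_oneformer_coco_100ep.pth"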
My dataset has approx 10,000 images in the train set.
I found the training settings you used in the Appendix of the paper: a batch size of 16 for around 90K or more iterations, depending on the dataset.
I have trained the model with varying batch sizes, but I observe that the total loss stops decreasing after a few thousand iterations.
For example, at a batch size of 1, the starting total loss was 87, which reduced to around 13 within 8,000 iterations. But after that the training loss oscillates between 9 and 28.
So, with this observation, what do you recommend?
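For context, the rough arithmetic I had in mind when varying the batch size is the usual linear-scaling rule of thumb (my own back-of-the-envelope, not something the paper prescribes).

# Paper settings: batch 16, ~90K iterations, base LR 1e-4 (see Appendix).
ref_batch, ref_iters, ref_lr = 16, 90_000, 1e-4
new_batch = 1

scale = new_batch / ref_batch
scaled_lr = ref_lr * scale             # 6.25e-06
scaled_iters = int(ref_iters / scale)  # 1,440,000 iterations to see the same amount of data
print(scaled_lr, scaled_iters)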
Thanks for the help !