aim-uofa / AdelaiDet

AdelaiDet is an open source toolbox for multiple instance-level detection and recognition tasks.
https://git.io/AdelaiDet

CondInst gives bbox AP 15.0 and segm AP 0.0063 on the evaluation set after 3K training iterations! #289

Closed zeeshanalipanhwar closed 3 years ago

zeeshanalipanhwar commented 3 years ago

Hi, I trained CondInst on a custom dataset for 3,000 iterations (the log below reports `iter: 2999`).

Training losses:

[01/08 15:07:24 d2.utils.events]:  eta: 0:00:00  iter: 2999  total_loss: 1.592  loss_fcos_cls: 0.2424  loss_fcos_loc: 0.2422  loss_fcos_ctr: 0.6233  loss_mask: 0.4742  time: 2.2331  data_time: 0.0964  lr: 1e-05  max_mem: 8088M

Validation results:

[01/08 15:10:05 d2.evaluation.testing]: copypaste: Task: bbox
[01/08 15:10:05 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[01/08 15:10:05 d2.evaluation.testing]: copypaste: 15.0411,29.3651,13.9482,14.4293,15.1947,nan
[01/08 15:10:05 d2.evaluation.testing]: copypaste: Task: segm
[01/08 15:10:05 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[01/08 15:10:05 d2.evaluation.testing]: copypaste: 0.0063,0.0366,0.0003,0.0065,0.1394,nan

The model does not predict any bounding box or segmentation mask for any inference sample I give it. What could have gone wrong?
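If nothing shows up at inference, one quick sanity check is the score threshold being applied: AdelaiDet's FCOS head evaluates at a low test-time threshold (the `MODEL.FCOS.INFERENCE_TH_TEST` config key, if I read the configs correctly), while demo/visualization scripts typically filter at a much higher confidence, so an undertrained model can look like it predicts nothing at all. A minimal sketch of that filtering logic, with made-up scores:

```python
# Sketch: how a confidence threshold can hide all predictions.
# The scores below are hypothetical, not taken from the run above.
def filter_by_score(predictions, threshold):
    """Keep only predictions whose score meets the threshold."""
    return [p for p in predictions if p["score"] >= threshold]

preds = [
    {"box": [10, 10, 50, 50], "score": 0.12},
    {"box": [60, 20, 90, 80], "score": 0.07},
]

# At a demo-style threshold of 0.3 nothing survives ...
assert filter_by_score(preds, 0.3) == []
# ... while at a low evaluation-style threshold both remain.
assert len(filter_by_score(preds, 0.05)) == 2
```

If lowering the threshold surfaces (bad) predictions, the model is running but undertrained; if nothing appears even at a very low threshold, it is worth checking which checkpoint and config the inference script is actually loading.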

Selective logs for reference:

[01/08 15:07:23 fvcore.common.checkpoint]: Saving checkpoint to training_dir/CondInst_MS_R_50_1x/model_final.pth
[01/08 15:07:24 d2.utils.events]:  eta: 0:00:00  iter: 2999  total_loss: 1.592  loss_fcos_cls: 0.2424  loss_fcos_loc: 0.2422  loss_fcos_ctr: 0.6233  loss_mask: 0.4742  time: 2.2331  data_time: 0.0964  lr: 1e-05  max_mem: 8088M
[01/08 15:07:25 d2.engine.hooks]: Overall training speed: 2997 iterations in 1:51:34 (2.2338 s / it)
[01/08 15:07:25 d2.engine.hooks]: Total training time: 1:51:38 (0:00:03 on hooks)
.
.
WARNING [01/08 15:07:27 d2.evaluation.coco_evaluation]: COCO Evaluator instantiated using config, this is deprecated behavior. Please pass tasks in directly
[01/08 15:07:30 d2.evaluation.evaluator]: Inference done 11/512. 0.0943 s / img. ETA=0:01:04
[01/08 15:07:35 d2.evaluation.evaluator]: Inference done 49/512. 0.0988 s / img. ETA=0:01:01
[01/08 15:07:41 d2.evaluation.evaluator]: Inference done 89/512. 0.0965 s / img. ETA=0:00:55
[01/08 15:07:46 d2.evaluation.evaluator]: Inference done 129/512. 0.0958 s / img. ETA=0:00:49
[01/08 15:07:51 d2.evaluation.evaluator]: Inference done 168/512. 0.0956 s / img. ETA=0:00:44
[01/08 15:07:56 d2.evaluation.evaluator]: Inference done 207/512. 0.0955 s / img. ETA=0:00:39
[01/08 15:08:01 d2.evaluation.evaluator]: Inference done 246/512. 0.0955 s / img. ETA=0:00:34
[01/08 15:08:06 d2.evaluation.evaluator]: Inference done 284/512. 0.0960 s / img. ETA=0:00:29
[01/08 15:08:11 d2.evaluation.evaluator]: Inference done 323/512. 0.0958 s / img. ETA=0:00:24
[01/08 15:08:16 d2.evaluation.evaluator]: Inference done 362/512. 0.0956 s / img. ETA=0:00:19
[01/08 15:08:21 d2.evaluation.evaluator]: Inference done 401/512. 0.0955 s / img. ETA=0:00:14
[01/08 15:08:26 d2.evaluation.evaluator]: Inference done 441/512. 0.0953 s / img. ETA=0:00:09
[01/08 15:08:31 d2.evaluation.evaluator]: Inference done 480/512. 0.0952 s / img. ETA=0:00:04
[01/08 15:08:36 d2.evaluation.evaluator]: Total inference time: 0:01:05.932470 (0.130044 s / img per device, on 1 devices)
[01/08 15:08:36 d2.evaluation.evaluator]: Total inference pure compute time: 0:00:48 (0.095079 s / img per device, on 1 devices)
[01/08 15:08:36 d2.evaluation.coco_evaluation]: Preparing results for COCO format ...
[01/08 15:08:36 d2.evaluation.coco_evaluation]: Saving results to training_dir/CondInst_MS_R_50_1x/inference/coco_instances_results.json
[01/08 15:08:36 d2.evaluation.coco_evaluation]: Evaluating predictions with unofficial COCO API...
Loading and preparing results...
DONE (t=0.03s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
COCOeval_opt.evaluate() finished in 0.51 seconds.
Accumulating evaluation results...
COCOeval_opt.accumulate() finished in 0.09 seconds.
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.150
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.294
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.139
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.144
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.152
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.027
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.159
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.268
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.263
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.299
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
[01/08 15:08:37 d2.evaluation.coco_evaluation]: Evaluation results for bbox: 
|   AP   |  AP50  |  AP75  |  APs   |  APm   |  APl  |
|:------:|:------:|:------:|:------:|:------:|:-----:|
| 15.041 | 29.365 | 13.948 | 14.429 | 15.195 |  nan  |
[01/08 15:08:37 d2.evaluation.coco_evaluation]: Some metrics cannot be computed and is shown as NaN.
.
.
Loading and preparing results...
DONE (t=0.38s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *segm*
COCOeval_opt.evaluate() finished in 2.07 seconds.
Accumulating evaluation results...
COCOeval_opt.accumulate() finished in 0.09 seconds.
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.001
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.004
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.005
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.002
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
[01/08 15:08:41 d2.evaluation.coco_evaluation]: Evaluation results for segm: 
|  AP   |  AP50  |  AP75  |  APs  |  APm  |  APl  |
|:-----:|:------:|:------:|:-----:|:-----:|:-----:|
| 0.006 | 0.037  | 0.000  | 0.006 | 0.139 |  nan  |
[01/08 15:08:41 d2.evaluation.coco_evaluation]: Some metrics cannot be computed and is shown as NaN.
.
.
[01/08 15:08:41 d2.evaluation.testing]: copypaste: Task: bbox
[01/08 15:08:41 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[01/08 15:08:41 d2.evaluation.testing]: copypaste: 15.0411,29.3651,13.9482,14.4293,15.1947,nan
[01/08 15:08:41 d2.evaluation.testing]: copypaste: Task: segm
[01/08 15:08:41 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[01/08 15:08:41 d2.evaluation.testing]: copypaste: 0.0063,0.0366,0.0003,0.0065,0.1394,nan
[01/08 15:08:41 d2.utils.events]:  eta: 0:00:00  iter: 2999  total_loss: 1.592  loss_fcos_cls: 0.2424  loss_fcos_loc: 0.2422  loss_fcos_ctr: 0.6233  loss_mask: 0.4742  time: 2.2331  data_time: 0.0964  lr: 1e-05  max_mem: 8088M
zeeshanalipanhwar commented 3 years ago

Could some configuration option have been set incorrectly?
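For a custom dataset, a few settings are worth double-checking, since leaving them at their COCO defaults commonly produces near-zero AP. The sketch below follows AdelaiDet/detectron2 conventions as I understand them; the dataset names, file paths, and class count are placeholders, not values from this thread:

```python
# Hypothetical config sketch: fields that commonly cause near-zero AP on a
# custom dataset when left at their COCO defaults. Names/paths are placeholders.
from adet.config import get_cfg
from detectron2.data.datasets import register_coco_instances

register_coco_instances("my_train", {}, "path/to/train.json", "path/to/train_images")
register_coco_instances("my_val", {}, "path/to/val.json", "path/to/val_images")

cfg = get_cfg()
cfg.merge_from_file("configs/CondInst/MS_R_50_1x.yaml")
cfg.DATASETS.TRAIN = ("my_train",)
cfg.DATASETS.TEST = ("my_val",)
cfg.MODEL.FCOS.NUM_CLASSES = 5   # must match the custom dataset, not COCO's 80
cfg.MODEL.MASK_ON = True         # CondInst needs mask training enabled
```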

czq693497091 commented 3 years ago

I have the same problem. When training CondInst, the loss is always unstable: after 90k iterations of R_50_1x training, the loss is still around 2.0 and the model cannot detect any targets on the COCO dataset.

zeeshanalipanhwar commented 3 years ago

One possible reason could be the way we are loading the data. I am not sure where I am going wrong yet.
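If data loading is the suspect, the annotations themselves can be sanity-checked first: empty or malformed `segmentation` fields in a COCO-format JSON would let boxes train while masks degenerate, which would be consistent with a bbox AP of 15 next to a segm AP near zero. A minimal check in plain Python (the sample dict is made up):

```python
# Sketch: sanity-check a COCO-format annotation dict before training.
def check_coco_annotations(coco):
    """Return ids of annotations with empty segmentations or unknown categories."""
    bad = []
    valid_cats = {c["id"] for c in coco["categories"]}
    for ann in coco["annotations"]:
        seg = ann.get("segmentation")
        empty = not seg or (isinstance(seg, list) and not any(seg))
        if empty or ann["category_id"] not in valid_cats:
            bad.append(ann["id"])
    return bad

sample = {
    "categories": [{"id": 1, "name": "cell"}],
    "annotations": [
        {"id": 1, "category_id": 1, "segmentation": [[0, 0, 10, 0, 10, 10]]},
        {"id": 2, "category_id": 1, "segmentation": []},  # empty -> flagged
    ],
}
assert check_coco_annotations(sample) == [2]
```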

czq693497091 commented 3 years ago

Thanks for your reply. Did you solve this problem? When training on COCO, I had to use IMS_PER_BATCH=1 because of memory limits, which makes the loss very unstable. I am now trying to increase the batch size by lowering MAX_PROPOSAL from 500 to 200 and cropping the images, then training on two GPUs. Do you have any suggestions?
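On the batch-size point: detectron2-style configs are usually tuned for a reference batch size, and the linear scaling rule suggests lowering BASE_LR proportionally when IMS_PER_BATCH shrinks. The reference values below (16 images, LR 0.01) are my assumption about a typical R_50_1x config, not values taken from this thread:

```python
# Sketch of the linear LR scaling rule often used with detectron2-style configs.
# Reference values (16 images, LR 0.01) are assumed, not from this thread.
REF_IMS_PER_BATCH = 16
REF_BASE_LR = 0.01

def scaled_lr(ims_per_batch):
    """Scale the learning rate linearly with the actual batch size."""
    return REF_BASE_LR * ims_per_batch / REF_IMS_PER_BATCH

# With IMS_PER_BATCH = 1, keeping the reference LR means training with a
# rate 16x too large for the batch, which by itself can destabilize the loss.
assert abs(scaled_lr(1) - 0.000625) < 1e-12
assert abs(scaled_lr(4) - 0.0025) < 1e-12
```

Under this rule, going from 1 to 2 GPUs (doubling the effective batch) would also go with doubling the learning rate, so the two changes should be made together.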


zeeshanalipanhwar commented 3 years ago

I did not solve the problem, and it has been sitting there for the last two months. :)