Closed JayFu closed 4 years ago
As you know, we do not help users train models.
My suggestion is to use configs that are known to work (e.g., the official ones, or the one used in the official tutorial) rather than making random changes. Changes such as `cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 1` are simply bad.
Oh, OK, thanks anyway. Can you tell me any possible reason for this?
Using `cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 1` is one possible reason.
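To make that point concrete: in detectron2, `ROI_HEADS.BATCH_SIZE_PER_IMAGE` controls how many region proposals per image are sampled to train the box/mask heads, and its default is 512. A back-of-the-envelope sketch (plain Python arithmetic, not detectron2 code) of how much supervision each setting gives per iteration:

```python
# Rough sketch, not detectron2 code: how many sampled region proposals
# feed the ROI-head (box/mask) losses in one training iteration.
def roi_samples_per_iter(ims_per_batch: int, batch_size_per_image: int) -> int:
    """Each image contributes up to `batch_size_per_image` sampled
    proposals (positives + negatives) to the ROI-head losses."""
    return ims_per_batch * batch_size_per_image

default = roi_samples_per_iter(2, 512)  # detectron2 default: 512 per image
broken = roi_samples_per_iter(2, 1)     # the setting used in this issue

print(default, broken)  # 1024 vs 2 sampled ROIs per step
```

With only 2 sampled ROIs per step instead of 1024, the box/mask heads receive almost no training signal, which is consistent with the flat `loss_cls_stage*: 0.000` values in the training logs below.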
Hello there,
Thanks for all the brilliant work! I have read a lot of issues and documentation, which helped me a lot, but I didn't find any issue similar to mine.
Let me describe the issue briefly first; I will also paste the code, commands, and logs below. I was trying to apply this work to my own dataset, which includes 100k images, and I registered it by
My plan was to train for 300k iterations, with 2 images per batch, based on
At first, the training seemed to work well. The total loss started at about 16 but dropped sharply to 0.3 after a few hundred iterations. For the remaining ~290k iterations the loss kept oscillating between 0.1 and 0.2. (I suspect this means it didn't really learn, since the loss plateaued before it had even finished one epoch.)
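As a quick sanity check on the schedule (plain arithmetic from the numbers quoted above: 100k images, 2 images per batch, 300k iterations):

```python
# Schedule arithmetic, using the numbers from the report above.
dataset_size = 100_000   # images in the dataset
ims_per_batch = 2        # images per training iteration
max_iter = 300_000       # total training iterations

iters_per_epoch = dataset_size / ims_per_batch    # iterations in one epoch
epochs = max_iter * ims_per_batch / dataset_size  # epochs in the full run

print(iters_per_epoch, epochs)  # 50000.0 iterations/epoch, 6.0 epochs total
```

So a loss that flattens after only a few hundred iterations has seen well under 1% of one epoch's data, even though the full 300k-iteration run covers about 6 epochs.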
And after all the iterations, I tried to evaluate the model, but it clearly didn't work. The code
returns almost nothing except the image width and height.
I tried to visualize the output, but nothing is drawn on the image.
Instructions To Reproduce the Issue:
what changes you made (`git diff`) or what code you wrote:

```
@@ -114,8 +118,28 @@ def setup(args):
    """
    Create configs and perform basic setups.
    """
    cfg = get_cfg()
    cfg.merge_from_file(args.config_file)
    cfg.MODEL.WEIGHTS = "catalog://ImageNetPretrained/FAIR/X-152-32x8d-IN5k"
    predictor = DefaultPredictor(cfg)
    cfg.TEST.AUG.ENABLED = True
    cfg.freeze()
    default_setup(cfg, args)
    return cfg

@@ -123,7 +147,33 @@ def setup(args):
def main(args):
    cfg = setup(args)
    args.eval_only = True
    vg_metadata = MetadataCatalog.get("VG_100K")
    predictor = DefaultPredictor(cfg)
    data_f = "./VG_100K_data/VG_100K/1.jpg"
    im = cv2.imread(data_f)
    res = predictor(im)
    v = Visualizer(im[:, :, ::-1],
                   metadata=vg_metadata,
                   scale=0.8,
                   instance_mode=ColorMode.IMAGE_BW  # remove the colors of unsegmented pixels
                   )
    v = v.draw_instance_predictions(res["instances"].to("cpu"))
    img = v.get_image()[:, :, ::-1]
    cv2.imwrite("vg_test.jpg", img)
    exit()
    if args.eval_only:
        model = Trainer.build_model(cfg)
        DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(

@@ -141,7 +191,7 @@ def main(args):
    consider writing your own training loop or subclassing the trainer.
    """
    trainer = Trainer(cfg)
    trainer.resume_or_load(resume=True)
    if cfg.TEST.AUG.ENABLED:
        trainer.register_hooks(
            [hooks.EvalHook(0, lambda: trainer.test_with_TTA(cfg, trainer.model))]

@@ -150,8 +200,10 @@ def main(args):
```
`if __name__ == "__main__":`
Training by
Evaluate by
```
[12/20 09:21:01] d2.data.dataset_mapper INFO: CropGen used in training: RandomCrop(crop_type='relative_range', crop_size=[0.9, 0.9])
[12/20 09:21:01] d2.data.detection_utils INFO: TransformGens used in training: [ResizeShortestEdge(short_edge_length=(640, 864), max_size=1440, sample_style='range'), RandomFlip()]
[12/20 09:21:01] d2.data.build INFO: Using training sampler TrainingSampler
[12/20 09:21:02] fvcore.common.checkpoint INFO: Loading checkpoint from ./it-30w_lr-0.0025_bat-2/model_0099999.pth
[12/20 09:21:06] fvcore.common.checkpoint INFO: Loading optimizer from ./it-30w_lr-0.0025_bat-2/model_0099999.pth
[12/20 09:21:07] fvcore.common.checkpoint INFO: Loading scheduler from ./it-30w_lr-0.0025_bat-2/model_0099999.pth
[12/20 09:21:07] d2.engine.train_loop INFO: Starting training from iteration 100000
[12/20 09:21:53] d2.utils.events INFO: eta: 5 days, 11:57:20 iter: 100019 total_loss: 0.242 loss_cls_stage0: 0.000 loss_box_reg_stage0: 0.000 loss_cls_stage1: 0.000 loss_box_reg_stage1: 0.000 loss_cls_stage2: 0.000 loss_box_reg_stage2: 0.000 loss_mask: 0.000 loss_rpn_cls: 0.091 loss_rpn_loc: 0.146 time: 2.2641 data_time: 0.0020 lr: 0.000025 max_mem: 5414M
[12/20 09:22:37] d2.utils.events INFO: eta: 4 days, 23:53:27 iter: 100039 total_loss: 0.246 loss_cls_stage0: 0.000 loss_box_reg_stage0: 0.000 loss_cls_stage1: 0.000 loss_box_reg_stage1: 0.000 loss_cls_stage2: 0.000 loss_box_reg_stage2: 0.000 loss_mask: 0.000 loss_rpn_cls: 0.099 loss_rpn_loc: 0.153 time: 2.2323 data_time: 0.0017 lr: 0.000025 max_mem: 5993M
...
...
[12/25 15:35:56] d2.utils.events INFO: eta: 0:02:20 iter: 299939 total_loss: 0.256 loss_cls_stage0: 0.000 loss_box_reg_stage0: 0.000 loss_cls_stage1: 0.000 loss_box_reg_stage1: 0.000 loss_cls_stage2: 0.000 loss_box_reg_stage2: 0.000 loss_mask: 0.000 loss_rpn_cls: 0.094 loss_rpn_loc: 0.156 time: 2.2686 data_time: 0.0019 lr: 0.000025 max_mem: 6436M
[12/25 15:36:41] d2.utils.events INFO: eta: 0:01:34 iter: 299959 total_loss: 0.228 loss_cls_stage0: 0.000 loss_box_reg_stage0: 0.000 loss_cls_stage1: 0.000 loss_box_reg_stage1: 0.000 loss_cls_stage2: 0.000 loss_box_reg_stage2: 0.000 loss_mask: 0.000 loss_rpn_cls: 0.076 loss_rpn_loc: 0.128 time: 2.2686 data_time: 0.0021 lr: 0.000025 max_mem: 6436M
[12/25 15:37:26] d2.utils.events INFO: eta: 0:00:48 iter: 299979 total_loss: 0.207 loss_cls_stage0: 0.000 loss_box_reg_stage0: 0.000 loss_cls_stage1: 0.000 loss_box_reg_stage1: 0.000 loss_cls_stage2: 0.000 loss_box_reg_stage2: 0.000 loss_mask: 0.000 loss_rpn_cls: 0.081 loss_rpn_loc: 0.062 time: 2.2686 data_time: 0.0023 lr: 0.000025 max_mem: 6436M
[12/25 15:38:11] fvcore.common.checkpoint INFO: Saving checkpoint to ./it-30w_lr-0.0025_bat-2_t/model_0299999.pth
[12/25 15:38:14] fvcore.common.checkpoint INFO: Saving checkpoint to ./it-30w_lr-0.0025_bat-2_t/model_final.pth
[12/25 15:38:17] d2.utils.events INFO: eta: 0:00:02 iter: 299999 total_loss: 0.226 loss_cls_stage0: 0.000 loss_box_reg_stage0: 0.000 loss_cls_stage1: 0.000 loss_box_reg_stage1: 0.000 loss_cls_stage2: 0.000 loss_box_reg_stage2: 0.000 loss_mask: 0.000 loss_rpn_cls: 0.082 loss_rpn_loc: 0.123 time: 2.2686 data_time: 0.0020 lr: 0.000025 max_mem: 6436M
[12/25 15:38:17] d2.engine.hooks INFO: Overall training speed: 199997 iterations in 5 days, 6:01:56 (2.2686 s / it)
[12/25 15:38:17] d2.engine.hooks INFO: Total training time: 5 days, 6:17:04 (0:15:07 on hooks)
```
```
...
...
[01/02 15:41:58 d2.evaluation.evaluator]: Start inference on 51 images
[01/02 15:42:27 d2.evaluation.evaluator]: Inference done 50/51. 0.5634 s / img. ETA=0:00:00
[01/02 15:42:27 d2.evaluation.evaluator]: Total inference time: 0:00:26 (0.565217 s / img per device, on 1 devices)
[01/02 15:42:27 d2.evaluation.evaluator]: Total inference pure compute time: 0:00:25 (0.561752 s / img per device, on 1 devices)
[01/02 15:42:27 d2.evaluation.coco_evaluation]: Preparing results for COCO format ...
[01/02 15:42:27 d2.evaluation.coco_evaluation]: Saving results to ./it-30w_lr-0.0025_bat-2_t/inference/coco_instances_results.json
[01/02 15:42:27 d2.evaluation.coco_evaluation]: Evaluating predictions ...
WARNING [01/02 15:42:27 d2.evaluation.coco_evaluation]: No predictions from the model! Set scores to -1
WARNING [01/02 15:42:27 d2.evaluation.coco_evaluation]: No predictions from the model! Set scores to -1
[01/02 15:42:27 d2.engine.defaults]: Evaluation results for small_vg in csv format:
[01/02 15:42:27 d2.evaluation.testing]: copypaste: Task: bbox
[01/02 15:42:27 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[01/02 15:42:27 d2.evaluation.testing]: copypaste: -1.0000,-1.0000,-1.0000,-1.0000,-1.0000,-1.0000
[01/02 15:42:27 d2.evaluation.testing]: copypaste: Task: segm
[01/02 15:42:27 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[01/02 15:42:27 d2.evaluation.testing]: copypaste: -1.0000,-1.0000,-1.0000,-1.0000,-1.0000,-1.0000
ERROR [01/02 15:42:27 d2.evaluation.testing]: Result verification failed!
ERROR [01/02 15:42:27 d2.evaluation.testing]: Expected Results: [['bbox', 'AP', 38.5, 0.2]]
ERROR [01/02 15:42:27 d2.evaluation.testing]: Actual Results: OrderedDict([('bbox', {'AP': -1, 'AP50': -1, 'AP75': -1, 'APl': -1, 'APm': -1, 'APs': -1}), ('segm', {'AP': -1, 'AP50': -1, 'AP75': -1, 'APl': -1, 'APm': -1, 'APs': -1})])
```
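An aside for anyone debugging the same symptom: "No predictions from the model!" can mean the model truly outputs nothing, but it can also mean every output was filtered by the test-time score threshold (in detectron2, `cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST`, default 0.05). A toy illustration of that filtering effect, using hypothetical names rather than the real detectron2 API:

```python
# Illustrative only; `filter_by_score` and the prediction dicts are
# hypothetical stand-ins, not the detectron2 API.
def filter_by_score(predictions, score_thresh):
    """Keep only predictions at or above the score threshold."""
    return [p for p in predictions if p["score"] >= score_thresh]

preds = [{"box": (10, 10, 50, 50), "score": 0.04},
         {"box": (20, 30, 80, 90), "score": 0.02}]

print(len(filter_by_score(preds, 0.5)))   # 0 -> looks like "no predictions"
print(len(filter_by_score(preds, 0.01)))  # 2 -> the model did output something
```

So before concluding the model is broken, it is worth re-running inference with a very low score threshold and checking whether `res["instances"]` is still empty.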
PyTorch built with:
Thanks for reading.