Reproducing Cityscapes results

DdeGeus commented 3 years ago

Thanks for providing the code for Panoptic FCN, I think it’s a very interesting method!

As this repo only provides the training/evaluation code for COCO, I adapted the code to work for Cityscapes. In this fork, I added a data pipeline for Cityscapes, changed the configs as mentioned in the CVPR paper, and made some other minor necessary changes in the code.

However, when I run with the mentioned config, I cannot reproduce the results on Cityscapes as reported in the paper. Especially for the things classes, the difference is quite large. Specifically, this is the difference in performance:

Method	Backbone	PQ	PQ_th	PQ_st
Panoptic FCN (reported in paper)	R-50	59.6	52.1	65.1
Panoptic FCN (reproduced in our fork)	R-50	58.1	49.3	64.6

To be more specific, the following changes are made to the config yaml:

MODEL:
  LOSS_WEIGHT:
    SEGMENT: 4.0
  POSITION_HEAD:
    THING:
      NUM_CLASSES: 8
    STUFF:
      NUM_CLASSES: 12
      WITH_THING: True
  SEM_SEG_HEAD:
    NUM_CLASSES: 12
  KERNEL_HEAD:
    INSTANCE_SCALES: ((1, 128), (64, 256), (128, 512), (256, 1024), (512, 2048),)
  TENSOR_DIM: 150  # Adapted because max inst. per img > 100

SOLVER:
  BASE_LR: 0.02
  IMS_PER_BATCH: 32
  MAX_ITER: 65000

INPUT:
  MIN_SIZE_TRAIN: (512, 2048)
  MIN_SIZE_TRAIN_SAMPLING: "range"
  MIN_SIZE_TEST: 1024
  MAX_SIZE_TRAIN: 4096
  MAX_SIZE_TEST: 2048
  CROP:
    ENABLED: True
    TYPE: "absolute"
    SIZE: (512, 1024)

Is there something that I am missing? Are there further changes that need to be made to the code/config to be able to reproduce the Cityscapes result?

In issue #20 I read that you mention that the SEM_SEG_HEAD should be trained on all 19 classes instead of just the 12 classes (11 stuff + 1 for all things). Is this how you train it to produce the results you report in the paper? Did you notice a difference in performance for both versions?

Looking forward to your response. Thanks!

yanwei-li commented 3 years ago

Hi, thanks for your interest in our work! The differences with your config are given below. Most of them are adjusted according to the data properties in the cityscapes dataset. And we use 19 classes for stuff training, which would bring some performance gain.

MODEL:
  POSITION_HEAD:
    THING:
      NUM_CLASSES: 19
      THRES: 0.01
      TOP_NUM: 200
    STUFF:
      THRES: 0.1
  INFERENCE:
    INST_THRES=0.5
    SIMILAR_THRES=0.97
    COMBINE:
       STUFF_AREA_LIMIT=2048

INPUT:  
  MIN_SIZE_TRAIN: (512, 768, 1024, 1152, 1216, 1344, 1408, 1536, 1664, 1728, 1856, 1920, 2048)
  MIN_SIZE_TRAIN_SAMPLING: "choice"
  MIN_SIZE_TEST: 1024 
  MAX_SIZE_TRAIN: 4096  
  MAX_SIZE_TEST: 2048
  CROP:
    ENABLED: True
    TYPE: "with_instance" # implemented by ourself to ensure some instances in the crops
    SIZE: (512, 1024)

Maybe you can try this for the result. Hope this could help you.

ShihuaHuang95 commented 3 years ago

@DdeGeus Have you reproduced the results?

DdeGeus commented 3 years ago

Thanks for the quick response, and for providing the more detailed hyperparameters! I will apply the changes and check the performance.

@ShihuaHuang95 I will train the network in the next days and report the results here.

ShihuaHuang95 commented 3 years ago

@DdeGeus Look forward to it!

DdeGeus commented 3 years ago

I applied the changes to the config as you indicated, except for the with_instance crop type. The performance improved, but I'm still not able to reproduce the results as reported in the paper. The things PQ has increased signficantly, but the stuff PQ has decreased.

More specifically:

Method	Backbone	PQ	PQ_th	PQ_st
Panoptic FCN (reported in paper)	R-50	59.6	52.1	65.1
Panoptic FCN (reproduced v1)	R-50	58.1	49.3	64.6
Panoptic FCN (reproduced new)	R-50	58.8	51.5	64.1

Could you think of a reason why this is the case? I guess the things PQ could be improved slightly if I implement the with_instance crop data augmentation, but I suppose this will not impact the performance on stuff classes.

In an attempt to find the potential issue, I visualize the predictions. Below I show an example with a typical result, and the corresponding GT (image is _frankfurt_000000015676). In the prediction, the boundaries of individual things instances are shown in white. Especially for things classes, the network often groups objects that are close to each other, and belong to the same class (on the right side). Additionally, it sometimes predicts weird mask shapes (see object near the center). For stuff, I cannot find such clear trends.

Do you have an idea how to improve the performance so that the PQ matches the reported scores? If it helps, the most recent code is in this fork.

frankfurt_000000_015676 frankfurt_000000_015676_gtFine_instanceIds

yanwei-li commented 3 years ago

Hi, the only difference I can find is the gradient value, which is set to 15.0 in my own config. Actually, the segmentation result in Cityscapes dataset could be unstable, that's why we use COCO for ablation study. Whatever, I'll try to train your repo on my machine and reply if have the result.

DdeGeus commented 3 years ago

Okay that is good to know. Thanks for trying to train my repo! In the meantime, I will try to implement the "with_instance" cropping method, set the gradient clipping value to 15.0, and try some more runs.

yanwei-li commented 3 years ago

Hi, I have tried your repo with gradient clipping value 15.0. And the result is reported below. As presented in the table, PQ_st is close to the reported result. And the PQ_th could be attributed to the lack of instance annotation (different crop manner).

	PQ	SQ	RQ	#categories
All	59.233	79.738	73.089	19
Things	51.550	78.339	65.546	8
Stuff	64.820	80.755	78.574	11

DdeGeus commented 3 years ago

Over the past weeks, I tried various configs, and I also implemented the "with_instance" cropping method. However, I could still not achieve the reported performance, especially for things. I did manage to boost the stuff performance, by doubling the learning rate and using a batch size of 16. However, this hurt the performance on things classes.

The results:

Method	Backbone	PQ	PQ_th	PQ_st
Panoptic FCN (reported in paper)	R-50	59.6	52.1	65.1
Panoptic FCN (with orig config)	R-50	59.0	51.3	64.6
Panoptic FCN (130k steps, batch size 16)	R-50	59.4	50.9	65.5

As you can see, when compared with the previous best result I got (51.5 PQ for things), the cropping method has not resulted in a better performance. Currently, I have implemented it as follows: a crop should contain at least 1 pixel of a things instance. I also tried different thresholds (100 or 1000), but this did not give better results. Is this also how you implement this, or is it different? And would it perhaps be possible to share your code for this cropping method?

I also noticed that, like you mentioned, the performance on Cityscapes is very unstable. I haven't found a solution for this yet. Have you tried lower learning rates/longer warmup/better normalization or other methods to fix this?

yanwei-li commented 3 years ago

Hi, thanks for your effort! I'll try to convert the "with_instance" cropping method to Detectron2 and give the response in several days. And I'll also try to reproduce the reported result in Detectron2 with the modified config.

yanwei-li commented 2 years ago

Hi, I have implement with_instance to Detectron2 as given in the attachment. However, I found that the performance on Detectron2 is 59.4 PQ [model] [metrics] (CLIP_VALUE=5.0). I'm not sure if it is the reason for the difference in the platform. Whatever, I'll change the performance with R50 to 59.4 PQ in the revision.

# please add and declare this function to detectron2/data/transforms/augmentation_impl.py
# use this augmentation in your own dataset_mapper
class RandomCropWithInstance(RandomCrop):
    """
    Make sure the cropping region contains the center of a random instance from annotations.
    """
    def get_transform(self, image, boxes=None):
        # boxes: list of boxes with mode BoxMode.XYXY_ABS
        h, w = image.shape[:2]
        croph, cropw = self.get_crop_size((h, w))

        assert h >= croph and w >= cropw, "Shape computation in {} has bugs.".format(self)
        offset_range_h = max(h - croph, 0)
        offset_range_w = max(w - cropw, 0)
        # Make sure there is always at least one instance in the image
        assert boxes is not None, "Can not get annotations infos."
        if len(boxes) == 0:
            h0 = np.random.randint(h - croph + 1)
            w0 = np.random.randint(w - cropw + 1) 
        else:
            rand_idx = np.random.randint(0, high=len(boxes))
            bbox = torch.tensor(boxes[rand_idx])
            center_xy = (bbox[:2] + bbox[2:]) / 2.0
            offset_range_h_min = max(center_xy[1] - croph, 0)
            offset_range_w_min = max(center_xy[0] - cropw, 0)
            offset_range_h_max = max(min(offset_range_h, center_xy[1] - 1), offset_range_h_min)
            offset_range_w_max = max(min(offset_range_w, center_xy[0] - 1), offset_range_w_min)

            h0 = np.random.randint(offset_range_h_min, offset_range_h_max + 1)
            w0 = np.random.randint(offset_range_w_min, offset_range_w_max + 1)
        return CropTransform(w0, h0, cropw, croph)

razvanbarbura commented 2 years ago

HI @DdeGeus , thank you for the Cityscapes fork you have done. I have trained Panoptic FCN using your work here. I tried to use the demo.py from detectron2 to infer the results on new images but I got the following errors described here. Can you tell me please what should I change to be able to use the demo.py from detectron2 for your work? I saw that in this thread you already uploaded images on which the inference process was run.

dvlab-research / PanopticFCN

Reproducing Cityscapes results #21