Closed DdeGeus closed 2 years ago
Hi, thanks for your interest in our work! The differences with your config are given below. Most of them are adjusted according to the data properties of the Cityscapes dataset. We also use 19 classes for thing training, which brings some performance gain.
```yaml
MODEL:
  POSITION_HEAD:
    THING:
      NUM_CLASSES: 19
      THRES: 0.01
      TOP_NUM: 200
    STUFF:
      THRES: 0.1
  INFERENCE:
    INST_THRES: 0.5
    SIMILAR_THRES: 0.97
    COMBINE:
      STUFF_AREA_LIMIT: 2048
INPUT:
  MIN_SIZE_TRAIN: (512, 768, 1024, 1152, 1216, 1344, 1408, 1536, 1664, 1728, 1856, 1920, 2048)
  MIN_SIZE_TRAIN_SAMPLING: "choice"
  MIN_SIZE_TEST: 1024
  MAX_SIZE_TRAIN: 4096
  MAX_SIZE_TEST: 2048
  CROP:
    ENABLED: True
    TYPE: "with_instance"  # implemented by ourselves to ensure some instances in the crops
    SIZE: (512, 1024)
```
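For reference, `MIN_SIZE_TRAIN_SAMPLING: "choice"` means one shorter-edge target is picked per image from `MIN_SIZE_TRAIN`, with the longer edge capped at `MAX_SIZE_TRAIN`. A minimal standalone sketch of that behavior (not the actual Detectron2 implementation; the function name is ours):

```python
import random

def sample_resize_shape(h, w, min_sizes, max_size):
    """Pick a shorter-edge target from min_sizes ("choice" sampling),
    then scale so the longer edge never exceeds max_size."""
    size = random.choice(min_sizes)
    scale = size / min(h, w)
    if max(h, w) * scale > max_size:
        scale = max_size / max(h, w)
    return int(round(h * scale)), int(round(w * scale))

# A full-resolution Cityscapes frame is 1024 x 2048
new_h, new_w = sample_resize_shape(1024, 2048, (512, 768, 1024, 1536, 2048), 4096)
```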
Maybe you can try this config. I hope it helps.
@DdeGeus Have you reproduced the results?
Thanks for the quick response, and for providing the more detailed hyperparameters! I will apply the changes and check the performance.
@ShihuaHuang95 I will train the network in the next days and report the results here.
@DdeGeus Look forward to it!
I applied the changes to the config as you indicated, except for the `with_instance` crop type. The performance improved, but I'm still not able to reproduce the results reported in the paper. The things PQ has increased significantly, but the stuff PQ has decreased.
More specifically:
Method | Backbone | PQ | PQ_th | PQ_st |
---|---|---|---|---|
Panoptic FCN (reported in paper) | R-50 | 59.6 | 52.1 | 65.1 |
Panoptic FCN (reproduced v1) | R-50 | 58.1 | 49.3 | 64.6 |
Panoptic FCN (reproduced new) | R-50 | 58.8 | 51.5 | 64.1 |
Could you think of a reason why this is the case? I guess the things PQ could improve slightly once I implement the `with_instance` crop augmentation, but I suppose this will not impact the performance on stuff classes.
In an attempt to find the potential issue, I visualized the predictions. Below I show an example with a typical result and the corresponding GT (image _frankfurt_000000015676). In the prediction, the boundaries of individual thing instances are shown in white. Especially for things classes, the network often groups objects that are close to each other and belong to the same class (on the right side). Additionally, it sometimes predicts odd mask shapes (see the object near the center). For stuff, I cannot find such clear trends.
Do you have an idea how to improve the performance so that the PQ matches the reported scores? If it helps, the most recent code is in this fork.
Hi, the only difference I can find is the gradient clipping value, which is set to 15.0 in my own config. Actually, segmentation results on the Cityscapes dataset can be unstable, which is why we use COCO for the ablation study. Anyway, I'll try to train your repo on my machine and reply once I have the result.
Okay, that is good to know. Thanks for trying to train my repo! In the meantime, I will try to implement the "with_instance" cropping method, set the gradient clipping value to 15.0, and try some more runs.
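In a Detectron2-style config, I would set the value clipping roughly as follows (exact key names may differ between versions; treat this as a sketch):

```yaml
SOLVER:
  CLIP_GRADIENTS:
    ENABLED: True
    CLIP_TYPE: "value"
    CLIP_VALUE: 15.0
```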
Hi, I have tried your repo with gradient clipping value 15.0, and the result is reported below. As presented in the table, PQ_st is close to the reported result, and the gap in PQ_th could be attributed to the lack of instance annotations (the different crop manner).
| PQ | SQ | RQ | #categories |
---|---|---|---|---|
All | 59.233 | 79.738 | 73.089 | 19 |
Things | 51.550 | 78.339 | 65.546 | 8 |
Stuff | 64.820 | 80.755 | 78.574 | 11 |
Over the past weeks, I tried various configs, and I also implemented the "with_instance" cropping method. However, I still could not achieve the reported performance, especially for things. I did manage to boost the stuff performance by doubling the learning rate and using a batch size of 16, but this hurt the performance on things classes.
The results:
Method | Backbone | PQ | PQ_th | PQ_st |
---|---|---|---|---|
Panoptic FCN (reported in paper) | R-50 | 59.6 | 52.1 | 65.1 |
Panoptic FCN (with orig config) | R-50 | 59.0 | 51.3 | 64.6 |
Panoptic FCN (130k steps, batch size 16) | R-50 | 59.4 | 50.9 | 65.5 |
As you can see, compared with the previous best result I got (51.5 PQ for things), the cropping method has not improved performance. Currently, I have implemented it as follows: a crop must contain at least 1 pixel of a things instance. I also tried higher thresholds (100 or 1000 pixels), but this did not give better results. Is this also how you implement it, or is it different? And would it perhaps be possible to share your code for this cropping method?
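To make the threshold variants concrete, the acceptance check can be sketched like this (a standalone numpy sketch; `crop_has_instance` is a hypothetical helper, not code from either repo):

```python
import numpy as np

def crop_has_instance(inst_mask, y0, x0, croph, cropw, min_pixels=1):
    """Accept a candidate crop only if it covers at least `min_pixels`
    pixels belonging to any thing instance (nonzero in inst_mask)."""
    window = inst_mask[y0:y0 + croph, x0:x0 + cropw]
    return int((window > 0).sum()) >= min_pixels

# Toy example: a single 10x10 instance near the top-left corner
mask = np.zeros((100, 200), dtype=np.uint8)
mask[5:15, 5:15] = 1
```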
I also noticed that, as you mentioned, performance on Cityscapes is very unstable. I haven't found a solution for this yet. Have you tried lower learning rates, longer warmup, better normalization, or other methods to fix this?
Hi, thanks for your effort! I'll try to port the "with_instance" cropping method to Detectron2 and respond in several days. I'll also try to reproduce the reported result in Detectron2 with the modified config.
Hi, I have implemented `with_instance` in Detectron2 as given in the attachment. However, I found that the performance in Detectron2 is 59.4 PQ [model] [metrics] (`CLIP_VALUE=5.0`). I'm not sure whether the platform difference is the reason. In any case, I'll change the R50 performance to 59.4 PQ in the revision.
```python
# please add and declare this function in detectron2/data/transforms/augmentation_impl.py
# and use this augmentation in your own dataset_mapper
import numpy as np
import torch

from fvcore.transforms.transform import CropTransform
from detectron2.data.transforms import RandomCrop


class RandomCropWithInstance(RandomCrop):
    """
    Make sure the cropping region contains the center of a random instance
    from the annotations.
    """

    def get_transform(self, image, boxes=None):
        # boxes: list of boxes with mode BoxMode.XYXY_ABS
        h, w = image.shape[:2]
        croph, cropw = self.get_crop_size((h, w))
        assert h >= croph and w >= cropw, "Shape computation in {} has bugs.".format(self)
        # Largest valid top-left offsets for a crop that stays inside the image
        offset_range_h = max(h - croph, 0)
        offset_range_w = max(w - cropw, 0)
        # Make sure there is always at least one instance in the crop
        assert boxes is not None, "Cannot get annotation infos."
        if len(boxes) == 0:
            # No instances in this image: fall back to a plain random crop
            h0 = np.random.randint(h - croph + 1)
            w0 = np.random.randint(w - cropw + 1)
        else:
            # Pick a random instance and constrain the offsets so that
            # its box center falls inside the crop window
            rand_idx = np.random.randint(0, high=len(boxes))
            bbox = torch.tensor(boxes[rand_idx])
            center_xy = (bbox[:2] + bbox[2:]) / 2.0
            offset_range_h_min = max(center_xy[1] - croph, 0)
            offset_range_w_min = max(center_xy[0] - cropw, 0)
            offset_range_h_max = max(min(offset_range_h, center_xy[1] - 1), offset_range_h_min)
            offset_range_w_max = max(min(offset_range_w, center_xy[0] - 1), offset_range_w_min)
            h0 = np.random.randint(offset_range_h_min, offset_range_h_max + 1)
            w0 = np.random.randint(offset_range_w_min, offset_range_w_max + 1)
        return CropTransform(w0, h0, cropw, croph)
```
Hi @DdeGeus, thank you for the Cityscapes fork you have made. I have trained Panoptic FCN using your work here. I tried to use demo.py from Detectron2 to run inference on new images, but I got the errors described here. Can you please tell me what I should change to be able to use demo.py from Detectron2 with your work? I saw that earlier in this thread you uploaded images on which inference was run.
Thanks for providing the code for Panoptic FCN, I think it’s a very interesting method!
As this repo only provides the training/evaluation code for COCO, I adapted the code to work for Cityscapes. In this fork, I added a data pipeline for Cityscapes, changed the configs as mentioned in the CVPR paper, and made some other minor necessary changes in the code.
However, when I run with the mentioned config, I cannot reproduce the results on Cityscapes as reported in the paper. Especially for the things classes, the difference is quite large. Specifically, this is the difference in performance:
To be more specific, the following changes are made to the config yaml:
Is there something that I am missing? Are there further changes that need to be made to the code/config to be able to reproduce the Cityscapes result?
In issue #20 I read that you mention that the SEM_SEG_HEAD should be trained on all 19 classes instead of just the 12 classes (11 stuff + 1 for all things). Is this how you train it to produce the results you report in the paper? Did you notice a difference in performance between the two versions?

Looking forward to your response. Thanks!