dbolya / yolact

A simple, fully convolutional model for real-time instance segmentation.
MIT License

Long vertical object low confidence #345

Open mateoKutnjak opened 4 years ago

mateoKutnjak commented 4 years ago

Hi. Thank you for your work. The speed and accuracy of YOLACT++ are really amazing.

I have a problem detecting a long, vertical object. Detection is not an issue when the object is in other orientations, but the network has trouble detecting it when it is vertically aligned with the y axis of the camera.

I believe it is due to the configuration in config.py, but I am unable to find the changes needed to address the issue.

Please provide your thoughts on this behavior. Thank you in advance.

EDIT: This is the result when the object is vertical. It has confidence barely above 0.0, but in other orientations the confidence is above 0.50 (this data shows metrics after 10000 training steps). Screenshot from 2020-02-18 13-12-25 Screenshot from 2020-02-18 13-32-26

abhigoku10 commented 4 years ago

@mateoKutnjak @dbolya I am also facing issues when I try to detect vertical objects such as poles, i.e. vertical lines along the y axis. Is there any explanation for this, @dbolya?

dbolya commented 4 years ago

Very interesting, and this could actually be fairly important. What's the aspect ratio of your images? That might have something to do with it.

Also, as an additional tip, you probably want to use --cross_nms=True in eval.py for your use-case.

abhigoku10 commented 4 years ago

@dbolya I was training with images of resolution 512x640, and the detection of vertical objects was not consistent. I trained on a train/val set of 4k images using a ResNet-50 feature backbone for 500k (5 lakh) iterations.

mateoKutnjak commented 4 years ago

@dbolya The image aspect ratio is 16:9 (1280x720). Maybe some of the following configuration options are relevant:

'preserve_aspect_ratio': False,
'pred_aspect_ratios': [[[1, 1/2, 2]]] * 5,
'pred_scales': [[i * 2 ** (j / 3.0) for j in range(3)] for i in [24, 48, 96, 192, 384]],

abhigoku10 commented 4 years ago

@dbolya Any idea how we can debug this behaviour?

mateoKutnjak commented 4 years ago

It is the only flaw of the architecture, and because of it the model is not suitable for my use-case. I think it can be solved with changes inside config.py. @dbolya

dbolya commented 4 years ago

@mateoKutnjak I believe the issue is that the images are being resized to squares, so your vertical resolution is reduced by a lot. I'll be adding support for fixed non-square images in a week or two, and that should fix this.

abhigoku10 commented 4 years ago

@dbolya I am also having this issue when trying to detect objects like poles. How would using fixed non-square images resolve it? Can you please explain? And if it solves the issue for vertical objects, will it affect the detection of other objects?

mateoKutnjak commented 4 years ago

@dbolya @abhigoku10 I gained a significant performance boost when I resized the input images from 1280x720 to 550x310 while keeping the aspect ratio unchanged (16:9; 720 × 550/1280 ≈ 310). The rest of the input is padded with zeros (at the top and bottom of the RGB image).

After 50000 steps, evaluation shows a confidence of 0.40 for the vertical object that previously had a confidence of 0.05. Some hyperparameter adjustments could still be made to fine-tune the network, but the results are acceptable for my use-case.

UPDATE: These are the evaluation results after 104000 steps (on a synthetic dataset created on the principles of domain randomization):

       |  all  |  .50  |  .55  |  .60  |  .65  |  .70  |  .75  |  .80  |  .85  |  .90  |  .95  |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
box    | 83.24 | 99.91 | 98.93 | 98.93 | 98.93 | 95.46 | 94.32 | 91.25 | 85.16 | 63.39 |  6.15 |
mask   | 74.72 | 91.56 | 90.06 | 88.67 | 86.02 | 79.68 | 79.11 | 74.94 | 64.90 | 53.05 | 39.20 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+

When inference is tested on a real camera stream, long vertical objects are found with a confidence of around 0.70.

abhigoku10 commented 4 years ago

@mateoKutnjak This result is good. Can you please elaborate on the changes you made to achieve it? That would be very helpful to me. Thanks in advance.

mateoKutnjak commented 4 years ago

I resized the input images from 1280x720 to 550x310 (keeping the aspect ratio constant) and padded the rest of the image with zeros to get a final image dimension of 550x550. I am inheriting yolact_plus_base_config with resnet101_dcn_inter_backbone. I have also set 'use_maskiou': True, 'discard_mask_area': -1 and 'use_mask_scoring': True.
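
For reference, a sketch of what those overrides might look like in data/config.py, assuming the stock yolact config helpers (my_vertical_config is a hypothetical name, not from this thread):

my_vertical_config = yolact_plus_base_config.copy({
    'name': 'my_vertical',       # hypothetical config name
    'use_maskiou': True,         # rescore masks with the MaskIoU head
    'discard_mask_area': -1,     # never discard instances by mask area
    'use_mask_scoring': True,
})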

breznak commented 4 years ago

I resized the input images from 1280x720 to 550x310 (keeping the aspect ratio constant) and padded the rest of the image with zeros to get a final image dimension of 550x550.

Is this manual process any better than having yolact resize the images with keepAspectRatio=True?

Thank you for sharing your research on this topic!

mateoKutnjak commented 4 years ago

I tried keepAspectRatio=True with the 1280x720 dimensions and did not see any improvement.

mateoKutnjak commented 4 years ago

I also suggest mask dilation of long vertical objects with cv2.dilate. As they appear wider, the confidence stays consistent across orientations, and the masks can later be eroded with cv2.erode. I am getting significantly better results with this approach.
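
A minimal sketch of that trick (the kernel shape and iteration count are assumptions, chosen so the dilation spreads the mask mostly sideways):

import cv2
import numpy as np

# synthetic thin vertical object (4 px wide, 300 px tall) for illustration
mask = np.zeros((550, 550), np.uint8)
mask[100:400, 273:277] = 1

kernel = np.ones((3, 7), np.uint8)                 # 3 rows x 7 cols: widens more than it lengthens
dilated = cv2.dilate(mask, kernel, iterations=2)   # use this widened mask as ground truth

# optional: after training on dilated masks, thin the predictions back
restored = cv2.erode(dilated, kernel, iterations=2)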

abhigoku10 commented 4 years ago

@dbolya @mateoKutnjak I am facing the error below when I train only vertical objects with yolact++, but I do not face it when I train with yolact. I have double-checked the data annotations as well.

Multiple GPUs detected! Turning off JIT.
Per-GPU batch size is less than the recommended limit for batch norm. Disabling batch norm.
loading annotations into memory...
Done (t=0.07s)
creating index...
index created!
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
Initializing weights...
Begin training!

[ 0]    0 || B: 5.665 | C: 11.077 | M: 7.510 | S: 1.452 | T: 25.704 || ETA: 13 days, 8:32:46 || timer: 5.770
[ 0]   10 || B: 5.379 | C: 6.902 | M: 6.530 | S: 1.207 | T: 20.018 || ETA: 2 days, 3:23:00 || timer: 0.407
[ 0]   20 || B: 5.169 | C: 5.997 | M: 5.993 | S: 0.855 | I: 0.042 | T: 18.056 || ETA: 1 day, 13:52:03 || timer: 0.409
[ 0]   30 || B: 4.984 | C: 5.593 | M: 5.372 | S: 0.612 | I: 0.029 | T: 16.590 || ETA: 1 day, 9:05:07 || timer: 0.410
[ 0]   40 || B: 4.929 | C: 5.291 | M: 5.084 | S: 0.480 | I: 0.022 | T: 15.806 || ETA: 1 day, 6:37:00 || timer: 0.418
[ 0]   50 || B: 4.853 | C: 5.052 | M: 4.879 | S: 0.397 | I: 0.017 | T: 15.198 || ETA: 1 day, 5:22:15 || timer: 0.414
[ 0]   60 || B: 4.795 | C: 4.847 | M: 4.715 | S: 0.339 | I: 0.014 | T: 14.710 || ETA: 1 day, 4:27:56 || timer: 0.412
[ 0]   70 || B: 4.800 | C: 4.672 | M: 4.622 | S: 0.297 | I: 0.012 | T: 14.404 || ETA: 1 day, 3:49:32 || timer: 0.433
[ 0]   80 || B: 4.789 | C: 4.527 | M: 4.525 | S: 0.267 | I: 0.011 | T: 14.119 || ETA: 1 day, 3:16:48 || timer: 0.415
[ 0]   90 || B: 4.774 | C: 4.388 | M: 4.485 | S: 0.243 | I: 0.010 | T: 13.900 || ETA: 1 day, 2:53:04 || timer: 0.420
[ 0]  100 || B: 4.752 | C: 4.208 | M: 4.384 | S: 0.210 | I: 0.009 | T: 13.563 || ETA: 1 day, 2:31:43 || timer: 0.426
[ 0]  110 || B: 4.705 | C: 3.876 | M: 4.147 | S: 0.097 | I: 0.002 | T: 12.827 || ETA: 1 day, 2:17:59 || timer: 0.429
[ 0]  120 || B: 4.651 | C: 3.685 | M: 3.991 | S: 0.055 | I: 0.001 | T: 12.382 || ETA: 1 day, 2:08:11 || timer: 0.439
[ 0]  130 || B: 4.634 | C: 3.506 | M: 3.957 | S: 0.049 | I: 0.000 | T: 12.147 || ETA: 1 day, 2:00:47 || timer: 0.442
[ 0]  140 || B: 4.624 | C: 3.365 | M: 3.917 | S: 0.045 | I: 0.000 | T: 11.951 || ETA: 1 day, 1:53:51 || timer: 0.430
[ 0]  150 || B: 4.611 | C: 3.242 | M: 3.907 | S: 0.044 | I: 0.000 | T: 11.804 || ETA: 1 day, 1:47:11 || timer: 0.435
[ 0]  160 || B: 4.626 | C: 3.147 | M: 3.900 | S: 0.044 | I: 0.000 | T: 11.717 || ETA: 1 day, 1:40:45 || timer: 0.424
[ 0]  170 || B: 4.588 | C: 3.076 | M: 3.895 | S: 0.044 | I: 0.000 | T: 11.603 || ETA: 1 day, 1:36:47 || timer: 0.439
[ 0]  180 || B: 4.581 | C: 3.027 | M: 3.912 | S: 0.043 | I: 0.000 | T: 11.563 || ETA: 1 day, 1:32:59 || timer: 0.446
[ 0]  190 || B: 4.559 | C: 2.992 | M: 3.865 | S: 0.042 | I: 0.000 | T: 11.458 || ETA: 1 day, 1:29:08 || timer: 0.446
[ 0]  200 || B: 4.543 | C: 2.958 | M: 3.825 | S: 0.041 | I: 0.000 | T: 11.367 || ETA: 1 day, 1:25:36 || timer: 0.455

Traceback (most recent call last):
  File "train.py", line 504, in <module>
    train()
  File "train.py", line 307, in train
    losses = net(datum)
  File "/home/Documents/venvs/VA/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/Documents/venvs/VA/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 144, in forward
    return self.gather(outputs, self.output_device)
  File "train.py", line 168, in gather
    out[k] = torch.stack([output[k].to(output_device) for output in outputs])
  File "train.py", line 168, in <listcomp>
    out[k] = torch.stack([output[k].to(output_device) for output in outputs])
KeyError: 'I'

qjziyou commented 4 years ago

@abhigoku10 I have the same problem when I train yolact++ with the COCO dataset. Did you fix it?

abhigoku10 commented 4 years ago

@qjziyou Yes, I was able to fix it for my custom dataset. You shouldn't get this error with the COCO dataset, but I followed #259.

qjziyou commented 4 years ago

@abhigoku10 Thanks for your reply, #259 solved my problem.

elfpattern commented 4 years ago

@dbolya @mateoKutnjak I have the same question. My pictures are about (600, 600) in size, so the h:w ratio is nearly 1, and they are resized to (550, 550). Then: image2020-3-14_17-22-33

image2020-3-14_17-24-52

I am inheriting yolact_plus_base_config with resnet50_dcn_inter_backbone. I don't think setting 'use_maskiou': True, 'discard_mask_area': -1 and 'use_mask_scoring': True is useful here, because mask rescoring (as in Mask Scoring R-CNN) is meant to improve mask quality, and the object in the image above is not detected at all.

Is there a good solution?

mateoKutnjak commented 4 years ago

@elfpattern Try widening the mask in preprocessing (with kernel operations) and optionally narrowing it in postprocessing. That way the confidence will be greater and the object will be detected more easily.

abhigoku10 commented 4 years ago

@mateoKutnjak But widening the mask in preprocessing will still end up producing one more detection at the top right corner, right? How do we minimize that?

mateoKutnjak commented 4 years ago

If you want to find the vertical crack more accurately, you should widen the object mask of the vertical object only, not the others. As for the top-right-corner mask, I am not sure what the expected prediction is there, so I cannot say what you should do about it.

abhigoku10 commented 4 years ago

@mateoKutnjak Okay, but were you referring to widening the mask in post-processing at inference time, or in pre-processing for training? I am trying to detect poles but have not been successful.

mateoKutnjak commented 4 years ago

@elfpattern Try widening the mask in preprocessing (with kernel operations) and optionally narrowing it in postprocessing. That way the confidence will be greater and the object will be detected more easily.

Widen the mask of the vertical object in the ground truth and feed it to the model.

elfpattern commented 4 years ago

@mateoKutnjak Got it, I will try. But what is the intention?

mateoKutnjak commented 4 years ago

@mateoKutnjak Got it, I will try. But what is the intention?

Easier detection and greater confidence. I used this method to solve the problem of detecting the tick on a pressure gauge (when the tick was in a vertical position, the confidence dropped significantly and the object could not be detected below a certain threshold).

elfpattern commented 4 years ago

@mateoKutnjak OK. Today I tried another idea: I set the anchor aspect ratios to 1:4, 1:1, 4:1, and it succeeded. I will try your idea too. Thanks.
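
For anyone reproducing this: in yolact's config.py an anchor aspect ratio is roughly width/height, so that change would presumably be an override like the following (a sketch, not confirmed by the commenter):

'pred_aspect_ratios': [[[1/4, 1, 4]]] * 5,  # 1:4 (tall), 1:1, 4:1 (wide) at every prediction layer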

VinniaKemala commented 4 years ago

@mateoKutnjak @dbolya Hi, thank you for your suggestion about resizing and zero padding the image.

I tried your idea and modified class Resize(object) in augmentations.py. This is my code:

class Resize(object):
    """ Resize and pad with zeros to get a square image of size [max_dim, max_dim]  """
    @staticmethod
    def calc_size_preserve_ar(img_w, img_h, max_size):        
        # Does it exceed max dim?
        img_max = max(img_h, img_w)
        scale = max_size / img_max
        w = img_w * scale
        h = img_h * scale       
        return int(w), int(h)

    def __init__(self, resize_gt=True):
        self.resize_gt = resize_gt
        self.max_size = cfg.max_size 

    def __call__(self, image, masks, boxes, labels=None):
        img_h, img_w, depth = image.shape

        width, height = Resize.calc_size_preserve_ar(img_w, img_h, self.max_size)
        image = cv2.resize(image, (width, height))

        # randomly offset the padding on every call (so the resized image is not always centered);
        # note this also runs when resize_gt=False, i.e. during evaluation
        top_pad = random.uniform(0, (self.max_size - height) // 2)
        left_pad = random.uniform(0, (self.max_size - width) // 2)

        expand_image = np.zeros(
            (int(self.max_size), int(self.max_size), depth),
            dtype=image.dtype)

        expand_image[int(top_pad):int(top_pad + height),
                     int(left_pad):int(left_pad + width)] = image
        image = expand_image

        if self.resize_gt:
            masks = masks.transpose((1, 2, 0))
            masks = cv2.resize(masks, (width, height))

            # OpenCV resizes a (w,h,1) array to (s,s), so fix that
            if len(masks.shape) == 2:
                masks = np.expand_dims(masks, 0)
            else:
                masks = masks.transpose((2, 0, 1))

            expand_masks = np.zeros(
                (masks.shape[0], int(self.max_size), int(self.max_size)),
                dtype=masks.dtype)
            expand_masks[:,int(top_pad):int(top_pad + height),
                           int(left_pad):int(left_pad + width)] = masks
            masks = expand_masks

            # extract boxes from masks
            boxes = boxes.copy()
            boxes = extract_bboxes(masks)

        # Discard boxes that are smaller than we'd like
        w = boxes[:, 2] - boxes[:, 0]
        h = boxes[:, 3] - boxes[:, 1]

        keep = (w > cfg.discard_box_width) * (h > cfg.discard_box_height)
        masks = masks[keep]
        boxes = boxes[keep]
        labels['labels'] = labels['labels'][keep]
        labels['num_crowds'] = (labels['labels'] < 0).sum()

        return image, masks, boxes, labels

But I think there's a bug in my code, because I get this weird val mAP. I suspect it's somewhere in the self.resize_gt=False path, since that affects class BaseTransform(object), the transform used when evaluating.

Calculating mAP...

       |  all  |  .50  |  .55  |  .60  |  .65  |  .70  |  .75  |  .80  |  .85  |  .90  |  .95  |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
   box |  1.07 |  4.73 |  2.85 |  1.55 |  0.83 |  0.43 |  0.20 |  0.12 |  0.02 |  0.00 |  0.00 |
  mask |  0.39 |  1.75 |  1.03 |  0.60 |  0.31 |  0.15 |  0.07 |  0.02 |  0.00 |  0.00 |  0.00 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+

Could you share your code? Or do you have any suggestions for fixing mine?

Thank you.

abhigoku10 commented 4 years ago

@mateoKutnjak OK. Today I tried another idea: I set the anchor aspect ratios to 1:4, 1:1, 4:1, and it succeeded. I will try your idea too. Thanks.

Did you generate those anchors only for long vertical objects, or did the training set also contain other objects with more width?

mateoKutnjak commented 4 years ago

@VinniaKemala My dataset generation consists of Blender rendering followed by converting the raw masks and RGB images to COCO format. Here is my code for resizing the raw RGB image and mask before the further conversion to COCO format with pycocotools:

import cv2
import numpy as np

def resize(rgb, mask, w, h):
    """Resize (rgb, mask) to fit inside w x h, preserving aspect ratio,
    then zero-pad symmetrically up to w x h."""
    original_h = rgb.shape[0]
    original_w = rgb.shape[1]

    percent_decrease_w = w / original_w
    percent_decrease_h = h / original_h

    # scale by the more restrictive factor so the whole image fits
    min_decrease = min(percent_decrease_h, percent_decrease_w)

    # round the new dimensions to even numbers
    new_w = round(original_w * min_decrease / 2) * 2
    new_h = round(original_h * min_decrease / 2) * 2

    rgb = cv2.resize(rgb, (new_w, new_h), interpolation=cv2.INTER_AREA)
    mask = cv2.resize(mask, (new_w, new_h), interpolation=cv2.INTER_AREA)

    # split the leftover space evenly between the two sides
    vertical_pad = max(0, h - new_h) // 2
    horizontal_pad = max(0, w - new_w) // 2

    rgb = np.pad(rgb, ((vertical_pad, vertical_pad), (horizontal_pad, horizontal_pad), (0, 0)),
                 mode='constant', constant_values=0)
    mask = np.pad(mask, ((vertical_pad, vertical_pad), (horizontal_pad, horizontal_pad)),
                  mode='constant', constant_values=0)
    return rgb, mask

To be clear: I am not doing this resizing in the DataLoader class. I resize the images in my dataset itself, before training.
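
As a usage sketch (the filenames are hypothetical; with 1280x720 inputs and w = h = 550, this produces a 550x310 image padded to 550x550):

rgb = cv2.imread('render_0001.png')                       # hypothetical file
raw_mask = cv2.imread('mask_0001.png', cv2.IMREAD_GRAYSCALE)
rgb_550, mask_550 = resize(rgb, raw_mask, 550, 550)       # 1280x720 -> 550x310 -> padded to 550x550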

Zhang-O commented 4 years ago

I also suggest mask dilation of long vertical objects with cv2.dilate. As they appear wider, the confidence stays consistent across orientations, and the masks can later be eroded with cv2.erode. I am getting significantly better results with this approach.

Could I ask how you revise the mask annotations of your training data when you change your training images using cv2.dilate or cv2.erode?

Zhang-O commented 4 years ago

@mateoKutnjak OK. Today I tried another idea: I set the anchor aspect ratios to 1:4, 1:1, 4:1, and it succeeded. I will try your idea too. Thanks.

What about 1:3, 1:1, 3:1?

mateoKutnjak commented 4 years ago

I also suggest mask dilation of long vertical objects with cv2.dilate. As they appear wider, the confidence stays consistent across orientations, and the masks can later be eroded with cv2.erode. I am getting significantly better results with this approach.

Could I ask how you revise the mask annotations of your training data when you change your training images using cv2.dilate or cv2.erode?

Extract the mask of the thin vertical object, where the background equals zero and the object mask is greater than 0. Perform cv2.dilate on this mask. Now you have a new raw mask. Use pycocotools to convert this mask to COCO format. Follow this guide: https://www.immersivelimit.com/tutorials/create-coco-annotations-from-scratch

https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_morphological_ops/py_morphological_ops.html

Feed this mask to the network as ground truth and train the network.

Optionally, when training is finished and you are doing inference only, you can erode the predicted masks, because a model trained on dilated masks will produce dilated predictions. In my use case it was not necessary to perform the erosion, which is why I say it is optional.
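
A minimal sketch of that recipe, assuming pycocotools is installed (raw_mask is a hypothetical grayscale mask with background == 0; kernel shape and iterations are assumptions):

import cv2
import numpy as np
from pycocotools import mask as mask_utils

binary = (raw_mask > 0).astype(np.uint8)       # object pixels > 0, background == 0
kernel = np.ones((3, 7), np.uint8)             # spreads the thin mask mostly sideways
dilated = cv2.dilate(binary, kernel, iterations=2)

# encode as COCO RLE; pycocotools expects a Fortran-ordered uint8 array
rle = mask_utils.encode(np.asfortranarray(dilated))
area = float(mask_utils.area(rle))
bbox = mask_utils.toBbox(rle).tolist()         # [x, y, width, height] for the annotation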

BartvanMarrewijk commented 4 years ago

I tried the above-mentioned suggestions in four different scenarios. My input images are 1280x720, except in the resized scenarios. The best result was obtained by only changing the anchor aspect ratios. That ratio was based on a rough guess; I am still looking for an optimal way to determine the aspect ratios. k-means clustering did not work because of outliers. Did anyone try to optimise these parameters? (A sketch of a more outlier-robust k-means follows the tables below.)

Normal images with normal aspect ratios:

       |  all  |  .50  |  .55  |  .60  |  .65  |  .70  |  .75  |  .80  |  .85  |  .90  |  .95  |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
box    | 16.42 | 44.07 | 35.91 | 29.89 | 25.04 | 15.77 |  8.10 |  3.90 |  1.50 |  0.01 |  0.00 |
mask   |  7.77 | 24.70 | 19.62 | 15.58 |  8.66 |  5.87 |  3.15 |  0.11 |  0.00 |  0.00 |  0.00 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+

Normal images with aspect ratios 0.1, 0.5, 1:

       |  all  |  .50  |  .55  |  .60  |  .65  |  .70  |  .75  |  .80  |  .85  |  .90  |  .95  |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
box    | 25.25 | 57.82 | 53.88 | 44.92 | 39.15 | 30.71 | 19.17 |  4.89 |  1.65 |  0.29 |  0.00 |
mask   | 15.64 | 42.77 | 39.41 | 32.04 | 22.22 | 13.09 |  5.52 |  1.31 |  0.00 |  0.00 |  0.00 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+

Resized and padded images (similar to @mateoKutnjak's method) with normal aspect ratios:

       |  all  |  .50  |  .55  |  .60  |  .65  |  .70  |  .75  |  .80  |  .85  |  .90  |  .95  |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
box    | 18.69 | 39.36 | 36.44 | 32.55 | 29.05 | 22.73 | 13.95 |  7.99 |  4.27 |  0.59 |  0.00 |
mask   |  3.09 | 11.53 |  8.67 |  5.21 |  3.92 |  1.23 |  0.28 |  0.02 |  0.00 |  0.00 |  0.00 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+

Resized and padded images (similar to @mateoKutnjak's method) with aspect ratios 0.1, 0.5, 1:

       |  all  |  .50  |  .55  |  .60  |  .65  |  .70  |  .75  |  .80  |  .85  |  .90  |  .95  |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
box    | 26.61 | 53.97 | 50.93 | 44.19 | 35.90 | 32.36 | 25.74 | 15.03 |  5.50 |  2.44 |  0.00 |
mask   |  8.88 | 27.31 | 21.80 | 19.67 | 12.77 |  5.58 |  1.65 |  0.00 |  0.00 |  0.00 |  0.00 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
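
In case anyone wants to retry the anchor optimisation: a rough sketch (not from this thread) of the usual YOLOv2-style IoU k-means over ground-truth box sizes, using the per-cluster median instead of the mean to blunt the outliers mentioned above. All names here are hypothetical, not part of yolact:

import numpy as np

def iou_wh(boxes, centroids):
    # IoU between (n, 2) boxes and (k, 2) centroids given as (w, h), ignoring position
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=3, iters=100, seed=0):
    # boxes: (n, 2) array of ground-truth (w, h), e.g. scaled to the network input size
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = iou_wh(boxes, centroids).argmax(axis=1)
        new = []
        for j in range(k):
            members = boxes[assign == j]
            # median, not mean: one extreme pole-shaped box cannot drag the whole cluster
            new.append(np.median(members, axis=0) if len(members) else centroids[j])
        centroids = np.array(new)
    return centroids  # aspect ratios for config.py: centroids[:, 0] / centroids[:, 1]

The (w, h) pairs can be collected straight from the bbox fields of the COCO annotations.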