Open mateoKutnjak opened 4 years ago
@mateoKutnjak @dbolya I am also facing issues when I try to detect vertical objects such as poles or vertical lines along the y axis. Is there any explanation for this, @dbolya?
Very interesting, and this could actually be fairly important. What's the aspect ratio of your images? That might have something to do with it.
Also as an additional tip, you probably want to use --cross_nms=True in eval.py for your use-case.
@dbolya I was training with images of resolution 512x640, and the detection of vertical objects was not consistent. I trained on a 4k train/val set using a ResNet-50 feature backbone for 5 lakh (500,000) iterations.
@dbolya Image aspect ratio is 16:9 (1280x720). Maybe some of the following configurations are relevant:
'preserve_aspect_ratio': False,
'pred_aspect_ratios': [ [[1, 1/2, 2]] ]*5,
'pred_scales': [[i * 2 ** (j / 3.0) for j in range(3)] for i in [24, 48, 96, 192, 384]]
@dbolya any idea how we can debug this behaviour?
It is the only flaw of the architecture, but because of it the architecture is not suitable for my use-case. I think it can be solved with changes inside config.py. @dbolya
@mateoKutnjak I believe the issue is that it's resizing the images to squares, so you're reducing the vertical resolution by a lot. I'll be adding support for fixed non-square images in a week or two, and that should fix this.
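To make the effect of square resizing concrete, here is a small back-of-the-envelope sketch. It assumes the 1280x720 input and the 550x550 network size mentioned in this thread. The 40x400 px pole is a made-up example and squash_to_square is a hypothetical helper, not part of YOLACT.

```python
def squash_to_square(obj_w, obj_h, img_w, img_h, size=550):
    """Size of an object after the whole image is stretched to size x size."""
    sx = size / img_w   # horizontal scale factor
    sy = size / img_h   # vertical scale factor
    return obj_w * sx, obj_h * sy


# Hypothetical 40x400 px pole (w/h = 0.100) in a 1280x720 frame.
w, h = squash_to_square(40, 400, 1280, 720)
print(f"resized pole: {w:.1f} x {h:.1f} px, w/h = {w / h:.3f}")
```

Under these assumptions the two axes are scaled by different factors, so the pole's width-to-height ratio changes and moves further away from the ratios 1, 1/2 and 2 listed in the config quoted above.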
@dbolya I am also having this issue when trying to detect objects like poles. How will using fixed non-square images resolve it? Can you please explain? And if it solves the issue for vertical objects, will it affect the detection of other objects?
@dbolya @abhigoku10 I gained a significant performance boost when I resized the input images from 1280x720 to 550x310, keeping the aspect ratio unchanged (16:9). The rest of the input is padded with zeros (top and bottom of the RGB image).
Evaluation shows a confidence of 0.40 for the vertical object that previously had 0.05 confidence after 50000 steps. Some hyperparameter adjustments could still be made to fine-tune network performance, but the results are acceptable for my use-case.
UPDATE: These are evaluation results after 104000 steps (synthetic dataset created on principles of domain randomization)
| all | .50 | .55 | .60 | .65 | .70 | .75 | .80 | .85 | .90 | .95 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
box | 83.24 | 99.91 | 98.93 | 98.93 | 98.93 | 95.46 | 94.32 | 91.25 | 85.16 | 63.39 | 6.15 |
mask | 74.72 | 91.56 | 90.06 | 88.67 | 86.02 | 79.68 | 79.11 | 74.94 | 64.90 | 53.05 | 39.20 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
When inference is tested on a real camera stream, long vertical objects are found with a confidence of around 0.70.
@mateoKutnjak this result is good. Could you please elaborate on the changes you made to achieve it? That would be very helpful to me. Thanks in advance.
I resized the input images from 1280x720 to 550x310 (keeping the aspect ratio constant) and padded the rest of the image with zeros to get a final image dimension of 550x550. I am inheriting yolact_plus_base_config with resnet101_dcn_inter_backbone. I have also set 'use_maskiou'=True, 'discard_mask_area'=-1 and 'use_mask_scoring'=True.
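The numbers in this description can be reproduced with a few lines of arithmetic. This is only a sketch: letterbox_geometry is a hypothetical helper name, centred padding is an assumption, and the rounding to an even size is borrowed from the resize function posted later in this thread.

```python
def letterbox_geometry(img_w, img_h, size=550):
    """Return (new_w, new_h, pad_left, pad_top) for an aspect-preserving
    resize into a size x size canvas with centred zero padding."""
    scale = min(size / img_w, size / img_h)
    # Round to the nearest even number so the padding splits evenly.
    new_w = round(img_w * scale / 2) * 2
    new_h = round(img_h * scale / 2) * 2
    pad_left = (size - new_w) // 2
    pad_top = (size - new_h) // 2
    return new_w, new_h, pad_left, pad_top


print(letterbox_geometry(1280, 720))  # (550, 310, 0, 120)
```

For a 1280x720 frame this gives a 550x310 image with 120 rows of zeros above and below it.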
resized input images from 1280x720 to 550x310 (keeping the aspect ratio constant) and padded the rest of the image with zeros to get final image dimension of 550x550.
Is this manual process any better than having yolact resize the images with keepAspectRatio=True?
Thank you for sharing your research on this topic!
I tried keepAspectRatio=True with 1280x720 images and did not find any improvement.
I also suggest dilating the masks of long vertical objects with cv2.dilate. As they appear wider, confidence stays consistent across orientations, and the predicted masks can later be eroded with cv2.erode. I am getting significantly better results with this approach.
@dbolya @mateoKutnjak I am facing the error below when I train on vertical objects only with YOLACT++, but not when I train with YOLACT. I have also double-checked the data annotations.
Multiple GPUs detected! Turning off JIT.
Per-GPU batch size is less than the recommended limit for batch norm. Disabling batch norm.
loading annotations into memory...
Done (t=0.07s)
creating index...
index created!
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
Initializing weights...
Begin training!
[ 0] 0 || B: 5.665 | C: 11.077 | M: 7.510 | S: 1.452 | T: 25.704 || ETA: 13 days, 8:32:46 || timer: 5.770
[ 0] 10 || B: 5.379 | C: 6.902 | M: 6.530 | S: 1.207 | T: 20.018 || ETA: 2 days, 3:23:00 || timer: 0.407
[ 0] 20 || B: 5.169 | C: 5.997 | M: 5.993 | S: 0.855 | I: 0.042 | T: 18.056 || ETA: 1 day, 13:52:03 || timer: 0.409
[ 0] 30 || B: 4.984 | C: 5.593 | M: 5.372 | S: 0.612 | I: 0.029 | T: 16.590 || ETA: 1 day, 9:05:07 || timer: 0.410
[ 0] 40 || B: 4.929 | C: 5.291 | M: 5.084 | S: 0.480 | I: 0.022 | T: 15.806 || ETA: 1 day, 6:37:00 || timer: 0.418
[ 0] 50 || B: 4.853 | C: 5.052 | M: 4.879 | S: 0.397 | I: 0.017 | T: 15.198 || ETA: 1 day, 5:22:15 || timer: 0.414
[ 0] 60 || B: 4.795 | C: 4.847 | M: 4.715 | S: 0.339 | I: 0.014 | T: 14.710 || ETA: 1 day, 4:27:56 || timer: 0.412
[ 0] 70 || B: 4.800 | C: 4.672 | M: 4.622 | S: 0.297 | I: 0.012 | T: 14.404 || ETA: 1 day, 3:49:32 || timer: 0.433
[ 0] 80 || B: 4.789 | C: 4.527 | M: 4.525 | S: 0.267 | I: 0.011 | T: 14.119 || ETA: 1 day, 3:16:48 || timer: 0.415
[ 0] 90 || B: 4.774 | C: 4.388 | M: 4.485 | S: 0.243 | I: 0.010 | T: 13.900 || ETA: 1 day, 2:53:04 || timer: 0.420
[ 0] 100 || B: 4.752 | C: 4.208 | M: 4.384 | S: 0.210 | I: 0.009 | T: 13.563 || ETA: 1 day, 2:31:43 || timer: 0.426
[ 0] 110 || B: 4.705 | C: 3.876 | M: 4.147 | S: 0.097 | I: 0.002 | T: 12.827 || ETA: 1 day, 2:17:59 || timer: 0.429
[ 0] 120 || B: 4.651 | C: 3.685 | M: 3.991 | S: 0.055 | I: 0.001 | T: 12.382 || ETA: 1 day, 2:08:11 || timer: 0.439
[ 0] 130 || B: 4.634 | C: 3.506 | M: 3.957 | S: 0.049 | I: 0.000 | T: 12.147 || ETA: 1 day, 2:00:47 || timer: 0.442
[ 0] 140 || B: 4.624 | C: 3.365 | M: 3.917 | S: 0.045 | I: 0.000 | T: 11.951 || ETA: 1 day, 1:53:51 || timer: 0.430
[ 0] 150 || B: 4.611 | C: 3.242 | M: 3.907 | S: 0.044 | I: 0.000 | T: 11.804 || ETA: 1 day, 1:47:11 || timer: 0.435
[ 0] 160 || B: 4.626 | C: 3.147 | M: 3.900 | S: 0.044 | I: 0.000 | T: 11.717 || ETA: 1 day, 1:40:45 || timer: 0.424
[ 0] 170 || B: 4.588 | C: 3.076 | M: 3.895 | S: 0.044 | I: 0.000 | T: 11.603 || ETA: 1 day, 1:36:47 || timer: 0.439
[ 0] 180 || B: 4.581 | C: 3.027 | M: 3.912 | S: 0.043 | I: 0.000 | T: 11.563 || ETA: 1 day, 1:32:59 || timer: 0.446
[ 0] 190 || B: 4.559 | C: 2.992 | M: 3.865 | S: 0.042 | I: 0.000 | T: 11.458 || ETA: 1 day, 1:29:08 || timer: 0.446
[ 0] 200 || B: 4.543 | C: 2.958 | M: 3.825 | S: 0.041 | I: 0.000 | T: 11.367 || ETA: 1 day, 1:25:36 || timer: 0.455
Traceback (most recent call last):
File "train.py", line 504, in
@abhigoku10 I have the same problem when I train YOLACT++ with the COCO dataset. Did you fix it?
@qjziyou yes, I was able to fix it for my custom dataset by following #259. You should not be getting this error with the COCO dataset, though.
@abhigoku10 Thanks for your reply, #259 solves my problem.
@dbolya @mateoKutnjak I have the same question. My images are about (600, 600), so the h:w ratio is nearly 1, and they are resized to (550, 550).
I am inheriting yolact_plus_base_config with resnet50_dcn_inter_backbone. I don't think setting 'use_maskiou'=True, 'discard_mask_area'=-1 and 'use_mask_scoring'=True is useful here, because mask scoring (as in Mask Scoring R-CNN) is meant to improve mask quality, whereas in the image above the object is not detected at all.
Is there a good solution?
@elfpattern Try widening the mask in preprocessing (with kernel operations) and (optionally) narrowing it in postprocessing. This way confidence will be greater and the object will be detected more easily.
@mateoKutnjak but widening the mask in preprocessing will still result in one more detection at the top right corner, right? How can this be minimized?
If you want to find the vertical crack more accurately, you should widen the mask of the vertical object and not the others. For the top right corner mask, I am not sure what the expected prediction is, so I cannot say what you should do about that.
@mateoKutnjak okay, but did you mean widening the mask in postprocessing at inference or in preprocessing for training? I am asking since I am trying to detect poles but have not been successful.
Widen mask of vertical object in ground truth and feed it to the model
@mateoKutnjak Got it, I will try, but what is the intention?
Easier detection and greater confidence. I was using this method to solve the issue of detecting the tick on a pressure gauge (when the tick was in a vertical position, confidence dropped significantly and the object could not be detected below some threshold).
@mateoKutnjak OK, today I tried another idea: I set the anchors to 1:4, 1:1, 4:1, and it succeeded. I will try your idea too, thanks.
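To see why anchors such as 1:4 can help with thin objects, the sketch below compares anchor shapes for the ratios 1, 1/2, 2 from the config quoted earlier with the 1:4, 1:1, 4:1 set. It assumes the common convention that a ratio means width/height and that w = s * sqrt(ratio), h = s / sqrt(ratio) for a scale s. Whether YOLACT's prior generation follows exactly this convention, or whether a given config forces square anchors, should be checked in the repository. anchor_shapes is a hypothetical helper.

```python
from math import sqrt


def anchor_shapes(scale, aspect_ratios):
    """(width, height) of anchors with area scale**2, ratio = width / height."""
    return [(scale * sqrt(ar), scale / sqrt(ar)) for ar in aspect_ratios]


# Scale 96 is one of the base scales listed in the quoted 'pred_scales'.
for name, ratios in [("1, 1/2, 2", [1, 1 / 2, 2]),
                     ("1:4, 1:1, 4:1", [1 / 4, 1, 4])]:
    shapes = anchor_shapes(96, ratios)
    print(name, [(round(w), round(h)) for w, h in shapes])
```

Under this convention the 1:4 anchor at scale 96 is 48x192 px, which overlaps a pole-like box much better than the narrowest anchor of the first set.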
@mateoKutnjak @dbolya Hi, thank you for your suggestion about resizing and zero padding the image.
I tried your idea and modified class Resize(object) in augmentation.py. This is my code.
class Resize(object):
    """ Resize and pad with zeros to get a square image of size [max_dim, max_dim] """

    @staticmethod
    def calc_size_preserve_ar(img_w, img_h, max_size):
        # Does it exceed max dim?
        img_max = max(img_h, img_w)
        scale = max_size / img_max
        w = img_w * scale
        h = img_h * scale
        return int(w), int(h)

    def __init__(self, resize_gt=True):
        self.resize_gt = resize_gt
        self.max_size = cfg.max_size

    def __call__(self, image, masks, boxes, labels=None):
        img_h, img_w, depth = image.shape
        width, height = Resize.calc_size_preserve_ar(img_w, img_h, self.max_size)
        image = cv2.resize(image, (width, height))

        top_pad = random.uniform(0, (self.max_size - height) // 2)
        left_pad = random.uniform(0, (self.max_size - width) // 2)

        expand_image = np.zeros(
            (int(self.max_size), int(self.max_size), depth),
            dtype=image.dtype)
        expand_image[int(top_pad):int(top_pad + height),
                     int(left_pad):int(left_pad + width)] = image
        image = expand_image

        if self.resize_gt:
            masks = masks.transpose((1, 2, 0))
            masks = cv2.resize(masks, (width, height))

            # OpenCV resizes a (w,h,1) array to (s,s), so fix that
            if len(masks.shape) == 2:
                masks = np.expand_dims(masks, 0)
            else:
                masks = masks.transpose((2, 0, 1))

            expand_masks = np.zeros(
                (masks.shape[0], int(self.max_size), int(self.max_size)),
                dtype=masks.dtype)
            expand_masks[:, int(top_pad):int(top_pad + height),
                         int(left_pad):int(left_pad + width)] = masks
            masks = expand_masks

            # extract boxes from masks
            boxes = boxes.copy()
            boxes = extract_bboxes(masks)

        # Discard boxes that are smaller than we'd like
        w = boxes[:, 2] - boxes[:, 0]
        h = boxes[:, 3] - boxes[:, 1]
        keep = (w > cfg.discard_box_width) * (h > cfg.discard_box_height)
        masks = masks[keep]
        boxes = boxes[keep]
        labels['labels'] = labels['labels'][keep]
        labels['num_crowds'] = (labels['labels'] < 0).sum()

        return image, masks, boxes, labels
But I think there's a bug in my code because I get this weird val mAP. I suspect it's somewhere when self.resize_gt=False, since it affects class BaseTransform(object), a transform to be used when evaluating.
Calculating mAP...
| all | .50 | .55 | .60 | .65 | .70 | .75 | .80 | .85 | .90 | .95 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
box | 1.07 | 4.73 | 2.85 | 1.55 | 0.83 | 0.43 | 0.20 | 0.12 | 0.02 | 0.00 | 0.00 |
mask | 0.39 | 1.75 | 1.03 | 0.60 | 0.31 | 0.15 | 0.07 | 0.02 | 0.00 | 0.00 | 0.00 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
Could you share your code? Or maybe do you have any suggestions to fix my code?
Thank you.
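One possible cause, not confirmed anywhere in this thread: if images are letterboxed before the network but predictions are mapped back to the original image as if a plain stretch had been used, boxes and masks will be offset and mAP will collapse. The sketch below shows the inverse mapping for box coordinates. It assumes centred padding (the Resize class above uses random offsets, which would have to be fixed or recorded at evaluation time), and unletterbox_box is a hypothetical helper.

```python
def unletterbox_box(box, img_w, img_h, size=550):
    """Map (x1, y1, x2, y2) from the padded size x size frame back to the
    original img_w x img_h image, assuming centred zero padding."""
    scale = min(size / img_w, size / img_h)
    pad_left = (size - img_w * scale) / 2
    pad_top = (size - img_h * scale) / 2
    x1, y1, x2, y2 = box
    return ((x1 - pad_left) / scale, (y1 - pad_top) / scale,
            (x2 - pad_left) / scale, (y2 - pad_top) / scale)


# The full content area of a letterboxed 1280x720 frame maps back to the
# full original image.
print(unletterbox_box((0, 120.3125, 550, 429.6875), 1280, 720))
```

The same offsets and scale would apply to masks: crop the padded region first, then resize the crop to the original resolution.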
@mateoKutnjak OK, today I tried another idea: I set the anchors to 1:4, 1:1, 4:1, and it succeeded. I will try your idea too, thanks.
Did you generate the anchors only for long vertical objects, or did the training set also contain other, wider objects?
@VinniaKemala My dataset generation consists of Blender rendering and converting the raw masks and RGB images to COCO format. Here is my code for resizing the raw RGB image and mask before the further conversion to COCO format with pycocotools:
import cv2
import numpy as np

def resize(rgb, mask, w, h):
    # Scale factors needed to fit the image into a (h, w) canvas
    original_h = rgb.shape[0]
    original_w = rgb.shape[1]
    percent_decrease_w = w/rgb.shape[1]
    percent_decrease_h = h/rgb.shape[0]
    # Use the smaller factor so the aspect ratio is preserved
    min_decrease = min(percent_decrease_h, percent_decrease_w)
    # Round to even sizes so the padding splits evenly on both sides
    new_w = round(original_w*min_decrease / 2)*2
    new_h = round(original_h*min_decrease / 2)*2
    rgb = cv2.resize(rgb, (new_w, new_h), interpolation=cv2.INTER_AREA)
    mask = cv2.resize(mask, (new_w, new_h), interpolation=cv2.INTER_AREA)
    # Zero-pad symmetrically up to the target size
    vertical_pad = max(0, h-new_h) // 2
    horizontal_pad = max(0, w-new_w) // 2
    rgb = np.pad(rgb, ((vertical_pad, vertical_pad), (horizontal_pad, horizontal_pad), (0, 0)),
                 mode='constant', constant_values=0)
    mask = np.pad(mask, ((vertical_pad, vertical_pad), (horizontal_pad, horizontal_pad)),
                  mode='constant', constant_values=0)
    return rgb, mask
I have to repeat that I am not doing resizing in DataLoader class. I am resizing images in my dataset.
I also suggest dilating the masks of long vertical objects with cv2.dilate. As they appear wider, confidence stays consistent across orientations, and the predicted masks can later be eroded with cv2.erode. I am getting significantly better results with this approach.
Could I ask how you revise the mask annotations of your training data when you change your training images using cv2.dilate or cv2.erode?
@mateoKutnjak OK, today I tried another idea: I set the anchors to 1:4, 1:1, 4:1, and it succeeded. I will try your idea too, thanks.
What about 1:3, 1:1, 3:1?
Extract the mask of the thin vertical object, where the background is equal to zero and the object mask is greater than 0. Perform cv2.dilate on this mask. Now you have a new raw mask. Use pycocotools to turn this mask into COCO format. Follow this guide: https://www.immersivelimit.com/tutorials/create-coco-annotations-from-scratch
Feed this mask as ground truth to the network and train the network.
Optionally, when training is finished and you are doing inference only, you can erode the mask obtained from inference, because a network trained on dilated masks will produce dilated predictions. In my use case it was not necessary to perform erosion. This is why I say it is optional.
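The dilation step described above can be illustrated on a toy mask. The sketch is written in plain NumPy so it runs without OpenCV. With OpenCV, cv2.dilate with a wide rectangular kernel would be the usual call (and cv2.erode for the optional inverse at inference). dilate_horizontally and the radius of 2 px are illustrative assumptions, not values from this thread.

```python
import numpy as np


def dilate_horizontally(mask, radius):
    """Binary dilation with a 1 x (2*radius + 1) horizontal kernel."""
    mask = mask.astype(bool)
    out = mask.copy()
    for shift in range(1, radius + 1):
        out[:, shift:] |= mask[:, :-shift]   # grow to the right
        out[:, :-shift] |= mask[:, shift:]   # grow to the left
    return out.astype(np.uint8)


mask = np.zeros((6, 9), dtype=np.uint8)
mask[1:5, 4] = 1                 # a one-pixel-wide vertical object
wide = dilate_horizontally(mask, 2)
print(wide)
```

Since the object's outline changes, the bounding box stored in the COCO annotation presumably has to be recomputed from the dilated mask as well.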
I tried the suggestions mentioned above in four different scenarios. My input images are 1280x720 except in the resized scenario. The best result was obtained by changing only the aspect ratio. This ratio was based on a rough guess. I am still looking for an optimal way to determine the aspect ratio. k-means clustering did not work because of outliers. Has anyone tried to optimise these parameters?
Normal images with normal aspect ratio:
| all | .50 | .55 | .60 | .65 | .70 | .75 | .80 | .85 | .90 | .95 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
box | 16.42 | 44.07 | 35.91 | 29.89 | 25.04 | 15.77 | 8.10 | 3.90 | 1.50 | 0.01 | 0.00 |
mask | 7.77 | 24.70 | 19.62 | 15.58 | 8.66 | 5.87 | 3.15 | 0.11 | 0.00 | 0.00 | 0.00 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
Normal images with aspect ratio 0.1 0.5 1:
| all | .50 | .55 | .60 | .65 | .70 | .75 | .80 | .85 | .90 | .95 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
box | 25.25 | 57.82 | 53.88 | 44.92 | 39.15 | 30.71 | 19.17 | 4.89 | 1.65 | 0.29 | 0.00 |
mask | 15.64 | 42.77 | 39.41 | 32.04 | 22.22 | 13.09 | 5.52 | 1.31 | 0.00 | 0.00 | 0.00 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
Resized images and padding (similar to the method of @mateoKutnjak) with normal aspect ratio:
| all | .50 | .55 | .60 | .65 | .70 | .75 | .80 | .85 | .90 | .95 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
box | 18.69 | 39.36 | 36.44 | 32.55 | 29.05 | 22.73 | 13.95 | 7.99 | 4.27 | 0.59 | 0.00 |
mask | 3.09 | 11.53 | 8.67 | 5.21 | 3.92 | 1.23 | 0.28 | 0.02 | 0.00 | 0.00 | 0.00 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
Resized images and padding (similar to the method of @mateoKutnjak) with aspect ratio 0.1 0.5 1:
| all | .50 | .55 | .60 | .65 | .70 | .75 | .80 | .85 | .90 | .95 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
box | 26.61 | 53.97 | 50.93 | 44.19 | 35.90 | 32.36 | 25.74 | 15.03 | 5.50 | 2.44 | 0.00 |
mask | 8.88 | 27.31 | 21.80 | 19.67 | 12.77 | 5.58 | 1.65 | 0.00 | 0.00 | 0.00 | 0.00 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
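On the open question of how to pick aspect ratios when k-means is thrown off by outliers, one untested starting point is to look at percentiles of the ground-truth box ratios instead of cluster centres, since percentiles ignore a few extreme boxes. This is a suggestion, not something verified in this thread. ratio_percentiles and the box sizes below are made up for illustration.

```python
def ratio_percentiles(boxes_wh, percentiles=(10, 50, 90)):
    """Pick width/height ratios at the given percentiles of the dataset."""
    ratios = sorted(w / h for w, h in boxes_wh if w > 0 and h > 0)
    picks = []
    for p in percentiles:
        idx = p * (len(ratios) - 1) // 100   # integer index, no interpolation
        picks.append(ratios[idx])
    return picks


# Hypothetical (width, height) pairs; the last box is an extreme outlier.
boxes = [(10, 100), (12, 96), (20, 100), (50, 50), (100, 50), (300, 10)]
print(ratio_percentiles(boxes))  # [0.1, 0.2, 2.0]
```

The resulting values could then be tried as anchor aspect ratios and compared against a hand-picked set such as 0.1 0.5 1 using the same mAP evaluation as above.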
Hi. Thank you for your work. Speed and accuracy of YOLACT++ is really amazing.
I have a problem detecting a long, vertical object. Detection is not an issue when the object is oriented differently, but the network has trouble detecting it when it is aligned vertically with the y axis of the camera.
I believe it is due to configurations in config.py but am unable to find the necessary changes to address the issue.
Please provide your thoughts on this behavior. Thank you in advance.
EDIT: This is the result when the object is vertical. It has a confidence barely above 0.0, whereas in other orientations the confidence is more than 0.50 (these metrics are after 10000 training steps).