dbolya / yolact

A simple, fully convolutional model for real-time instance segmentation.
MIT License

Detection and segmentation of many small objects #613

Open susanin1970 opened 3 years ago

susanin1970 commented 3 years ago

Hello!

Thanks for this great repo :)

In the article and on the main page of the repository, there are examples of using YOLACT on images that contain a small number of objects

I tried to train YOLACT on fully annotated images that contain hundreds of objects, as in the example below:

[image: 00310]

For training I use an RTX 2080 Ti GPU. When I start training, sometimes the following happens:

(TorchEnvLpr) PS C:\Users\Reutov\Repository\yolact> python train.py --config=yolact_resnet50_grans_config --batch_size=1 --num_workers=0
Scaling parameters by 0.12 to account for a batch size of 1.
Per-GPU batch size is less than the recommended limit for batch norm. Disabling batch norm.
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
C:\Users\Reutov\.conda\envs\TorchEnvLpr\lib\site-packages\torch\jit\_recursive.py:182: UserWarning: 'lat_layers' was found in ScriptModule constants,  but it is a non-constant submodule. Consider removing it.
  " but it is a non-constant {}. Consider removing it.".format(name, hint))
C:\Users\Reutov\.conda\envs\TorchEnvLpr\lib\site-packages\torch\jit\_recursive.py:182: UserWarning: 'pred_layers' was found in ScriptModule constants,  but it is a non-constant submodule. Consider removing it.
  " but it is a non-constant {}. Consider removing it.".format(name, hint))
C:\Users\Reutov\.conda\envs\TorchEnvLpr\lib\site-packages\torch\jit\_recursive.py:182: UserWarning: 'downsample_layers' was found in ScriptModule constants,  but it is a non-constant submodule. Consider removing it.
  " but it is a non-constant {}. Consider removing it.".format(name, hint))
Initializing weights...
Begin training!

(TorchEnvLpr) PS C:\Users\Reutov\Repository\yolact>

The script simply returns to the prompt; the training doesn't begin at all.

Sometimes I get an OpenCV error:

(TorchEnvLpr) PS C:\Users\Reutov\Repository\yolact> python train.py --config=yolact_resnet50_grans_config --batch_size=1 --num_workers=0
Scaling parameters by 0.12 to account for a batch size of 1.
Per-GPU batch size is less than the recommended limit for batch norm. Disabling batch norm.
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
C:\Users\Reutov\.conda\envs\TorchEnvLpr\lib\site-packages\torch\jit\_recursive.py:182: UserWarning: 'lat_layers' was found in ScriptModule constants,  but it is a non-constant submodule. Consider removing it.
  " but it is a non-constant {}. Consider removing it.".format(name, hint))
C:\Users\Reutov\.conda\envs\TorchEnvLpr\lib\site-packages\torch\jit\_recursive.py:182: UserWarning: 'pred_layers' was found in ScriptModule constants,  but it is a non-constant submodule. Consider removing it.
  " but it is a non-constant {}. Consider removing it.".format(name, hint))
C:\Users\Reutov\.conda\envs\TorchEnvLpr\lib\site-packages\torch\jit\_recursive.py:182: UserWarning: 'downsample_layers' was found in ScriptModule constants,  but it is a non-constant submodule. Consider removing it.
  " but it is a non-constant {}. Consider removing it.".format(name, hint))
Initializing weights...
Begin training!

OpenCV Error: Assertion failed (dsize.area() > 0) in cv::hal::resize, file C:\ci\opencv_1512688052760\work\modules\imgproc\src\resize.cpp, line 2961
Traceback (most recent call last):
  File "train.py", line 504, in <module>
    train()
  File "train.py", line 270, in train
    for datum in data_loader:
  File "C:\Users\Reutov\.conda\envs\TorchEnvLpr\lib\site-packages\torch\utils\data\dataloader.py", line 435, in __next__
    data = self._next_data()
  File "C:\Users\Reutov\.conda\envs\TorchEnvLpr\lib\site-packages\torch\utils\data\dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "C:\Users\Reutov\.conda\envs\TorchEnvLpr\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Users\Reutov\.conda\envs\TorchEnvLpr\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Users\Reutov\Repository\yolact\data\coco.py", line 94, in __getitem__
    im, gt, masks, h, w, num_crowds = self.pull_item(index)
  File "C:\Users\Reutov\Repository\yolact\data\coco.py", line 159, in pull_item
    {'num_crowds': num_crowds, 'labels': target[:, 4]})
  File "C:\Users\Reutov\Repository\yolact\utils\augmentations.py", line 688, in __call__
    return self.augment(img, masks, boxes, labels)
  File "C:\Users\Reutov\Repository\yolact\utils\augmentations.py", line 55, in __call__
    img, masks, boxes, labels = t(img, masks, boxes, labels)
  File "C:\Users\Reutov\Repository\yolact\utils\augmentations.py", line 158, in __call__
    masks = cv2.resize(masks, (width, height))
cv2.error: C:\ci\opencv_1512688052760\work\modules\imgproc\src\resize.cpp:2961: error: (-215) dsize.area() > 0 in function cv::hal::resize

I assume there is a limit on the number of objects per frame that YOLACT can be trained on. I also suspect a problem in the annotations. But I want to ask: how can I solve this problem, and how effective is YOLACT at detecting and segmenting many small objects in images?

Thanks in advance for your answer :)

fanweiya commented 3 years ago

I have the same problem. How did you solve it?

susanin1970 commented 3 years ago

No, I haven't solved it yet, but I'm going to in the future. For now I'm trying the Detectron2 library for this task.

tehkillerbee commented 3 years ago

I have experienced the same issue when training with datasets containing many (>200) small objects in the same image. The training begins, but eventually the same OpenCV error is returned. Unfortunately, I have not found a solution.

InvincibleKnight commented 3 years ago

Hi @susanin1970, @fanweiya, @tehkillerbee I am planning to work on something similar. Is there a workaround?

@susanin1970 did the Detectron2 library work for you?

Thank you!

tehkillerbee commented 3 years ago

@InvincibleKnight I ended up using mmdetection and was able to train and run inference on a dataset containing many small objects (although I also had some challenges)

susanin1970 commented 3 years ago

@InvincibleKnight, hi! The Detectron2 library worked for me, but there were some nuances.

I used the config COCO-InstanceSegmentation/mask_rcnn_X_101_32x8d_FPN_3x.yaml for training and changed it in the following way:

from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_X_101_32x8d_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("gorokh_train",) # here is the example of name of train set
cfg.DATASETS.TEST = ("gorokh_val",) # here is the example of name of test set
cfg.DATALOADER.NUM_WORKERS = 2
# Let training initialize from model zoo
# cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_X_101_32x8d_FPN_3x.yaml")  
cfg.MODEL.WEIGHTS = "/content/drive/MyDrive/trash/model_0004999.pth" 
cfg.SOLVER.IMS_PER_BATCH = 1
cfg.SOLVER.BASE_LR = 0.0001 
cfg.SOLVER.MAX_ITER = 30000  
cfg.SOLVER.STEPS = [] 
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 256   # default: 512
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1
cfg.TEST.DETECTIONS_PER_IMAGE = 3000
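
For context, registering the datasets and launching training with this cfg can be sketched as follows (the annotation and image paths here are placeholders for your own COCO-format data, not the ones I actually used):

from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Placeholder paths: point these at your own COCO-format annotations and image folders
register_coco_instances("gorokh_train", {}, "annotations/train.json", "images/train")
register_coco_instances("gorokh_val", {}, "annotations/val.json", "images/val")

# Registration must happen before the trainer resolves cfg.DATASETS.TRAIN/TEST
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()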

I trained Mask-RCNN on two different datasets. One of them contained frames annotated in CVAT, identical to those shown in the first post; one frame contains about 1000 objects (in fact, they are peas). The other dataset contains frames annotated in CVAT with several dozen objects at most (e.g. cat litter).

I trained Mask-RCNN for ~30k iterations in Google Colab (though I was recently able to install Detectron2 on Win10) on the cat litter dataset, and ran inference on test frames and a test video. Mask-RCNN segmented almost all objects in the frames:

[images: inference results on the cat litter dataset]

But inference on the video took ~10-15 minutes. I measured the processing time of one frame: it took 1500 milliseconds on average, which I think is clearly not enough for real-time work, for example. Maybe the measurements were not entirely correct, because I measured the processing time with the time module in Python.
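
A measurement along those lines can be sketched like this (an illustration, not my exact script; the test image path is a placeholder, and frame is assumed to be a BGR numpy image):

import time

import cv2
from detectron2.engine import DefaultPredictor

predictor = DefaultPredictor(cfg)
frame = cv2.imread("test_frame.jpg")  # placeholder test image

# Time a single forward pass with a wall-clock timer
start = time.perf_counter()
outputs = predictor(frame)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Processed one frame in {elapsed_ms:.0f} ms, "
      f"{len(outputs['instances'])} instances detected")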

For training Mask-RCNN on the peas dataset I did several runs in Colab, each of ~30k iterations. The results were not as good, as barely 1/5 of all objects were segmented, and inference time on a single image was 1700-1800 milliseconds:

[image: inference result on the peas dataset]

Initially even fewer objects were segmented; I noticed that cfg.TEST.DETECTIONS_PER_IMAGE is 100 by default, but raising this parameter (e.g. to 1000, 2000, 3000) did not improve the situation.

susanin1970 commented 3 years ago

@tehkillerbee, hi! How many objects do the frames of your dataset contain?

ys31jp commented 3 years ago


Hi @susanin1970, I am interested in Detectron2 as well. Can it be used with custom datasets? Thank you

alexeybozhchenko commented 3 years ago

@tehkillerbee which model from mmdetection did you use to get a nice result?

tehkillerbee commented 3 years ago

@alexeybozhchenko I ended up using Mask RCNN but I had to tweak some parameters to detect a larger number of objects. Specifically,

  1. Change the maximum number of detections. Increasing this value is required to return more than 100 results, at the cost of higher memory usage.
    edit configs/_base_/models/mask_rcnn_r50_fpn.py
    ...
    max_per_img=100 => max_per_img=1000

Additionally, my model struggled with very small objects. This is due to the anchor scale used for COCO.

  2. Reduce the minimum object size that can be detected. The anchor scale sets the minimum ROI used for detection; the default smallest object is 32x32, which makes detecting small objects challenging.

Anchor scale is calculated as anchor_scales * anchor_base_sizes; if anchor_base_sizes is not set, anchor_strides is used by default. If anchor_scales=[8] and anchor_strides=[4, 8, 16, 32, 64], then the anchor scales for each FPN level are calculated as [8*4, 8*8, 8*16, 8*32, 8*64] = [32, 64, 128, 256, 512].

edit configs/_base_/models/mask_rcnn_r50_fpn.py
...

You just need to modify anchor_scales=[8] to anchor_scales=[4].

See https://github.com/open-mmlab/mmdetection/issues/90 for more details

Please note that detecting this many objects requires quite a bit of GPU RAM. I am using a Jetson AGX Xavier with 32GB RAM (shared between CPU/GPU).
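
Putting both edits together, the relevant config fragments might look like this (an mmdetection v2.x-style sketch; field names and placement differ between versions, so treat it as illustrative rather than an exact diff):

# Illustrative fragment of configs/_base_/models/mask_rcnn_r50_fpn.py (v2.x layout)
model = dict(
    rpn_head=dict(
        anchor_generator=dict(
            type='AnchorGenerator',
            scales=[4],                     # was [8]; halves the smallest anchor size
            ratios=[0.5, 1.0, 2.0],
            strides=[4, 8, 16, 32, 64])),
    test_cfg=dict(
        rcnn=dict(
            score_thr=0.05,
            max_per_img=1000)))             # was 100; allow more detections per image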

tehkillerbee commented 3 years ago

@susanin1970 Sorry for the late reply, I detect around 1000 objects per frame in my dataset. Each frame is scaled to 768x768.

The inference speed depends on the number of objects, but it is usually fast enough for my requirements (1000 ms or less).

susanin1970 commented 3 years ago

> Hi, I am interested in Detectron2 as well. Can it be used with custom datasets? Thank you

Sorry for the late reply. Yes, Detectron2 can be used for detection/segmentation of custom objects.

susanin1970 commented 3 years ago

> @susanin1970 Sorry for the late reply, I detect around 1000 objects per frame in my dataset. Each frame is scaled to 768x768. The inference speed depends on the number of objects, but it is usually fast enough for my requirements (1000 ms or less).

Thank you for this detailed comment! I will try MMDetection for segmenting small objects based on your experience.

VABer-dv commented 2 years ago

If the number of objects in the image is more than 512, the OpenCV resize method crashes. To solve this problem, modify the line masks = cv2.resize(masks, (width, height)) in /yolact-master/utils/augmentations.py (~line 162):

cv_limit = 512  # cv2.resize fails on arrays with more than 512 channels
if masks.shape[2] <= cv_limit:
    masks = cv2.resize(masks, (width, height))
else:
    # Split the masks along the channel axis into chunks of at most 512 masks,
    # resize each chunk, and merge them back together. np.atleast_3d guards
    # against cv2.resize dropping the channel axis when a chunk holds one mask.
    masks = np.concatenate([np.atleast_3d(cv2.resize(masks[:, :, i:i + cv_limit], (width, height)))
                            for i in range(0, masks.shape[2], cv_limit)], axis=2)

habjoel commented 2 years ago

Hey @tehkillerbee, happy to find you in this issue. We have been in contact on an MMDeploy GitHub issue regarding the deployment of MMDetection Mask-RCNN models on the Jetson.

TL;DR

I essentially would like to find out if, and to what extent, I need to retrain the (COCO-pretrained) Mask-RCNN model if I want to use different image scales and anchor sizes than the defaults in MMDetection.

Problem Description

Reading through this issue, I have a question that you may be able to answer. I want to use a 3.1MP camera (2064x1544) together with the Mask-RCNN model from MMDetection. I realized that every image gets rescaled so its dimensions lie within the range [800, 1333] before training/inference. However, I would really like to use the full resolution that my camera offers! I therefore changed the img_scale parameter on this line and this line to img_scale = (2064, 1544). (Is that the correct way to do it?)
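
For concreteness, the change looks roughly like this in the dataset pipeline (a sketch of an mmdetection v2.x-style config such as configs/_base_/datasets/coco_instance.py; the exact file and defaults vary by version, and the same edit applies to the test pipeline):

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    dict(type='Resize', img_scale=(2064, 1544), keep_ratio=True),  # was (1333, 800)
    # ... remaining transforms (flip, normalize, pad, collect) unchanged
]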

Before realizing that MMDetection does this resizing, I simply took the pretrained COCO weights and did some transfer learning for the network to detect 7 classes of litter (keeping the resizing at the default), using the TACO dataset and some of my own images (about 1600 in total). I then exported the model to TensorRT and realized that inference performs really badly on my non-rescaled 3.1MP images but very well on images rescaled down to [800, 1333]. I am now obviously trying to achieve that performance on my 3.1MP images without rescaling!

So, simply changing the img_scale for inference on my already trained model (which has only seen rescaled images) does not work. The MMDetection maintainers suggested changing the anchor size by modifying this line from scales=[8] to scales=[12] or even scales=[16]. That does not work either.

I therefore thought that I might need to do some proper retraining of the model with images of size (2064, 1544) and maybe different anchor sizes. (Is that true?) How should I do this? Should I train the model from scratch without COCO pretraining? Should I keep the COCO pretraining and simply do transfer learning with larger image sizes? I think training everything from scratch would require a huge dataset (such as COCO) anyway, and it would cost me a lot of time without knowing whether it would work.

So ideally, I am looking for a way to keep the COCO pretraining (for low-level features) and retrain the model (especially the RPN) to handle my 3.1MP images.

Maybe you could give me some advice, since I see that you have set a different anchor size for your Mask-RCNN model! Did you need to retrain? How did you do it? Did you use the pretrained weights from COCO? (How large is your dataset, and for how many epochs did you train it?)

Thanks a lot for your help!

tehkillerbee commented 2 years ago

Hi @habjoel,

Interesting to see that you have similar challenges when detecting litter. Since this issue is not strictly related to yolact, let's get in touch on LinkedIn (I have sent you a connection request), and then I can try to give you some pointers.