facebookresearch / maskrcnn-benchmark

Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch.

Uniform representation of masks in predictions and annotations #527

Open · botcs opened this issue 5 years ago

botcs commented 5 years ago

🚀 Feature

Uniform representation of masks in predictions and annotations

Motivation

As it became apparent, training requires adding mask annotations via target.add_field("masks", masks), where target is an instance of the BoxList class and masks is an instance of SegmentationMask, which must be able to represent the annotation as a binary mask over the whole image.

In contrast, the predictions returned by the model also come as BoxList instances, but they do not carry a "masks" field; instead they carry "mask". Moreover, prediction.get_field("mask") is not a SegmentationMask instance but a torch tensor of binary mask predictions for a single image, at a uniform size (28x28 by default). Each of these masks then has to be resized according to the position and size specified by prediction.bbox, an [N, 4] tensor containing the coordinates of each instance in the prediction.
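For concreteness, here is a minimal sketch of the two representations side by side. The constructor signatures are quoted from memory and the box/polygon values are invented for illustration, so double-check against the current code:

```python
import torch
from maskrcnn_benchmark.structures.bounding_box import BoxList
from maskrcnn_benchmark.structures.segmentation_mask import SegmentationMask

size = (640, 480)  # (image_width, image_height)
boxes = torch.tensor([[10.0, 20.0, 110.0, 220.0]])  # one instance, xyxy

# Training target: a SegmentationMask (defined over the whole image) under "masks".
polygons = [[[10.0, 20.0, 110.0, 20.0, 110.0, 220.0, 10.0, 220.0]]]
target = BoxList(boxes, size, mode="xyxy")
target.add_field("masks", SegmentationMask(polygons, size))

# Prediction: a plain tensor of fixed-resolution (28x28 by default) per-instance
# mask scores under "mask", defined relative to each box rather than the image.
prediction = BoxList(boxes, size, mode="xyxy")
prediction.add_field("mask", torch.rand(1, 1, 28, 28))
```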

Pitch

This should be unified for clarity and efficiency:

Additional context

(image attachment)

fmassa commented 5 years ago

Hi @botcs

Great question and proposal! Here are my thoughts about it:

While I agree that having this inconsistency is not great, let me explain my original reasoning. I initially wanted to have the same representation for both the input of the model and the output, so that this inconsistency is gone. This means that the model would output a tensor of size image_height x image_width, with the masks pasted into it already. This could also be represented as a Mask instance for convenience.

The reason why I don't do it like that, but instead perform the "mask pasting" during evaluation, is that we generally resize the images during training to a potentially larger size. Since "mask pasting in the image" is an expensive operation, doing it on higher-resolution images takes longer to compute, and this is just wasting resources because we need to rescale the predictions back to the original image size for computing the evaluation anyway.

One way to avoid this would be to pass the original image size to the model's forward, but I find this very ugly and unintuitive.

A solution that I like more is to make the mask pasting run entirely on the GPU in a batched way, which means it doesn't matter much to run this computation on a larger image, as the runtime will be very similar. This is effectively what I originally implemented in an early version, and what has been submitted as a PR in https://github.com/facebookresearch/maskrcnn-benchmark/pull/129 (I also have a CUDA kernel somewhere). The drawback of that implementation is that it doesn't give exactly the same results as the current approach (mostly at the boundaries), which leads to ~2 mAP lower results during evaluation.
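To make the batched-pasting idea concrete, here is a rough sketch using torch.nn.functional.affine_grid / grid_sample. This is my own illustration rather than the code from that PR; the helper name, its arguments, and the align_corners choice are assumptions:

```python
import torch
import torch.nn.functional as F

def paste_masks_batched(masks, boxes, image_h, image_w, threshold=0.5):
    """masks: [N, 1, M, M] mask probabilities, boxes: [N, 4] xyxy pixel coords."""
    N = masks.shape[0]
    x1, y1, x2, y2 = boxes.unbind(dim=1)
    # Affine transform mapping normalized image coordinates to normalized
    # mask coordinates, so that each mask lands exactly inside its box.
    theta = masks.new_zeros(N, 2, 3)
    theta[:, 0, 0] = image_w / (x2 - x1)               # x scale
    theta[:, 1, 1] = image_h / (y2 - y1)               # y scale
    theta[:, 0, 2] = (image_w - x1 - x2) / (x2 - x1)   # x translation
    theta[:, 1, 2] = (image_h - y1 - y2) / (y2 - y1)   # y translation
    grid = F.affine_grid(theta, size=(N, 1, image_h, image_w), align_corners=False)
    # One batched call; locations outside the box sample zeros by default.
    pasted = F.grid_sample(masks, grid, align_corners=False)
    return pasted > threshold  # [N, 1, image_h, image_w] binary masks

# Example: paste two 28x28 predictions into a 100x200 image.
masks = torch.rand(2, 1, 28, 28)
boxes = torch.tensor([[10.0, 20.0, 60.0, 80.0], [100.0, 10.0, 180.0, 90.0]])
full = paste_masks_batched(masks, boxes, image_h=100, image_w=200)
```

Because everything runs as one batched GPU call, pasting into a somewhat larger image changes the runtime very little, which is the point being made above.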

About masks / mask -> yes, this inconsistency is not great, and I believe it could be changed without problems.

The annotations, when converted to binary masks during training, should not represent the whole image, just the internal values of the bounding box.

I'm less sure about this direction. The reason being that it's harder to perform image transformations (resize, crop, etc) on the internal values of the masks. It might be possible to do them, it's just more complicated / noisy maybe?

botcs commented 5 years ago

Hi @fmassa,

Thank you for the insights! This is indeed a not-so-trivial issue to solve, so I would be happy to discuss a few of my concerns.

Could you please clarify a bit? I think I could not follow a few things.

The reason why I don't do it like that, but instead perform the "mask pasting" during evaluation, is that we generally resize the images during training to a potentially larger size.

So let's call the image_height x image_width binary masks fullmask, and the "mask pasting" approach pastemask. Now, the reason not to go with fullmask is that images are potentially upscaled, which is unwanted because...? (Storing the upscaled fullmask is expensive?)

As the "mask pasting in the image" is an expensive operation, if done on higher resolution images, it will take longer to compute, and this is just wasting resources as we need to rescale the predictions back to the original image size for computing the evaluation.

So if I understood correctly, storing the predictions in a uniform (e.g. [num_inst, 1, 28, 28]) tensor is favored because it would be expensive to paste the predictions extracted from the roi_mask head onto a large image (e.g. [600, 800]) when the image will be downscaled anyway to its original size (e.g. [100, 200]) by the evaluation script?

If we are on the same page, then I completely agree: I am in the pastemask camp as well, since it seems to be the most efficient, but I will elaborate on this point later.


One way to avoid this would be to pass the original image size to the model's forward, but I find this very ugly and unintuitive.

Completely agree.


A solution that I like more is to make the mask pasting run entirely on the GPU in a batched way, which means it doesn't matter much to run this computation on a larger image, as the runtime will be very similar. This is effectively what I originally implemented in an early version, and what has been submitted as a PR in #129 (I also have a CUDA kernel somewhere). The drawback of that implementation is that it doesn't give exactly the same results as the current approach (mostly at the boundaries), which leads to ~2 mAP lower results during evaluation.

I have witnessed the artifacts myself when I experimented with giving tiny (2x2) masks to the Masker to paste onto a (50x50) box: the edges were extremely rounded and the mask almost looked like a circle (is this the effect of the bilinear upsampling?). About the drawback: I think these speed optimizations should not affect the results at all, even if they would bring a serious advantage in time complexity. I have had the bad experience of stacking such optimizations on top of each other until I ended up with something extremely fast and efficient that is super unstable :smile:
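A tiny self-contained illustration of that rounding (my own snippet, not the repo's Masker; the one-pixel zero padding is an assumption about what a padded pasting step does): bilinearly upsampling a hard 2x2 mask and thresholding turns the square into a blob with rounded corners.

```python
import torch
import torch.nn.functional as F

tiny = torch.ones(1, 1, 2, 2)               # a 2x2 all-foreground mask
padded = F.pad(tiny, (1, 1, 1, 1))          # one pixel of zero padding on each side
big = F.interpolate(padded, size=(50, 50), mode="bilinear", align_corners=False)
binary = big > 0.5                          # hard mask after bilinear upsampling
# The corners of the original square are cut off by the threshold, which is why
# the pasted mask can look circle-like; print the surviving foreground fraction.
print(binary[0, 0].float().mean())
```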


The annotations, when converted to binary masks during training, should not represent the whole image, just the internal values of the bounding box.

I'm less sure about this direction. The reason being that it's harder to perform image transformations (resize, crop, etc) on the internal values of the masks. It might be possible to do them, it's just more complicated / noisy maybe?

Definitely, pastemask would be less clean and involve more hacks than fullmask. Two out of three transformations would be quite trivial to implement:


Finally, a few other possible issues came up while writing: