facebookresearch / maskrcnn-benchmark

Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch.

Uniform representation of masks in predictions and annotations #527

Open · botcs opened this issue 5 years ago

botcs commented 5 years ago

🚀 Feature

Uniform representation of masks in predictions and annotations

Motivation

As it became apparent, training requires adding mask annotations via target.add_field("masks", masks), where target is an instance of the BoxList class and masks is an instance of SegmentationMask, which must be able to represent the annotation as a binary mask over the whole image.

In contrast, the predictions returned by the model also come as BoxList instances, but they do not carry a "masks" field; instead they carry "mask". Moreover, prediction.get_field("mask") is not a SegmentationMask instance but a torch tensor of binary mask predictions for a single image, at a uniform size (28x28 by default). Each of these masks then has to be resized according to the position and size specified by prediction.bbox, an [N, 4] tensor containing the coordinates of each instance in the prediction.
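For concreteness, here is a minimal sketch of the two representations side by side. The constructor signatures are quoted from memory and the box/polygon values are invented for illustration, so double-check against the current code:

```python
import torch
from maskrcnn_benchmark.structures.bounding_box import BoxList
from maskrcnn_benchmark.structures.segmentation_mask import SegmentationMask

size = (640, 480)  # (image_width, image_height)
boxes = torch.tensor([[10.0, 20.0, 110.0, 220.0]])  # one instance, xyxy

# Training target: a SegmentationMask (defined over the whole image) under "masks".
polygons = [[[10.0, 20.0, 110.0, 20.0, 110.0, 220.0, 10.0, 220.0]]]
target = BoxList(boxes, size, mode="xyxy")
target.add_field("masks", SegmentationMask(polygons, size))

# Prediction: a plain tensor of fixed-resolution (28x28 by default) per-instance
# mask scores under "mask", defined relative to each box rather than the image.
prediction = BoxList(boxes, size, mode="xyxy")
prediction.add_field("mask", torch.rand(1, 1, 28, 28))
```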

Pitch

This should be unified for clarity and efficiency:

Additional context

(image attachment)

fmassa commented 5 years ago

Hi @botcs

Great question and proposal! Here are my thoughts about it:

While I agree that having this inconsistency is not great, let me explain my original reasoning. I initially wanted to have the same representation for both the input of the model and the output, so that this inconsistency is gone. This means that the model would output a tensor of size image_height x image_width, with the masks pasted into it already. This could also be represented as a Mask instance for convenience.

The reason why I don't do it like that, but instead perform the "mask pasting" during evaluation, is that we generally resize the images during training to a potentially larger size. Since "mask pasting in the image" is an expensive operation, doing it on higher-resolution images takes longer to compute, and this is just wasting resources because we need to rescale the predictions back to the original image size for computing the evaluation anyway.

One way to avoid this would be to pass the original image size to the model's forward, but I find this very ugly and unintuitive.

A solution that I like more is to make the mask pasting run entirely on the GPU in a batched way, which means it doesn't matter much to run this computation on a larger image, as the runtime will be very similar. This is effectively what I originally implemented in an early version, and what has been submitted as a PR in https://github.com/facebookresearch/maskrcnn-benchmark/pull/129 (I also have a CUDA kernel somewhere). The drawback of that implementation is that it doesn't give exactly the same results as the current approach (mostly at the boundaries), which leads to ~2 mAP lower results during evaluation.
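To make the batched-pasting idea concrete, here is a rough sketch using torch.nn.functional.affine_grid / grid_sample. This is my own illustration rather than the code from that PR; the helper name, its arguments, and the align_corners choice are assumptions:

```python
import torch
import torch.nn.functional as F

def paste_masks_batched(masks, boxes, image_h, image_w, threshold=0.5):
    """masks: [N, 1, M, M] mask probabilities, boxes: [N, 4] xyxy pixel coords."""
    N = masks.shape[0]
    x1, y1, x2, y2 = boxes.unbind(dim=1)
    # Affine transform mapping normalized image coordinates to normalized
    # mask coordinates, so that each mask lands exactly inside its box.
    theta = masks.new_zeros(N, 2, 3)
    theta[:, 0, 0] = image_w / (x2 - x1)               # x scale
    theta[:, 1, 1] = image_h / (y2 - y1)               # y scale
    theta[:, 0, 2] = (image_w - x1 - x2) / (x2 - x1)   # x translation
    theta[:, 1, 2] = (image_h - y1 - y2) / (y2 - y1)   # y translation
    grid = F.affine_grid(theta, size=(N, 1, image_h, image_w), align_corners=False)
    # One batched call; locations outside the box sample zeros by default.
    pasted = F.grid_sample(masks, grid, align_corners=False)
    return pasted > threshold  # [N, 1, image_h, image_w] binary masks

# Example: paste two 28x28 predictions into a 100x200 image.
masks = torch.rand(2, 1, 28, 28)
boxes = torch.tensor([[10.0, 20.0, 60.0, 80.0], [100.0, 10.0, 180.0, 90.0]])
full = paste_masks_batched(masks, boxes, image_h=100, image_w=200)
```

Because everything runs as one batched GPU call, pasting into a somewhat larger image changes the runtime very little, which is the point being made above.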

About masks / mask -> yes, this inconsistency is not great, and I believe it could be changed without problems.

The annotations, when converted to binary masks during training, should not represent the whole image, just the internal values of the bounding box.

I'm less sure about this direction. The reason being that it's harder to perform image transformations (resize, crop, etc) on the internal values of the masks. It might be possible to do them, it's just more complicated / noisy maybe?

botcs commented 5 years ago

Hi @fmassa,

Thank you for the insights! This is indeed a not-so-trivial issue to solve, so I would be happy to discuss a few of my concerns.

Could you please clarify a bit? I think I could not follow a few things.

The reason why I don't do it like that, but instead perform the "mask pasting" during evaluation, is that we generally resize the images during training to a potentially larger size.

So let's call the image_height x image_width binary masks fullmask, and the "mask pasting" approach pastemask. Now, the reason not to go with fullmask is that images are potentially upscaled, which is unwanted because...? (Storing the upscaled fullmask is expensive?)

As the "mask pasting in the image" is an expensive operation, if done on higher resolution images, it will take longer to compute, and this is just wasting resources as we need to rescale the predictions back to the original image size for computing the evaluation.

So if I understood correctly, storing the predictions in a uniform (e.g. [num_inst, 1, 28, 28]) tensor is favored because it would be expensive to paste the predictions extracted from the roi_mask head onto a large image (e.g. [600, 800]) when the image will be downscaled anyway to its original size (e.g. [100, 200]) by the evaluation script?

If we are on the same page, then I completely agree: I am in the pastemask camp as well, since it seems to be the most efficient, but I will elaborate on this point later.


One way to avoid this would be to pass the original image size to the model's forward, but I find this very ugly and unintuitive.

Completely agree.


A solution that I like more is to make the mask pasting run entirely on the GPU in a batched way, which means it doesn't matter much to run this computation on a larger image, as the runtime will be very similar. This is effectively what I originally implemented in an early version, and what has been submitted as a PR in #129 (I also have a CUDA kernel somewhere). The drawback of that implementation is that it doesn't give exactly the same results as the current approach (mostly at the boundaries), which leads to ~2 mAP lower results during evaluation.

I have witnessed the artifacts myself when I experimented with giving tiny (2x2) masks to the Masker to paste onto a (50x50) box: the edges were extremely rounded and the mask almost looked like a circle (is this the effect of the bilinear upsampling?). About the drawback: I think these speed optimizations should not affect the results at all, even if they would bring a serious advantage in time complexity. I have had the bad experience of stacking such optimizations on top of each other until I ended up with something extremely fast and efficient that is super unstable :smile:
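A tiny self-contained illustration of that rounding (my own snippet, not the repo's Masker; the one-pixel zero padding is an assumption about what a padded pasting step does): bilinearly upsampling a hard 2x2 mask and thresholding turns the square into a blob with rounded corners.

```python
import torch
import torch.nn.functional as F

tiny = torch.ones(1, 1, 2, 2)               # a 2x2 all-foreground mask
padded = F.pad(tiny, (1, 1, 1, 1))          # one pixel of zero padding on each side
big = F.interpolate(padded, size=(50, 50), mode="bilinear", align_corners=False)
binary = big > 0.5                          # hard mask after bilinear upsampling
# The corners of the original square are cut off by the threshold, which is why
# the pasted mask can look circle-like; print the surviving foreground fraction.
print(binary[0, 0].float().mean())
```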


The annotations, when converted to binary masks during training, should not represent the whole image, just the internal values of the bounding box.

I'm less sure about this direction. The reason being that it's harder to perform image transformations (resize, crop, etc) on the internal values of the masks. It might be possible to do them, it's just more complicated / noisy maybe?

Definitely, pastemask would be less clean and involve more hacks than fullmask. Two out of three transformations would be quite trivial to implement:


Finally, a few other possible issues came up while writing: