NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

How to train Mask2Former from a COCO json custom dataset? #296

Open · Robotatron opened this issue 1 year ago

Robotatron commented 1 year ago

I have my custom dataset as a json in COCO format.

The MaskFormer and Mask2Former tutorials for the Hugging Face implementations work with a different custom dataset format (an unusual RGB encoding where each channel carries a custom meaning).

Is there a simple Dataset implementation that takes the coco.json file and outputs the data in a format the Hugging Face Mask2Former can work with? If not, is there a tutorial on how to convert a COCO JSON custom dataset into the format Hugging Face needs?

NielsRogge commented 11 months ago

I'd recommend taking a look here: https://github.com/facebookresearch/detr/blob/3af9fa878e73b6894ce3596450a8d9b89d918ca9/datasets/coco.py#L74-L76. The data preparation is equivalent for MaskFormer/Mask2Former/OneFormer.

Basically, COCO stores segmentation masks as polygons, so you need to convert them to a set of binary masks, which is the format that the models expect.
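For reference, pycocotools can do that polygon-to-mask conversion directly. A minimal sketch (the annotation path here is just a placeholder):

```python
import numpy as np
from pycocotools.coco import COCO

# Placeholder path to a standard COCO instances json.
coco = COCO("annotations/instances_train.json")

image_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=image_id))

# One (height, width) binary mask per instance, plus its class id.
binary_masks = np.stack([coco.annToMask(ann) for ann in anns])
class_ids = np.array([ann["category_id"] for ann in anns])
```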

Robotatron commented 11 months ago

Thanks @NielsRogge

It seems the solution for me is to write a custom dataset converter that turns my polygon annotations into the custom RGB format (R channel for class ID, G channel for instance ID) used by the Hugging Face Mask2Former implementation.

Another question, if you don't mind: is it possible to load weights trained with the official Mask2Former implementation in Detectron2 into the Hugging Face model?

P.S. I find it surprising that Hugging Face doesn't have a built-in way to import datasets in COCO format, which is so widely used in computer vision. Other CV libraries like fiftyone and detectron2 have this feature, but I couldn't find any open source contributions for COCO-format datasets in Hugging Face. It's quite strange that the format isn't commonly used within the Hugging Face ecosystem.

NielsRogge commented 11 months ago

> It seems the solution for me is to write a custom dataset converter that turns my polygon annotations into the custom RGB format (R channel for class ID, G channel for instance ID) used by the Hugging Face Mask2Former implementation.

No, this is actually not required; you only need to create a set of binary masks for each image. The custom RGB format was specific to the dataset used in my notebook.

Regarding the COCO format: we do support it for all our object detection models: https://huggingface.co/docs/transformers/tasks/object_detection. However, the image processors for MaskFormer and friends expect segmentation maps with a single label per pixel.

If I find the time I'll create a tutorial on this. For now, I recommend using the convert_coco_polys_to_mask function.
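A rough sketch of what such a dataset could look like, assuming pycocotools (the class name and paths are hypothetical, and error handling is omitted):

```python
import os

import numpy as np
import torch
from PIL import Image
from pycocotools.coco import COCO
from torch.utils.data import Dataset


class CocoInstanceDataset(Dataset):
    """Yields (image, binary masks, class labels) from a COCO-style json."""

    def __init__(self, img_dir, ann_file):
        self.img_dir = img_dir
        self.coco = COCO(ann_file)
        self.img_ids = sorted(self.coco.getImgIds())

    def __len__(self):
        return len(self.img_ids)

    def __getitem__(self, idx):
        img_id = self.img_ids[idx]
        info = self.coco.loadImgs(img_id)[0]
        image = Image.open(os.path.join(self.img_dir, info["file_name"])).convert("RGB")

        # Convert each polygon annotation to a (height, width) binary mask.
        # Assumes every image has at least one annotation.
        anns = self.coco.loadAnns(self.coco.getAnnIds(imgIds=img_id))
        masks = torch.as_tensor(
            np.stack([self.coco.annToMask(ann) for ann in anns]), dtype=torch.float32
        )
        labels = torch.as_tensor([ann["category_id"] for ann in anns], dtype=torch.int64)
        return image, masks, labels
```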

Robotatron commented 11 months ago

> However, the image processors for MaskFormer and friends expect segmentation maps with a single label per pixel.

Does that mean Mask2Former as implemented in Hugging Face does not support instance segmentation with overlapping masks? E.g. a dataset with labels for "person" and "T-shirt", where the polygons / pixel labels overlap. So far it sounds like only semantic segmentation is supported in Hugging Face, whereas the original Mask2Former implementation in Detectron2 also supports instance and panoptic segmentation with overlapping masks/polygons.

With the Detectron2 framework I can easily train a model where polygons overlap each other (i.e. more than one label per pixel). I was hoping to switch to Hugging Face, since I've worked with Mask2Former in Detectron2.

If the Hugging Face Mask2Former implementation does not support instance segmentation with overlapping masks, are there any other modern instance segmentation models in Hugging Face that do support overlapping polygons/masks?

NielsRogge commented 11 months ago

@Robotatron it does support it; it's the image processor (which can be used to speed up data preparation) that doesn't. So I'd advise preparing the data for the model yourself: each training example consists of an image and multiple binary masks (one per instance). You can then train on overlapping masks.
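A minimal sketch of that idea: pass mask_labels and class_labels directly to the model's forward pass, which is what allows the masks to overlap. Random tensors stand in for real data here, and the checkpoint is just a public example:

```python
import torch
from transformers import Mask2FormerForUniversalSegmentation

model = Mask2FormerForUniversalSegmentation.from_pretrained(
    "facebook/mask2former-swin-tiny-coco-instance"
)

pixel_values = torch.randn(1, 3, 384, 384)  # batch of 1 image
# Two (possibly overlapping) binary instance masks for that image.
mask_labels = [torch.randint(0, 2, (2, 384, 384)).float()]
# One class id per instance mask.
class_labels = [torch.tensor([0, 1])]

outputs = model(
    pixel_values=pixel_values, mask_labels=mask_labels, class_labels=class_labels
)
print(outputs.loss)  # training loss over the matched queries
```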

cyh-0 commented 6 months ago

Hi @NielsRogge,

How does MaskFormer handle background classes for semantic segmentation tasks? Do we need to include both a no-object and a background class in the predictions (class + background + no-object)? I've noticed that performance drops significantly if I use (class + background) instead of (class + background + no-object).

Cheers, yh

NielsRogge commented 6 months ago

@cyh-0 MaskFormer outputs a binary mask + class for each of its object queries (model.config.num_queries). If, for example, an image contains 2 semantic categories and the model uses 100 object queries, then 98 of those queries should predict the "no object" class. So it's quite essential to include the "no object" class.
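One way to see this extra slot: the class logits for each query have num_labels + 1 entries, the last one being "no object". A quick sketch with a public checkpoint (random input, just to inspect shapes):

```python
import torch
from transformers import MaskFormerForInstanceSegmentation

model = MaskFormerForInstanceSegmentation.from_pretrained(
    "facebook/maskformer-swin-base-ade"
)
outputs = model(pixel_values=torch.randn(1, 3, 384, 384))

# Shape: (batch_size, num_queries, num_labels + 1); the +1 is "no object".
print(outputs.class_queries_logits.shape)
print(model.config.num_labels)
```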

cyh-0 commented 6 months ago

@NielsRogge Thanks for the explanation!

jetsonwork commented 3 months ago

Hi @NielsRogge, thank you for your great work. I would appreciate it if you could provide a short tutorial on using a COCO-style dataset for fine-tuning Mask2Former.

Best regards