NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.
MIT License

Data augmentation for object detection #289

Open Alberto1404 opened 1 year ago

Alberto1404 commented 1 year ago

Hello, I am attempting to fine-tune DETR-ResNet-50 on a custom dataset, which is why I have been following your GitHub tutorial (1) as well as the "Object detection" guide in the 🤗 Transformers docs (2). I want to apply some extra data augmentation, as done in (2). However, my dataset is already in COCO format; that is, I already have my train, val and test JSON files. For that reason, a class that extends torchvision.datasets.CocoDetection, as done in (1), seemed the best option to follow. This is the baseline class I would modify to add data augmentation:

import os

import torchvision


class CocoDetection(torchvision.datasets.CocoDetection):
    def __init__(self, img_folder, processor, train=True):
        ann_file = os.path.join(img_folder, "custom_train.json" if train else "custom_val.json")
        super(CocoDetection, self).__init__(img_folder, ann_file)
        self.processor = processor

    def __getitem__(self, idx):
        # read in PIL image and target in COCO format
        # feel free to add data augmentation here before passing them to the next step
        img, target = super(CocoDetection, self).__getitem__(idx)

        # preprocess image and target (converting target to DETR format, resizing + normalization of both image and target)
        image_id = self.ids[idx]
        target = {'image_id': image_id, 'annotations': target}
        encoding = self.processor(images=img, annotations=target, return_tensors="pt")
        pixel_values = encoding["pixel_values"].squeeze() # remove batch dimension
        target = encoding["labels"][0] # remove batch dimension

        return pixel_values, target
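To make the question concrete, the conversion that has to happen either way is this: CocoDetection returns a list of COCO annotation dicts, while Albumentations expects the boxes and their labels as two separate parallel lists. A small sketch of helpers I have in mind (the names are my own, not from the tutorial):

```python
def coco_to_albumentations(annotations):
    """Split COCO annotation dicts into parallel bbox and category_id lists.

    Albumentations takes bounding boxes and their labels as separate
    sequences; the label sequence is the one named in `label_fields`.
    """
    bboxes = [ann["bbox"] for ann in annotations]            # [x_min, y_min, w, h]
    category_ids = [ann["category_id"] for ann in annotations]
    return bboxes, category_ids


def albumentations_to_coco(bboxes, category_ids, image_id):
    """Rebuild COCO-style annotation dicts after augmentation."""
    return [
        {
            "image_id": image_id,
            "bbox": list(bbox),
            "category_id": cat_id,
            "area": bbox[2] * bbox[3],  # width * height
            "iscrowd": 0,
        }
        for bbox, cat_id in zip(bboxes, category_ids)
    ]
```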

An example of augmentations to apply:

A.Compose(
        [
            A.CenterCrop(height=..., width=..., p=0.5),  # CenterCrop requires a target size
            A.HorizontalFlip(p=0.5),
            A.Lambda(image=mirror_pad, p=0.5),  # mirror_pad is a custom function of mine
            A.ColorJitter(p=0.5),
            A.CLAHE(),
        ],
        bbox_params=A.BboxParams(format='coco', label_fields=...),
)

The content of label_fields is quite confusing to me. Observe the following screenshot from the 🤗 object detection guide:

[screenshot: Albumentations Compose from the 🤗 object detection guide]

They are setting label_fields=["category"], but "category" in that code snippet refers to the category_ids rather than the category names (classes), as shown in the Albumentations docs:

[screenshots: label_fields examples from the Albumentations docs]

Another question regarding the augmentations: where should the Compose of augmentations be defined? As far as I know, we cannot access the label_fields in __init__ (and what exactly they should refer to depends on the answer to my previous question). Should it be done directly in __getitem__? In (2), the 🤗 Datasets with_transform method is used to apply the transformations to both images and labels. How can I use that method while reading the dataset with the class above?
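My current idea is to do it directly in __getitem__, along these lines: split the COCO annotation dicts into the parallel lists Albumentations expects, call the transform, and rebuild the annotations before handing everything to self.processor. A sketch of that glue code (apply_augmentation is a name I made up; `transform` is any callable with the Albumentations keyword-argument convention shown above):

```python
def apply_augmentation(transform, image, target):
    """Apply an Albumentations-style transform to an image and its COCO target.

    `target` is the list of COCO annotation dicts returned by
    torchvision.datasets.CocoDetection.__getitem__. With
    label_fields=["category_ids"], the labels are passed under the
    `category_ids` keyword and come back under the same key.
    """
    bboxes = [ann["bbox"] for ann in target]                 # [x_min, y_min, w, h]
    category_ids = [ann["category_id"] for ann in target]
    out = transform(image=image, bboxes=bboxes, category_ids=category_ids)
    # Rebuild COCO-style annotations from the (possibly filtered) outputs.
    augmented = [
        {**ann, "bbox": list(bbox), "category_id": cat}
        for ann, bbox, cat in zip(target, out["bboxes"], out["category_ids"])
    ]
    return out["image"], augmented
```

In __getitem__ this would be called on the output of super().__getitem__(idx), before building the target dict for self.processor.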

SUMMARY

Would it be possible to make a more in-depth and clearer tutorial on how to train an object detection model (applying data augmentation to both images and labels) using a CocoDetection class? Any help on this would be highly appreciated.