SHI-Labs / OneFormer

OneFormer: One Transformer to Rule Universal Image Segmentation, arxiv 2022 / CVPR 2023
https://praeclarumjj3.github.io/oneformer
MIT License

Resources required to fine-tune this model with swin-l #109

Closed by th0mas-codes 6 months ago

th0mas-codes commented 6 months ago

I have, let's say, a set of 1,000 heavily annotated, domain-specific panoptic images that I would like to fine-tune on. I have seen that training of this model was done on a large number of A100s (8 A100s, if I remember correctly).

Is this because vision transformers are generally just much more compute-expensive to train? If I only have a single A100 40 GB available, is it unrealistic to fine-tune on anything in the range of 1,000 to 5,000 images?

I have found some really good Colab notebooks and managed the basics of setting up fine-tuning for the panoptic task, but even in a simple test with one image and its corresponding mask .png file I was getting out-of-memory errors on the A100 40 GB, based off of:

model = AutoModelForUniversalSegmentation.from_pretrained("shi-labs/oneformer_coco_swin_large", is_training=True)

Maybe my setup is wrong, or maybe I'm trying with too few resources? I would really like to understand what resource demand I'm looking at here, since it seems very different from some of the CNN models I have previously worked with.

I am fairly new to the domain, so please excuse my lack of knowledge; any input is much appreciated!

alder-french-leviton commented 6 months ago

Hey @th0mas-codes, I'm in a similar boat trying to figure out how to fine-tune OneFormer for panoptic segmentation, and I'm curious why you closed this issue. Were you able to figure out how to fine-tune it?

th0mas-codes commented 6 months ago

I have managed to train on my custom data, and initial tests are showing good results. Getting the data into the correct format was a bit involved for me, but I managed.

I was able to turn the image batch size down from 16 to 4 and train using a single A100 40 GB GPU. I was expecting to have to start from a pretrained checkpoint (e.g. on the ADE20K dataset) and fine-tune on top of that for decent results, but for my heavily annotated dataset of 1,000 images with 2 classes I was able to get decent results just by training from "scratch".

I ended up not moving forward with the Hugging Face pretrained method and instead followed their guide on setting up training with detectron2 and the train_net.py file (with tweaks), as there are a lot of neat things from detectron2 that I can use / are already set up. I'm sure the Hugging Face fine-tuning route could be easier if you know more about what you are doing, but I'm fairly new to all this.
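For reference, the launch boils down to something like the following (the config path is illustrative, not the exact one I used; SOLVER.IMS_PER_BATCH is the standard detectron2 override for the image batch size):

python train_net.py --config-file configs/coco/swin/<your_config>.yaml \
    --num-gpus 1 SOLVER.IMS_PER_BATCH 4 OUTPUT_DIR outputs/my_run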

alder-french-leviton commented 6 months ago

@th0mas-codes Thanks for getting back to me! I was hoping to use one of the pretrained models myself, but it makes sense that doing so is more difficult. It's good to know that if I end up training the model from scratch instead, it can still work well. If I figure out how to fine-tune the pretrained model, I'll let you know.

I have one question: what is your dataset format? I am a bit unclear on what exactly the OneFormer model expects; at first I thought it was COCO, but now I'm thinking it's the detectron2 format. Maybe even the same one used in this DigitalSreeni YouTube video? If so, that would be nice, haha.

EDIT: Actually, it seems like detectron2 uses the COCO format? I was able to figure out what the COCO JSON annotations look like, including their 'segmentation' polygons, by creating an example in Makesense. But there's still one thing I'm unsure of: whether .png instance segmentation masks for each image are required for training OneFormer. I would think the segmentation JSON has all the necessary data, since it describes both the segmentation instances and their classes, yet the COCO train2017 panoptic dataset includes .png segmentation masks for each image.

Pari-singh commented 6 months ago

Hi @alder-french-leviton, @th0mas-codes and @praeclarumjj3. I trained the DiNAT-backbone model on my custom images and got decent results. Now I want to fine-tune those trained weights for some internal tasks, where I will receive 500 new images on a regular basis. As you can imagine, combining the entire dataset and retraining each time is a killer, so I am looking for a way to fine-tune the weights on each incoming batch of 500 images. However, I couldn't find a way to freeze layers for DiNAT: the config file (unlike the one for ResNet) does not have a FREEZE option for MODEL.BACKBONE. Can any of you give more info on how to approach this problem?

Thanks

alder-french-leviton commented 5 months ago

@Pari-singh I ended up giving up on getting OneFormer working with the detectron2 framework because I was running into so many errors when setting up its environment (it seems to require old libraries and an old CUDA Toolkit). Instead, I tried Hugging Face's OneFormer, and after a lot of elbow grease I got it working: I have been training OneFormer for panoptic tasks, with batching, using the Hugging Face OneFormer model with the Swin Tiny backbone. If you want to do this yourself, a good starting point is this notebook. I might make a tutorial based on my notebook so others can see how to use the OneFormer model for panoptic training with batching, but I'm busy, so it will be a while before I get it out.

jetsonwork commented 3 months ago

@alder-french-leviton Hi, could you please share the tutorial if it is ready? Thanks!

jetsonwork commented 3 months ago

@th0mas-codes Hi, is it possible to share the code? Thanks.

alder-french-leviton commented 3 months ago

@jetsonwork Sorry, I'm busy with other work at the moment, but I'm hoping to share the tutorial in the next month or two. However, I can confirm that it is possible to train OneFormer on custom datasets for panoptic segmentation, with batching, via the OneFormer Hugging Face model and the Hugging Face platform. If you are trying to train with OneFormer's Hugging Face model, you can ask me specific questions in this thread and I'll try to steer you in the right direction.

jetsonwork commented 3 months ago

@alder-french-leviton Thank you. I have annotated my images in COCO format, and I have already reviewed this tutorial, but I don't know what changes I should make to feed my dataset to the OneFormer model for panoptic or instance segmentation. Thanks again for your help.

alder-french-leviton commented 3 months ago

@jetsonwork Sounds like you're off to a good start since you already have your data in COCO format. However, the data format used to train the HF OneFormer model is actually not the COCO format. Instead, each training example consists of an image and an associated PNG bitmask (like this) that uses different pixel colors to indicate different segments. Fortunately, you can convert your COCO-format data to this kind of bitmask without too much hassle; this tutorial helps, and there is a rough sketch of the idea below. Once that's done, there are basically two big steps to getting the Hugging Face OneFormer working with your custom data:

1. Get the HF OneFormer model to train without batching on your custom dataset. This can be done by modifying the tutorial you linked. The main thing you will want to do in this step is modify the CustomDataset class's __getitem__ function to use your data, rather than the 2 cat images hardcoded in the tutorial.
2. Get batching working. This required a lot of work, including a custom collate function, so let me know when you finish step 1 and I can give you instructions for batching.

Btw, are you trying to do instance or panoptic segmentation?
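Roughly, the COCO-to-mask conversion can look something like this. This is an untested sketch using pycocotools, not the exact script from the tutorial; the output layout (one *_mask.png per image plus a JSON with a file_name and an instance_id_to_semantic_id map per image) is just one option, so adjust names and category ids to whatever your dataset class expects:

import json
import os

import numpy as np
from PIL import Image
from pycocotools.coco import COCO

def coco_to_instance_masks(ann_file, out_dir):
    """Write one instance-id PNG mask per image, plus a JSON file mapping
    each instance id in that mask to its semantic (category) id."""
    coco = COCO(ann_file)
    os.makedirs(out_dir, exist_ok=True)
    meta = {}
    for idx, img_id in enumerate(coco.getImgIds()):
        info = coco.loadImgs(img_id)[0]
        anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
        # Pixel value 0 stays background; each annotation gets its own instance id.
        mask = np.zeros((info["height"], info["width"]), dtype=np.uint8)
        id_map = {0: 0}
        for instance_id, ann in enumerate(anns, start=1):
            mask[coco.annToMask(ann) == 1] = instance_id
            id_map[instance_id] = ann["category_id"]
        Image.fromarray(mask).save(
            os.path.join(out_dir, info["file_name"].replace(".jpg", "_mask.png")))
        meta[idx] = {"file_name": info["file_name"],
                     "instance_id_to_semantic_id": id_map}
    with open(os.path.join(out_dir, "instance_to_semantic.json"), "w") as f:
        json.dump(meta, f)

Note that this simple version just overwrites pixels where instances overlap, and uint8 caps you at 255 instances per image, which was fine for my data.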

EricLe-dev commented 2 months ago

@alder-french-leviton thank you for the detailed clarification. I'm having an issue with fine-tuning OneFormer for segmentation on multiple GPUs. I followed this tutorial and was able to fine-tune OneFormer; however, when I try to fine-tune the model on multiple GPUs, it does not work.

I tried two approaches:

1. Using DataParallel

import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader

# some code the same as your tutorial
processor.image_processor.num_text = model.config.num_queries - model.config.text_encoder_n_ctx

train_dataset = CustomDataset(processor)
train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True, num_workers=16)
optimizer = AdamW(model.parameters(), lr=5e-5)

model = nn.DataParallel(model)
device = 'cuda'
model.to(device)
model.train()

for epoch in range(20):  # loop over the dataset multiple times
    for batch in train_dataloader:
        # zero the parameter gradients
        optimizer.zero_grad()
        batch = {k:v.to(device) for k,v in batch.items()}

        # forward pass
        outputs = model(**batch)

        # backward pass + optimize
        loss = outputs.loss
        print("Loss:", loss.item())
        loss.backward()
        optimizer.step()

This code runs normally, but only GPU 0 is utilized; the other GPUs do not seem to do any work. Here is the output of nvidia-smi while it's running:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.239.06   Driver Version: 470.239.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 55%   58C    P2   196W / 356W |  20651MiB / 24268MiB |     71%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:3C:00.0 Off |                  N/A |
| 59%   57C    P2   121W / 356W |      8MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:5E:00.0 Off |                  N/A |
| 53%   54C    P2   120W / 356W |      8MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:86:00.0 Off |                  N/A |
| 53%   47C    P2   118W / 356W |      8MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  Off  | 00000000:D8:00.0 Off |                  N/A |
| 60%   58C    P2   137W / 356W |      8MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  Off  | 00000000:D9:00.0 Off |                  N/A |
| 60%   58C    P2   111W / 356W |      8MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2170      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A   2809467      C   python                          20643MiB |
|    1   N/A  N/A      2170      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2170      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      2170      G   /usr/lib/xorg/Xorg                  4MiB |
|    4   N/A  N/A      2170      G   /usr/lib/xorg/Xorg                  4MiB |
|    5   N/A  N/A      2170      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

2. Using Accelerate

Following this tutorial, I modified the code as follows:

from accelerate import Accelerator
from torch.optim import AdamW
from torch.utils.data import DataLoader

processor.image_processor.num_text = model.config.num_queries - model.config.text_encoder_n_ctx

train_dataset = CustomDataset(processor)
# val_dataset = CustomDataset(processor)

train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True, num_workers=16)
optimizer = AdamW(model.parameters(), lr=5e-5)

accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()

for epoch in range(20):  # loop over the dataset multiple times
    for batch in train_dataloader:

        # zero the parameter gradients
        optimizer.zero_grad()
        # batch = {k:v.to(device) for k,v in batch.items()}

        # forward pass
        outputs = model(**batch)

        # backward pass + optimize
        loss = outputs.loss
        print("Loss:", loss.item())
        accelerator.backward(loss)
        optimizer.step()

This code also ran normally, but again only GPU 0 did any work.

I'm quite sure I'm missing something here. Can you please point me in the right direction? Thank you so much!

alder-french-leviton commented 2 months ago

@EricLe-dev you're welcome! It feels good to help out instead of asking questions for once. As for your current issue: I'm afraid I have not tried training with multiple GPUs myself, so I'm not sure how to resolve it. However, I can share the code for my dataset, dataloader, and custom collate function that I used to get batching working, since I think that could be useful for your situation. I apologize for the messy code; I don't have time to clean it up, but I figure it's better than nothing. Note that this code was used to train Hugging Face's OneFormer model for panoptic segmentation. Anyway, here it is:

Dataset:

from torch.utils.data import Dataset
import numpy as np
from PIL import Image
import json
import os

class CustomOneformerHfDataset(Dataset):
    def __init__(self, processor, instance_to_semantic_filename, img_dir, transform=None):
        self.processor = processor # pass AutoProcessor.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
        with open(instance_to_semantic_filename, "r") as json_file:
            json_data = json.load(json_file) 
        self.instance_id_to_semantic_ids = json_data
        self.length = len(json_data)
        self.img_dir = img_dir #e.g. "./test_folder"
        mask_files = {}
        image_files = {}
        for id, img_meta in json_data.items():
            file_name = img_meta["file_name"]
            image_files[int(id)] = file_name
            mask_files[int(id)] = f"{file_name.replace('.jpg', '_mask.png')}"
        self.mask_files = mask_files # maps image ids to their mask paths
        self.image_files = image_files
        self.transform = transform
        # ALDER UNCOMMENTED LINE BELOW 1/29/2024
        self.task_inputs = ["panoptic"] * self.length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Each item must have 3 parts from user: 1. Image .jpg, 2. Mask .png, 3.instance_id_to_semantic_id dict
        # 1. Get Image
        image_path = os.path.join(self.img_dir, self.image_files[idx])        
        image = Image.open(image_path)
        # 2. Get Mask
        mask_path = os.path.join(self.img_dir, self.mask_files[idx])        
        mask = Image.open(mask_path)
        # 3. Get instance_id_to_semantic_id dict
        instance_id_to_semantic_id = self.instance_id_to_semantic_ids[str(idx)]["instance_id_to_semantic_id"]

        # LAST STEP - Process the objects and return them in the proper format for OneFormer HF
        # Transform image and mask so all items are same shape (needed for batched data)
        if self.transform:
            image, mask = self.transform(image, mask)
        # Convert keys and values to integers since this is format processor expects
        int_dict = {int(key): int(value) for key, value in instance_id_to_semantic_id.items()}
        #NOTE: below is a hacky fix for KeyError: 0 in:
        # class_id = instance_id_to_semantic_id[label + 1 if reduce_labels else label]
        for i in range(150):
            if i not in int_dict:
                int_dict[i] = i
        # Get map from mask
        map = np.array(mask)
        # Use processor to convert this to a list of binary masks, labels, text inputs and task inputs
        #inputs = self.processor(images=image, segmentation_maps=map, task_inputs=["panoptic"], return_tensors="pt", instance_id_to_semantic_id=int_dict)
        #inputs = self.processor(images=image, segmentation_maps=map, task_inputs=self.task_inputs, return_tensors="pt", instance_id_to_semantic_id=int_dict)
        # Idk why but the original notebook by Niels has the line below so it's here too. 
        #inputs = {k:v.squeeze() if isinstance(v, torch.Tensor) else v[0] for k,v in inputs.items()}
        #return inputs
        item_dict = {"image":image, "map":map, "int_dict":int_dict}
        return item_dict
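
For context, here is roughly how I wire it up; the checkpoint name and paths below are placeholders, so substitute whichever OneFormer checkpoint and folders you are actually using:

from transformers import AutoModelForUniversalSegmentation, AutoProcessor

# Load the model in training mode and a matching processor (placeholder checkpoint).
model = AutoModelForUniversalSegmentation.from_pretrained(
    "shi-labs/oneformer_ade20k_swin_tiny", is_training=True)
processor = AutoProcessor.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
# The processor needs num_text set, same as in the snippets above.
processor.image_processor.num_text = model.config.num_queries - model.config.text_encoder_n_ctx

dataset = CustomOneformerHfDataset(
    processor,
    instance_to_semantic_filename="./test_folder/instance_to_semantic.json",  # placeholder path
    img_dir="./test_folder")  # placeholder path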

Custom Collate:

def custom_collate(processor, batch):
    task_inputs = ["panoptic" for i in range(len(batch))]
    #print(batch[0])
    batch_maps = []
    batch_images = [] 
    batch_int_dicts = []
    for item in batch:
        batch_maps.append(item["map"])
        batch_images.append(item["image"]) 
        batch_int_dicts.append(item["int_dict"])
    processed_batch = processor(images=batch_images, segmentation_maps=batch_maps, task_inputs=task_inputs, return_tensors="pt", instance_id_to_semantic_id=batch_int_dicts)
    # Call the default collate function on the processed data
    #default_collated_batch = torch.utils.data.dataloader.default_collate(processed_batch)
    # ALDER JUST ADDED NEWLINE BELOW 1/29/2024
    #processed_batch = {k:v.squeeze() if isinstance(v, torch.Tensor) else v[0] for k,v in processed_batch.items()}
    return processed_batch

Dataloader initialization:

from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=lambda batch: custom_collate(processor, batch))  # NOTE: set shuffle=False if you want deterministic ordering for testing

EricLe-dev commented 2 months ago

@alder-french-leviton Thank you so much for sharing. This is a very good start. I will start testing whether I can use your code to run on multiple GPUs, and I will share here if I find anything new.