NVlabs / SegFormer

Official PyTorch implementation of SegFormer
https://arxiv.org/abs/2105.15203

Porting SegFormer to HuggingFace Transformers #20

Open NielsRogge opened 3 years ago

NielsRogge commented 3 years ago

Hi guys,

First of all thanks for this impressive (and simple) model!

I'd like to port this model to HuggingFace Transformers, which, as you might know, is a library that includes a lot of Transformer-based models (mostly NLP models like BERT and RoBERTa, but recently I've added the Vision Transformer (ViT), DeiT and DETR as well), so I think SegFormer definitely deserves its place there too!

The API I had in mind could look something like this (very similar to ViT):

from transformers import SegFormerFeatureExtractor, SegFormerForImageSegmentation
from PIL import Image
import requests

feature_extractor = SegFormerFeatureExtractor.from_pretrained("nvidia/segformer-b0-fine-tuned-ade-512-512")
model = SegFormerForImageSegmentation.from_pretrained("nvidia/segformer-b0-fine-tuned-ade-512-512")

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits # shape (batch_size, num_labels, height/4, width/4)

The main advantage would be that people could train the SegFormer model with ease in a Colab notebook, using a native PyTorch training loop or frameworks like PyTorch Lightning, HuggingFace Accelerate, etc., and also perform inference very easily as shown above. No scripts required!

The feature extractor is not meant to be a fully-fledged preprocessor; it would probably just need to resize + normalize images so that they can be fed to the model. I guess resizing to 512x512 is a good default option. I would perhaps include a post_process method that can be used to convert the model's logits into an actual semantic segmentation image.
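
For concreteness, something along these lines would be the minimal preprocessing (a rough sketch using torchvision; the 512x512 size and the ImageNet mean/std are just assumed defaults, not the final SegFormerFeatureExtractor API):

import torchvision.transforms as T
from PIL import Image
import requests

# assumed defaults: 512x512 target size, ImageNet normalization statistics
preprocess = T.Compose([
    T.Resize((512, 512)),
    T.ToTensor(),  # float tensor in [0, 1], shape (C, H, W)
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
pixel_values = preprocess(image).unsqueeze(0)  # (1, 3, 512, 512)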

All model checkpoints can be hosted for free on the hub, under the NVIDIA namespace (which currently includes models like Megatron-GPT-2).

Are you interested in helping me finish up this model? My main questions would be:

xieenze commented 3 years ago

Hi, thanks! It would be so great to have SegFormer in HuggingFace!

For your questions:

(0) Basic image + mask transformations for inference and fine-tuning on custom datasets

First, let me take lines 7-31 of the ade20k config as an example. For inference on a new image, you only need three transformations (see the attached config screenshot). For AlignedResize you can refer to (1).

For fine-tuning on other custom datasets, I believe random scale, crop and flip are necessary for augmentation (see the attached config screenshot).

Tip: the definitions of these transformations can be found here.

(1) For inference on a custom dataset, we have two modes:

  1. Whole-image test. For example, given a 256x300 image from ADE20K, we first scale the image so that its short side = 512 (256x300 --> 512x600), then align the shape so that it is divisible by 32 (512x600 --> 512x608); we call this step AlignedResize (see the sketch after this list).
  2. Sliding-window test. If the image is too large, e.g. Cityscapes with 1024x2048 resolution, we can use an overlapping sliding window over the image, e.g. window_size=1024x1024 and stride=768. Also make sure the window shape is divisible by 32.
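
For illustration, the size computation in mode 1 could be sketched like this (the function name is made up; the real logic lives in mmseg's transform classes):

import math

def aligned_size(height, width, short_side=512, size_divisor=32):
    """Scale so the short side becomes `short_side`, then round each
    dimension up to the nearest multiple of `size_divisor`."""
    scale = short_side / min(height, width)
    new_h, new_w = round(height * scale), round(width * scale)
    new_h = math.ceil(new_h / size_divisor) * size_divisor
    new_w = math.ceil(new_w / size_divisor) * size_divisor
    return new_h, new_w

print(aligned_size(256, 300))  # the ADE20K example above: 256x300 -> 512x600 -> (512, 608)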

(2) The mask shape:

The output feature map of the network is at 1/4 the resolution of the original image. We then need to upsample it to the input image's size and calculate the per-pixel classification loss.

(3) loss function:

Yes, we only use the CrossEntropyLoss defined here.
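
Put together, points (2) and (3) roughly amount to the following sketch (shapes and the ignore value are illustrative, not the exact mmseg code):

import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, num_labels, height, width = 2, 150, 512, 512
logits = torch.randn(batch_size, num_labels, height // 4, width // 4)
labels = torch.randint(0, num_labels, (batch_size, height, width))

# upsample the 1/4-resolution logits back to the label resolution
upsampled = F.interpolate(logits, size=(height, width), mode="bilinear", align_corners=False)

# per-pixel cross-entropy; 255 marks pixels to ignore (e.g. padding)
loss = nn.CrossEntropyLoss(ignore_index=255)(upsampled, labels)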

If you have other questions, feel free to let me know. Thanks!

NielsRogge commented 3 years ago

Ok, thanks for the detailed answer! Each model in HuggingFace Transformers requires 3 files to be implemented:

Actually, I've finished the modeling part (modeling_segformer.py). I've defined two models, SegFormerModel (which is the hierarchical Transformer encoder only) and SegFormerForImageSegmentation (which is SegFormerModel with the all-MLP decoder + classifier on top). This is to streamline the API with BERT for example, which has BertModel (Transformer encoder without any head on top) and BertForSequenceClassification (which is BertModel with a linear layer on top of the [CLS] token) - among other head models.

I'm now working on SegFormerFeatureExtractor (which can be used to prepare images + segmentation maps for the model). I'm not going to include random scale, crop and flip (if people want to use those, they can use torchvision's transforms for example). It will only define two necessary transformations, namely AlignedResize and normalize. I've replaced mmcv.rescale and mmcv.resize by self.resize within the feature extractor, as the SegFormerFeatureExtractor inherits from a class called ImageFeatureExtractionMixin that has a resize method implemented. I guess I can also remove the random_scale from the AlignedResize class, as each of the checkpoints has image scales defined.

However, I need to know what the image scale is for each of the fine-tuned checkpoints. Is it correct that the image scale is (2048, 512) for the AlignedResize for each of the ade20k released checkpoints, and (2048, 1024) for the Resize for each of the cityscapes checkpoints?

So what I basically need now is to take a dummy image (for example the one from the demo), resize + normalize it, forward it through both the original implementation and mine, and verify whether the logits are exactly equal. Is there an easy way to do this with the original implementation? I'd like to use demo.py, but that also seems to include a RandomFlip transformation (even though flip is set to False?).

xieenze commented 3 years ago

Hi, although random flip is defined in the config's test_pipeline, it is not used for inference unless you set aug-test=True (meaning multi-scale + flip test) in tools/test.py to evaluate the dataset.

But image_demo.py does not use Flip; it only contains AlignedResize + Normalize. No other steps are needed.

So you can directly compare the results of your implementation and the original one (image_demo.py).

About image_scale=(2048, 512): it means scaling the short side to 512 in most cases, but the longer side should be <= 2048 to avoid GPU OOM (because there are a few images with an extremely large aspect ratio). The scale_factor is defined as min(512/short_side, 2048/long_side).

Example 1: the original image is (256, 304); the scale_factor is min(512/256, 2048/304) = 2, so after AlignedResize it is (512, 608). Example 2: the original image is (256, 2048); the scale_factor is min(512/256, 2048/2048) = 1, so after AlignedResize it stays (256, 2048).

But it is fine to simply scale the short side to 512; Example 2's case is very rare.
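
In code, the rescaling step is roughly the following (a worked sketch of the formula above, before the align-to-32 step; not the mmcv implementation itself):

def rescaled_size(height, width, short=512, long=2048):
    # scale_factor = min(short / short_side, long / long_side)
    scale = min(short / min(height, width), long / max(height, width))
    return round(height * scale), round(width * scale)

print(rescaled_size(256, 304))   # Example 1: scale_factor = 2 -> (512, 608)
print(rescaled_size(256, 2048))  # Example 2: scale_factor = 1 -> (256, 2048)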

For Cityscapes, all images have the same shape (1024x2048), and the image_scale is also set to (1024, 2048). In this case the image will not be resized because scale_factor = 1.

NielsRogge commented 3 years ago

Ok great, thanks for the response. I've just finished the conversion script (which lets me convert the original checkpoints to their HuggingFace counterparts). Currently, it only complains about the following parameters, which are not converted:

RuntimeError: Error(s) in loading state_dict for SegFormerForImageSegmentation:
        Unexpected key(s) in state_dict: "conv_seg.weight", "conv_seg.bias".

I see that the SegFormer decoder head inherits from BaseDecodeHead, which defines the conv_seg linear layer. But does it actually use this layer? I see the linear layer for getting the logits is called linear_pred.
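
In the meantime, a possible workaround in the conversion script is to just drop those keys before loading (a sketch; the checkpoint path is assumed, and mmseg checkpoints usually nest the weights under a "state_dict" key):

import torch

checkpoint = torch.load("segformer.b0.512x512.ade.160k.pth", map_location="cpu")
state_dict = checkpoint.get("state_dict", checkpoint)

# drop the conv_seg parameters, which BaseDecodeHead defines but (apparently) SegFormer never uses
state_dict = {k: v for k, v in state_dict.items() if "conv_seg" not in k}

# ... rename the remaining keys as in the conversion script, then:
# model.load_state_dict(renamed_state_dict)   # or load_state_dict(..., strict=False)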

The documentation page will look like this:

(screenshot of the documentation page)

Next step now is to compare the implementations on the same image.

xieenze commented 3 years ago

Hi, the conv_seg layer is not used; you can remove it. The documentation page looks very nice!

NielsRogge commented 3 years ago

Ok thanks!

I'm currently testing my implementation and the original one on the same image (one from ADE20k). However, when comparing the pixel values prepared by SegFormerFeatureExtractor to the ones that are created in mmseg, it turns out these are not exactly equal.

The shapes are equal, as well as the initial values (printing pixel_values[0,:3,:3,:3])

# my implementation
tensor([[[-0.7993, -0.7993, -0.8164],
         [-0.7993, -0.7993, -0.8164],
         [-0.7993, -0.7993, -0.8164]],

        [[-0.1975, -0.1975, -0.2150],
         [-0.1975, -0.1975, -0.2150],
         [-0.1975, -0.1975, -0.2150]],

        [[ 0.6705,  0.6705,  0.6531],
         [ 0.6705,  0.6705,  0.6531],
         [ 0.6705,  0.6705,  0.6531]]])

# original implementation
tensor([[[-0.7993, -0.7993, -0.8164],
         [-0.7993, -0.7993, -0.8164],
         [-0.7993, -0.8164, -0.8164]],

        [[-0.1975, -0.1975, -0.2150],
         [-0.1975, -0.1975, -0.2150],
         [-0.1975, -0.2150, -0.2150]],

        [[ 0.6705,  0.6705,  0.6531],
         [ 0.6705,  0.6705,  0.6531],
         [ 0.6705,  0.6531,  0.6531]]], device='cuda:0')

And I've also checked the final values (pixel_values[0,-3:,-3:,-3:]); these are also equal. But comparing the sum of the pixel values (pixel_values.sum()), I get tensor(92383.2031) for my implementation (which uses PIL as backend) and tensor(89296.2734) for the original implementation. I see cv2 is used as a backend there, which might explain the difference. Will this have a big impact on performance?

xieenze commented 3 years ago

Hi, I am not sure whether using cv2 or PIL to read images will cause a slight difference.
I think you can visualize the results and compare them.
Also, you can calculate the IoU between your implementation and the original one. If they are almost the same, I believe there is no problem.
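
Something like this could work for the IoU comparison (a sketch; pred_a and pred_b would be the per-pixel argmax predictions of the two implementations):

import torch

def mean_iou(pred_a, pred_b, num_classes=150):
    """Mean IoU between two (H, W) label maps; classes absent from both are skipped."""
    ious = []
    for c in range(num_classes):
        a, b = pred_a == c, pred_b == c
        union = (a | b).sum().item()
        if union > 0:
            ious.append((a & b).sum().item() / union)
    return sum(ious) / len(ious)

# pred_a = logits_hf.argmax(dim=1)[0]        # hypothetical: HuggingFace implementation
# pred_b = logits_original.argmax(dim=1)[0]  # hypothetical: original implementation
# print(mean_iou(pred_a, pred_b))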

Chrisding commented 3 years ago

I see cv2 is used as a backend there, which might explain the difference. Will this have a big impact on performance?

Hi @NielsRogge, I once compared PIL vs. cv2 on another semantic segmentation project. They don't seem to introduce major differences (for either single-scale or multi-scale inference).

But I do notice some differences between cv2 and PIL in image resizing, as recently pointed out by Jun-Yan Zhu and Richard Zhang et al. In particular, PIL introduces anti-aliasing when downsampling while cv2 does not: https://twitter.com/junyanz89/status/1385654389872934926?s=20 https://github.com/GaParmar/clean-fid
Not sure if this is partly related to your question.

In my case, switching to PIL gives slight improvements on multi-scale testing (maybe it helps when images are resized to scale 0.5/0.75). But the improvements are marginal, and I'm not sure they're statistically significant.
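
For reference, the resizing difference is easy to reproduce with a quick check like the following (a sketch on the COCO cats image used earlier; not part of either implementation):

import cv2
import numpy as np
import requests
from PIL import Image

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
img = Image.open(requests.get(url, stream=True).raw).convert("RGB")
w, h = img.size
target = (w // 2, h // 2)  # (width, height) for both backends

# PIL's filters adapt their support when downscaling (anti-aliasing); cv2's bilinear does not
pil_out = np.array(img.resize(target, resample=Image.BILINEAR))
cv2_out = cv2.resize(np.array(img), target, interpolation=cv2.INTER_LINEAR)

print(np.abs(pil_out.astype(int) - cv2_out.astype(int)).mean())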

NielsRogge commented 3 years ago

Ok, thanks for the information. Yeah, in HuggingFace Transformers all feature extractors (ViTFeatureExtractor, DeiTFeatureExtractor, DetrFeatureExtractor) currently rely on PIL, and they are not meant to be fully-fledged preprocessors; for now they just support some basic operations (resizing, center cropping, normalizing images). I guess the results will not be significantly different, so it's safe to use PIL.

In terms of progress, my current implementation is giving me the same logits as the original implementation! Here's a notebook that performs inference on an image from the ADE20k dataset:

https://colab.research.google.com/drive/17i0XkXKYWgRGUd8J72jwPs_IDRmk0EIa?usp=sharing

Can you help me fix the visualization part? I've set the notebook to be editable.

Note: the Colab uses random weights for now; once the visualization part works I'll upload the first weights to HuggingFace's hub, and we'll get a nice segmentation map :)

I'm perhaps planning to add the visualization part to the feature extractor, such that people can simply do feature_extractor.show_results(image, logits). However, this will create an additional dependency on matplotlib, which I'm not sure the authors of HuggingFace are going to like.

Also an additional question: so during training, the model outputs logits of shape (batch_size, num_labels, height/4, width/4), and these need to be upscaled again to the original image size before computing the loss. So this is upscaling to the crop_size, right? Since all images are cropped and padded up to the same crop_size?

I will probably also add random cropping to the feature extractor, such that it can also be used to fine-tune on a custom dataset.

I also plan to add palettes to SegFormerConfig.

xieenze commented 3 years ago

Hi @NielsRogge, I have finished the vis code in the Colab, please check it.

If you do not want to involve matplotlib, you can save the image using PIL instead of showing it.
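
For example, something along these lines (a sketch that colorizes the predicted map with a palette and saves it with PIL only; logits and palette are assumed to come from the model and the dataset respectively):

import numpy as np
from PIL import Image

# predicted class per pixel, shape (H, W); `logits` has shape (1, num_labels, H, W)
seg = logits.argmax(dim=1)[0].cpu().numpy().astype(np.uint8)

# `palette` is assumed to be a (num_labels, 3) list/array of RGB colors (e.g. the ADE20K palette)
color_seg = np.array(palette, dtype=np.uint8)[seg]  # (H, W, 3)

Image.fromarray(color_seg).save("segmentation.png")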

During training, yes, the output feature map will be upsampled from (B,C,H/4,W/4) to (B,C,H,W) before calculating the loss.

NielsRogge commented 3 years ago

Ok great, thanks for looking into it!

Inference now works: https://colab.research.google.com/drive/1Aq2uelaRNubW1iduc2oh0kkUIYamgZkY?usp=sharing

I've uploaded weights of the b0 model to the hub as can be seen here. If the project is finished, I can upload all model variants to the NVIDIA namespace.

I think the main thing to work on to finish this is the feature extractor. So if I understand it correctly:

If you can confirm this, then I'll let the feature extractor support:

xieenze commented 3 years ago

Hi @NielsRogge

Your understanding is mostly correct; only for pad pixels, we ignore the loss and calculate it on valid pixels only.

NielsRogge commented 3 years ago

So the labels are set to -100 for pad pixels? Can you point me to where this happens in the code?

xieenze commented 3 years ago

No, from the config we can see that seg_pad_val=255, and from decode_head.py that ignore_index=255.

So we pad the seg_map with 255 and set ignore_index=255 instead of -100. But you can use any value, just make sure that seg_pad_val == ignore_index.
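
In other words, roughly (an illustrative sketch, not the mmseg code; the sizes are arbitrary):

import torch
import torch.nn.functional as F

seg_map = torch.randint(0, 150, (1, 480, 480))  # ground-truth labels for one image

# pad to the 512x512 crop size with seg_pad_val=255 ...
padded = F.pad(seg_map, (0, 512 - 480, 0, 512 - 480), value=255)

# ... and ignore exactly that value in the loss, so padded pixels contribute nothing
logits = torch.randn(1, 150, 512, 512)
loss = torch.nn.CrossEntropyLoss(ignore_index=255)(logits, padded)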

NielsRogge commented 3 years ago

Another question: when calculating the loss, the logits need to be upsampled again as shown here:

https://github.com/NVlabs/SegFormer/blob/93301b33d7b7634b018386681be3a640f5979957/mmseg/models/decode_heads/decode_head.py#L220-L224

Why are we taking seg_label.shape[2:]?

If I understand correctly, the input to the model is of shape (batch_size, num_channels, height, width) and the corresponding labels (ground truth segmentation maps of a batch of images) of shape (batch_size, height, width). So I would assume .shape[1:] instead of .shape[2:].

xieenze commented 3 years ago

Hi, shape[2:] gives [h, w], which means we upsample the seg_logit to the spatial shape of seg_label.

NielsRogge commented 3 years ago

Yes, I understand that. But why shape[2:] instead of shape[1:]? The seg_label has shape (batch_size, height, width), right, or not?

xieenze commented 3 years ago

I am not sure; maybe you need to check the size of seg_label. But the spatial size should be (h, w) anyway.

NielsRogge commented 3 years ago

Hi,

I'm also defining a SegFormerForImageClassification, as you can also use the SegFormer encoder to classify images. I see here that the classification head projects from the hidden size of the last block to num_labels.

However, the hidden_states of the last block are of shape (batch_size, embed_dim, height // 4, width // 4). So how are the classification logits computed from the last hidden states?

xieenze commented 3 years ago

Hi, it would be great if you can support image classification! Our method has a strong relationship with PVTv2. You can refer to lines 288-298 in the PVTv2 classification code.

In detail, for classification we only use the last-stage feature (with shape h/32 x w/32) and add a layer_norm -> global_pool -> linear classification head on top of that feature map.

By the way, our PVTv2 is also a very strong vision transformer backbone; would HuggingFace consider supporting it? If you can support SegFormerForImageClassification, it is technically super easy to support PVTv2 as well.

NielsRogge commented 3 years ago

By the way, our PVTv2 is also a very strong vision transformer backbone; would HuggingFace consider supporting it? If you can support SegFormerForImageClassification, it is technically super easy to support PVTv2 as well.

Yes, that should be possible. However, I'll first finish SegFormer. Another question: is the model trained with targets between 1 and 150? And does the model output labels between 1 and 150 (i.e. not between 0 and 149) for ADE20k?

Update: it seems the labels are reduced by one during training, so the model outputs labels between 0 and 149. Can you point me to where this happens in the code? Update v2: found it, here.

To do:

NielsRogge commented 3 years ago

@xieenze this is a notebook illustrating fine-tuning SegFormer on custom data: https://colab.research.google.com/drive/15JeOp3KxEjeTxG74DZc1cGZVaT5gPQjj?usp=sharing

However, I'm not sure whether it works properly already (loss is going down nicely, but inference results don't look good). Could you review my notebook?

Also, regarding the segmentation maps of the ADE20k dataset: am I reading these maps in the correct way?

Thanks!

xieenze commented 3 years ago

Hi, I find that during training the label values are in [0, 255], which is unreasonable. They should be in [0, 150] since ADE20K has 150 classes. I think there are some bugs in self.feature_extractor, because the values are correct when reading with PIL from local disk, but after self.feature_extractor they are incorrect.

NielsRogge commented 3 years ago

Thanks for looking into it.

Is this part wrong?

if self.reduce_zero_label:
    if segmentation_maps is not None:
        for idx, map in enumerate(segmentation_maps):
            if not isinstance(map, np.ndarray):
                map = np.array(map)
            # avoid using underflow conversion:
            # background (0) -> 255, then shift all other labels down by one
            map[map == 0] = 255
            map = map - 1
            map[map == 254] = 255
            segmentation_maps[idx] = Image.fromarray(map.astype(np.uint8))

So what I do is: convert each segmentation map to a NumPy array, and then reduce the labels as was done in the original code.

xieenze commented 3 years ago

I don't know how you calculate the loss when the value is 255. If you set 255 as ignore_index when calculating the loss, I think it is fine.

NielsRogge commented 3 years ago

Yes that's the case, as can be seen here. So then I'm mostly ready.

However, it's weird that the inference results don't look as expected after fine-tuning the model. Or is this expected given the number of epochs?

NielsRogge commented 3 years ago

The "seg_pad_val" is used in class Pad, but the "ignore_index" is not used in loss "CrossEntropyLoss" during training, so does it really ignore the ignore_index during training?

I'm not sure what you mean. I'm setting ignore_index equal to 255 when defining the CrossEntropyLoss, so labels having a value of 255 will be ignored. For now, I have not added padding of images (only rescaling, without AlignedResize, plus random cropping and normalization). It seems to work fine without padding, as all images have a size that's at least as big as the crop size. I will add padding later.

This part of the code:

# avoid using underflow conversion
map[map == 0] = 255
map = map - 1
map[map == 254] = 255

makes sure that background labels are ignored, right? All background labels, which are 0 in the official dataset, are replaced by 255.

I will debug the fine-tuning notebook today. Feel free to help me out :)

For example, I will add metrics like pixel-wise accuracy and mIoU to the training loop, taking into account the ignore_index. Do you know an easy way to calculate those?
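
Something like the following is what I have in mind (a rough sketch in plain PyTorch, treating 255 as the ignored label):

import torch

def pixel_accuracy_and_miou(pred, label, num_classes=150, ignore_index=255):
    """`pred` and `label` are (H, W) tensors of class indices."""
    valid = label != ignore_index
    acc = (pred[valid] == label[valid]).float().mean().item()

    ious = []
    for c in range(num_classes):
        p, l = (pred == c) & valid, (label == c) & valid
        union = (p | l).sum().item()
        if union > 0:
            ious.append((p & l).sum().item() / union)
    miou = sum(ious) / len(ious) if ious else float("nan")
    return acc, miou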

Also, at inference time I'm using AlignedResize (but the model is trained with Resize + keep_ratio=True); could this be the issue?

NielsRogge commented 3 years ago

I have my own implementation; I'm not using the mmseg code base.

troylane commented 3 years ago

@NielsRogge Thank you for your work on porting this to HuggingFace. How are you progressing with this? I have tested your implementation (based on your repo) on a custom dataset. It is doing a decent job (the use case is building rooftop extraction), but I find the edges of the segmented buildings could benefit from a better pre-trained model and from optimization/configuration. I am working on the optimization now. Will you be uploading the larger pre-trained models (b1-b5) to HuggingFace soon? As an example, here is an image illustrating inference based on the model I have generated. Again, thank you!! As you can see, not bad, but it needs improvement.

(screenshot: example inference result on a rooftop image)

NielsRogge commented 3 years ago

Hi,

Yeah I've tested it on a small dataset and it seems to work well. I'm currently working on another model, but once I have the time I'll add SegFormer to the Huggingface repo.

Really nice to see it works!! Thanks for trying it out

NielsRogge commented 3 years ago

Ok I'm finally done with adding other models to the library, I'll start working on adding SegFormer soon.

@troylane did you just use my notebook to fine-tune the model? How did you calculate pixel-wise accuracy?

troylane commented 3 years ago

@NielsRogge the metrics I used to evaluate performance were IoU, F1, precision, and recall. Note that the output for my use case was a predicted binary mask.

NielsRogge commented 3 years ago

Ok thanks. @xieenze I'm currently taking a look at converting the backbone-only checkpoints. You said that for pre-training (i.e. image classification on ImageNet-1k), you perform classification as follows:

In detail, for classification we only use the last-stage feature (with shape h/32 x w/32) and add a layer_norm -> global_pool -> linear classification head on top of that feature map.

However, a layer norm after the 4 stages is not included in the checkpoints (mit_b0.pth, for example, only includes norm1, norm2, norm3, norm4). So I have defined the image classification model myself as follows:

sequence_output = outputs[0] # last stage features, of shape (batch_size, num_channels, height, width)

# reshape to (batch_size, height*width, hidden_size)
batch_size = sequence_output.shape[0]
sequence_output = sequence_output.reshape(batch_size, -1, self.config.hidden_sizes[-1])

# global pooling
sequence_output = sequence_output.mean(dim=1)

logits = self.classifier(sequence_output)

However, this is not really giving plausible predictions (e.g. it predicts the class "paper towel" on an image of 2 cats). Where can I find the implementation? I can only find this, but it seems to be commented out.

Are you using another way of pooling, like nn.AvgPool2d or nn.AdaptiveAvgPool2d?

NielsRogge commented 3 years ago

@xieenze any update on this?

xieenze commented 3 years ago

Hi, sorry for forgetting to reply. You can check PVTv2's implementation; SegFormer's backbone design is almost the same as PVTv2's. It uses 'mean' to pool the tokens: https://github.com/whai362/PVT/blob/cceb465b7dfb2b7a48b39074a14a04dedab427e8/classification/pvt_v2.py#L292

NielsRogge commented 3 years ago

No problem. Ok, so what I did above seems to be correct. However, it doesn't seem to give reasonable predictions. What's the accuracy of these pretrained-only checkpoints on ImageNet-1k?

Is there an easy way to run these pre-trained only checkpoints?

UPDATE: fixed it :) It was a bug on my side (the last feature map shouldn't be reshaped). Now it's predicting "tabby cat" for the cats image :)

xieenze commented 3 years ago

Nice!

NielsRogge commented 3 years ago

I've uploaded all checkpoints to the hub: https://huggingface.co/models?other=segformer

Currently they're just under my namespace (nielsr), but if I get your approval, I can move them to the NVIDIA organization on the hub, such that people can load the models as follows:

from transformers import SegformerForSemanticSegmentation

model = SegformerForSemanticSegmentation.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")

Also, are you interested in being added to that organization on the hub? This will give you automatic write access to all NVIDIA models on HuggingFace's hub. This would allow you to update models for example, write a model card, etc.

Let me know if you're interested :)

I will open a PR soon on HuggingFace Transformers.

xieenze commented 3 years ago

That is nice!

Yeah, I believe you can move the models to the NVIDIA organization; that's good. You can add me to the organization on the hub.

Thank you for this great work!

NielsRogge commented 3 years ago

Thank you for this great work!

Thanks! Do you already have a username on hf.co? Then I can add you.

xieenze commented 3 years ago

I just created an account. My account name is 'xieenze'.

NielsRogge commented 3 years ago

Great, I've added you. PR can be found here: https://github.com/huggingface/transformers/pull/14019

Are you interested in writing model cards for these? As you might know, each model on HuggingFace's hub has its own git repository, so you can easily add a README using git add, git commit, git push.

I usually copy the README of an existing model on the hub (e.g. the one from ViT which can be found here), and then update it for another model.

xieenze commented 3 years ago

I'm afraid I don't have enough time recently to write model cards and such. Maybe after the CVPR deadline I'll have some time.