facebookresearch / detr

End-to-End Object Detection with Transformers
Apache License 2.0
13.61k stars 2.45k forks

Input image with more than 3 channels #291

Open alanlukezic opened 3 years ago

alanlukezic commented 3 years ago

First, thanks for sharing this great work! I want to use DETR for object detection on images with 4 input channels (RGB + 1 channel with additional information). I modified the ResNet backbone so that the first conv layer (conv1) takes 4 input channels instead of 3, and copied the weights for the first 3 input channels from the original conv1. I tried switching gradient propagation on and off in the first two backbone layers (which are originally not trained); the loss decreases a bit after a few epochs and then stays high. I also verified that the fourth channel is normalized and in the same range as the RGB channels. Any idea why the loss does not decrease as expected?

lvaleriu commented 3 years ago

That's interesting: I'm experimenting with the same thing (RGB + 1 channel with additional information). No loss decrease at all on my side.

olivierp9 commented 3 years ago

Do you have a code sample for this? I would like to try this as well.

djramakrishna commented 1 year ago

def build_backbone(args):
    position_embedding = build_position_encoding(args)
    train_backbone = args.lr_backbone > 0
    return_interm_layers = args.masks
    backbone = Backbone(args.backbone, train_backbone, return_interm_layers, args.dilation)
    # Swap the first conv layer of the ResNet body for one that accepts
    # input_image_channels channels. The new layer is freshly initialized,
    # so the pretrained conv1 weights are not reused here.
    for name, module in backbone.named_modules():
        if name == "body":
            module.conv1 = nn.Conv2d(input_image_channels, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    model = Joiner(backbone, position_embedding)
    model.num_channels = backbone.num_channels
    return model

@olivierp9 the above code block should work when added to backbone.py. Just replace 'input_image_channels' with the number of channels in your dataset images.
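
For the variant described in the original post (keeping the pretrained RGB filters rather than starting conv1 from scratch), a minimal sketch could look like the following. The helper name `expand_conv_in_channels` is hypothetical and not part of DETR, and initializing the extra channels with the mean of the RGB filters is only one possible choice, not something confirmed in this thread.

```python
import torch
from torch import nn


def expand_conv_in_channels(conv: nn.Conv2d, in_channels: int) -> nn.Conv2d:
    """Hypothetical helper: copy `conv` into a layer with more input channels."""
    if conv.groups != 1:
        raise ValueError("this sketch only handles groups=1")
    if in_channels < conv.in_channels:
        raise ValueError("in_channels must not be smaller than the original")

    new_conv = nn.Conv2d(
        in_channels,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        dilation=conv.dilation,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        # Reuse the original filters for the existing (e.g. RGB) channels.
        new_conv.weight[:, : conv.in_channels] = conv.weight
        extra = in_channels - conv.in_channels
        if extra > 0:
            # Assumption: start the extra channels from the mean of the
            # existing filters instead of random values.
            mean_w = conv.weight.mean(dim=1, keepdim=True)
            new_conv.weight[:, conv.in_channels :] = mean_w.expand(-1, extra, -1, -1)
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv
```

Assuming `backbone.body` exposes `conv1` as in the snippet above, this might be applied as `backbone.body.conv1 = expand_conv_in_channels(backbone.body.conv1, 4)`. The dataset normalization would also need a mean and standard deviation for the fourth channel.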