facebookresearch / detr

End-to-End Object Detection with Transformers
Apache License 2.0

Linformer Addition #80

Open AlexAndrei98 opened 4 years ago

AlexAndrei98 commented 4 years ago

Linformer addition thoughts 🤷🏻‍♂️

Hey, I just noticed that a linear transformer (Linformer) was just released! I think it would be interesting to add it to DETR: since we are doing object detection, our sequences will naturally be longer. If anyone would be interested in adding it to the architecture I would be grateful, since I am not very experienced 😅 in PyTorch!

https://github.com/tatp22/linformer-pytorch/blob/master/README.md

kuixu commented 4 years ago

@AlexAndrei98 Here is a practical implementation that adds Linformer to DETR by replacing nn.MultiheadAttention with Linformer's LinearMultiheadAttention:

https://github.com/kuixu/Linear-Multihead-Attention
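
For reference, the swap that repo suggests is roughly the following (the import path and the numbers are assumptions, not the repo's exact API; see its README):

from torch import nn
# hypothetical import path; check kuixu/Linear-Multihead-Attention for the real one
from linear_multihead_attention import LinearMultiheadAttention

d_model, nhead, dropout = 256, 8, 0.1
h, w = 20, 24  # spatial size of the flattened backbone feature map

# original DETR attention:
# self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
# Linformer-style replacement, projecting the h*w-long key/value sequence down to proj_k:
self_attn = LinearMultiheadAttention(d_model, nhead, dropout=dropout, seq_len=h * w, proj_k=128)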

EmmanouelP commented 3 years ago

I've tried integrating the linear attention method on a custom dataset, but the results were quite bad. Has anyone else tested this and can report comparable results?

alcinos commented 3 years ago

On COCO, in a preliminary experiment, I managed to get results within 1-2 mAP of the baseline model. One of the keys is to use the same projection matrix across all attention layers. Best of luck
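
A minimal sketch of what sharing one projection across layers could look like (SharedProjAttention and the reshaping here are illustrative assumptions, not the kuixu/Linear-Multihead-Attention API):

import torch
import torch.nn as nn

seq_len, k, d_model, nhead, num_layers = 20 * 24, 128, 256, 8, 6

shared_proj = nn.Linear(seq_len, k, bias=False)  # one projection E, created once

class SharedProjAttention(nn.Module):
    # every instance holds the *same* projection module, so its weights are shared
    def __init__(self, d_model, nhead, proj):
        super().__init__()
        self.proj = proj
        self.attn = nn.MultiheadAttention(d_model, nhead)

    def forward(self, q, key, value):
        # project keys/values along the sequence dimension: (seq_len, bs, d) -> (k, bs, d)
        key_p = self.proj(key.permute(1, 2, 0)).permute(2, 0, 1)
        value_p = self.proj(value.permute(1, 2, 0)).permute(2, 0, 1)
        return self.attn(q, key_p, value_p)[0]

layers = nn.ModuleList([SharedProjAttention(d_model, nhead, shared_proj) for _ in range(num_layers)])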

EmmanouelP commented 3 years ago

@alcinos Thank you very much for your thoughtful input. Do you know whether this implementation works when starting training from one of the pre-trained models the authors provide? I have tried that, and it looks like the model is training from scratch. I believe my dataset is too small (~8000 images for 3 classes) to achieve comparable results if I were to train from scratch.

alcinos commented 3 years ago

I don't think you can fine-tune a baseline model using linear attention. I'd suggest training a Linformer model from scratch on COCO and then fine-tuning it on your dataset.

EmmanouelP commented 3 years ago

@alcinos I've tried training from scratch on COCO, but so far I haven't noticed any decrease in the loss or in the class_error. In my understanding, the linear transformer implementation expects every image to have the same shape, so I discarded the original DETR transforms and reshape all my images to 629x751 before passing them to the backbone; the backbone outputs then have a final spatial shape of 20x24.
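
Roughly, such fixed-size preprocessing could look like this (a sketch only; it ignores the bounding-box rescaling that DETR's own transforms also handle):

import torchvision.transforms as T

# assumed fixed-size image preprocessing; 629x751 inputs give 20x24 feature maps
# with a stride-32 ResNet backbone (629/32 ~ 20, 751/32 ~ 24)
fixed_size_transform = T.Compose([
    T.Resize((629, 751)),  # (height, width)
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])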

Then I modify the transformer.py file to incorporate the LinearMultiheadAttention module as follows (as instructed):

# in TransformerEncoderLayer.__init__ (LinearMultiheadAttention imported from kuixu/Linear-Multihead-Attention):
# self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
self.self_attn = LinearMultiheadAttention(d_model, nhead, dropout=dropout, seq_len=20*24, proj_k=128, param_sharing='layerwise')  # 20*24 = h*w from `bs, c, h, w = src.shape`

# in TransformerDecoderLayer.__init__:
# self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
# self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
self.self_attn = LinearMultiheadAttention(d_model, nhead, dropout=dropout, seq_len=100, proj_k=128, param_sharing='layerwise')  # 100 = args.num_queries
self.multihead_attn = LinearMultiheadAttention(d_model, nhead, dropout=dropout, seq_len=20*24, proj_k=128, param_sharing='layerwise')  # 20*24 = h*w from `bs, c, h, w = src.shape`

Could you please provide any specific insight on the above? Thanks in advance.

alcinos commented 3 years ago

I have not experimented very much, so it's hard to give exact feedback, but I think it's better to use only one projection matrix shared across all layers. As for padding, you could in theory use the same transforms as we currently have: define your projection with its dimension set to the length of the largest sequence you may encounter, and if you get a smaller sequence you simply narrow the matrix. I honestly don't know what the best solution is. I would expect DETR to be more robust if you train it with varying sizes, but I don't have hard data to back this up.
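
As a rough sketch of the narrowing idea (names and shapes here are assumptions, not code from either repo):

import torch
import torch.nn as nn

max_seq_len, k = 40 * 40, 128  # largest flattened feature map expected, projection rank
E_full = nn.Parameter(torch.randn(max_seq_len, k) / k ** 0.5)  # one projection shared by all layers

def project_kv(x, E_full):
    # x: (seq_len, bs, d_model) with seq_len <= max_seq_len (padded batch)
    seq_len = x.shape[0]
    E = E_full.narrow(0, 0, seq_len)           # reuse the same weights, just sliced
    return torch.einsum('sbd,sk->kbd', x, E)   # projected keys/values: (k, bs, d_model)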

nicolasugrinovic commented 3 years ago

@EmmanouelP were you able to make Linformer or another linear transformer work with DETR?