In the transformer, the pos embedding and vis embedding sizes do not match and cannot be added

wennyHou commented 2 years ago

I first built a model with the build_q2l function, then fed it a randn tensor as input, and found that the following problem occurs during the model's forward pass.

Traceback (most recent call last):
  File "debug.py", line 7, in <module>
    output = model(input)
  File "/mnt/data3/ai/miniconda/envs/hwy_ReceiptCls/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/data3/houwanyi/ReceiptCls/query2labels/lib/models/query2label.py", line 78, in forward
    hs = self.transformer(self.input_proj(src), query_input, pos)[0] # B,K,d
  File "/mnt/data3/ai/miniconda/envs/hwy_ReceiptCls/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/data3/houwanyi/ReceiptCls/query2labels/lib/models/transformer.py", line 107, in forward
    memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
  File "/mnt/data3/ai/miniconda/envs/hwy_ReceiptCls/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/data3/houwanyi/ReceiptCls/query2labels/lib/models/transformer.py", line 134, in forward
    output = layer(output, src_mask=mask,
  File "/mnt/data3/ai/miniconda/envs/hwy_ReceiptCls/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/data3/houwanyi/ReceiptCls/query2labels/lib/models/transformer.py", line 253, in forward
    return self.forward_post(src, src_mask, src_key_padding_mask, pos)
  File "/mnt/data3/houwanyi/ReceiptCls/query2labels/lib/models/transformer.py", line 217, in forward_post
    q = k = self.with_pos_embed(src, pos)
  File "/mnt/data3/houwanyi/ReceiptCls/query2labels/lib/models/transformer.py", line 208, in with_pos_embed
    return tensor if pos is None else tensor + pos
RuntimeError: The size of tensor a (1024) must match the size of tensor b (2048) at non-singleton dimension 2

sorrowyn commented 2 years ago

RuntimeError: The size of tensor a (1024) must match the size of tensor b (2048) at non-singleton dimension 2

wennyHou commented 2 years ago

RuntimeError: The size of tensor a (1024) must match the size of tensor b (2048) at non-singleton dimension 2

What could cause the image embedding size and the pos embedding size to differ, so that they cannot be added?
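One way to read this error: the addition in with_pos_embed is elementwise, so the two tensors have to agree under the broadcasting rule, and here they differ in the channel dimension. The sketch below is a small pure-Python re-implementation of that rule, not PyTorch's own code. The name broadcast_shape and the (tokens, batch, channels) layout are illustrative assumptions; only 1024 and 2048 come from the traceback.

```python
def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError on a mismatch.

    Rule: align the shapes from the right; two sizes are compatible
    if they are equal or if one of them is 1.
    """
    ndim = max(len(shape_a), len(shape_b))
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for dim in range(ndim - 1, -1, -1):  # scan from the trailing dimension
        if a[dim] != b[dim] and a[dim] != 1 and b[dim] != 1:
            raise ValueError(
                f"size of a ({a[dim]}) must match size of b ({b[dim]}) "
                f"at non-singleton dimension {dim}"
            )
        result.append(b[dim] if a[dim] == 1 else a[dim])
    return tuple(reversed(result))


# Hypothetical shapes: 144 tokens and batch size 2 are made up for illustration.
features = (144, 2, 1024)  # image features reaching the encoder
pos = (144, 2, 2048)       # position embedding
try:
    broadcast_shape(features, pos)
except ValueError as err:
    print(err)
```

Under these assumptions the check fails at dimension 2, the channel axis, which suggests the feature channels and the position embedding channels are configured independently of each other.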

sorrowyn commented 2 years ago

Try instantiating pos_embedding multiple times, or try this one, PositionEmbeddingLearned(nn.Module): https://github.com/facebookresearch/detr/blob/main/models/position_encoding.py
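For context, a learned 2D position embedding of this kind typically concatenates a column embedding and a row embedding per grid cell, so its channel count is twice num_pos_feats and does not depend on the backbone. The toy version below uses plain Python lists instead of nn.Embedding. The function name, the grid size, and the DETR-style choice num_pos_feats = hidden_dim // 2 are assumptions for illustration, not code taken from this repository.

```python
import random


def learned_pos_embedding(height, width, num_pos_feats, seed=0):
    """Toy learned 2D position embedding built from plain lists.

    Each grid cell gets concat(col_embed[x], row_embed[y]), so every cell
    carries 2 * num_pos_feats channels.
    """
    rng = random.Random(seed)
    col_embed = [[rng.uniform(0, 1) for _ in range(num_pos_feats)] for _ in range(width)]
    row_embed = [[rng.uniform(0, 1) for _ in range(num_pos_feats)] for _ in range(height)]
    return [[col_embed[x] + row_embed[y] for x in range(width)] for y in range(height)]


hidden_dim = 2048                # size of tensor b in the error message
num_pos_feats = hidden_dim // 2  # assumption: DETR-style split between x and y
pos = learned_pos_embedding(height=3, width=4, num_pos_feats=num_pos_feats)
print(len(pos), len(pos[0]), len(pos[0][0]))  # prints: 3 4 2048
```

If the repository builds its embedding the same way, switching the embedding class alone would leave the channel count at hidden_dim, so the mismatch may persist unless num_pos_feats or hidden_dim is changed as well.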

nekoosuki commented 2 years ago

Try again with the --keep_input_proj argument added.
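A log later in this thread prints "set model.input_proj to Indentify!", and the traceback shows the features passing through self.input_proj before the transformer. That suggests, though this thread does not confirm it, that without the flag the projection is replaced by an identity and the backbone channels reach the transformer unchanged. The sketch below models only that assumed behavior; the function name and its arguments are hypothetical.

```python
def channels_after_input_proj(backbone_channels, hidden_dim, keep_input_proj):
    """Channel count of the features handed to the transformer.

    Assumption: with keep_input_proj the model keeps a learned projection
    from backbone_channels to hidden_dim; without it the projection is an
    identity, so the backbone channels pass through unchanged.
    """
    return hidden_dim if keep_input_proj else backbone_channels


# 1024 and 2048 are the two sizes reported in the first traceback.
for keep in (False, True):
    feat = channels_after_input_proj(1024, 2048, keep_input_proj=keep)
    print(f"keep_input_proj={keep}: features={feat}, pos=2048, addable={feat == 2048}")
```

Under that assumption the flag resolves the mismatch without changing hidden_dim, at the cost of an extra projection layer.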

macqueen09 commented 2 years ago

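Same error. I hit it when reproducing single-GPU training with the code unmodified. swin_L_384_22k fails, while the resnet backbone does not. Could something have been changed in the last few commits without being tested?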

python3 -m torch.distributed.launch --nproc_per_node=1 \
main_mlc.py \
--backbone swin_L_384_22k --dataname coco14 --batch-size 8 --print-freq 100 \
--output "/home/bpfs/querry2" \
--world-size 1 --rank 0 --dist-url tcp://127.0.0.1:3717 \
--gamma_pos 0 --gamma_neg 2 --dtgfl \
--epochs 80 --lr 1e-4 --optim AdamW \
--num_class 80 --img_size 384 --weight-decay 1e-2 \
--cutout --n_holes 1 --cut_fact 0.5 \
--hidden_dim 2048 --dim_feedforward 8192 \
--enc_layers 1 --dec_layers 2 --nheads 4 \
--early-stop --amp

Error output:

No inplace_abn found, please make sure you won't use TResNet as backbone!
No inplace_abn found, please make sure you won't use TResNet as backbone!
single GPU train
| distributed init (local_rank 0): tcp://127.0.0.1:3717
[05/29 23:54:08.581]: Command: main_mlc.py --local_rank=0 --backbone swin_L_384_22k --dataname coco14 --batch-size 8 --print-freq 100 --output /home/bpfsrw3/makaili/models/querry2 --world-size 1 --rank 0 --dist-url tcp://127.0.0.1:3717 --gamma_pos 0 --gamma_neg 2 --dtgfl --epochs 80 --lr 1e-4 --optim AdamW --num_class 80 --img_size 384 --weight-decay 1e-2 --cutout --n_holes 1 --cut_fact 0.5 --hidden_dim 2048 --dim_feedforward 8192 --enc_layers 1 --dec_layers 2 --nheads 4 --early-stop --amp
[05/29 23:54:08.583]: Full config saved to /home/bpfsrw3/makaili/models/querry2/config.json
[05/29 23:54:08.583]: world size: 1
[05/29 23:54:08.584]: dist.get_rank(): 0
[05/29 23:54:08.584]: local_rank: 0
[05/29 23:54:08.584]: build model
build_q2l 1
build_backbone 2 swin_L_384_22k
00pretrained model
11pretrained model
22pretrained model
backbone done
build_backbone success 2
set model.input_proj to Indentify!
[05/29 23:54:27.831]: build model success
[05/29 23:54:33.135]: make criterion
Using Cutout!!!
loading annotations into memory...
Done (t=16.46s)
creating index...
index created!
loading annotations into memory...
Done (t=9.67s)
creating index...
index created!
len(train_dataset): 82783
len(val_dataset): 40504
/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 32 worker processes in total. Our suggested max number of worker in current system is 28, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))

Traceback (most recent call last):
  File "main_mlc.py", line 727, in <module>
    main()
  File "main_mlc.py", line 224, in main
    return main_worker(args, logger)
  File "main_mlc.py", line 351, in main_worker
    loss = train(train_loader, model, ema_m, criterion, optimizer, scheduler, epoch, args, logger)
  File "main_mlc.py", line 481, in train
    output = model(images)
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/bpfsrw3/makaili/project/query2labels/lib/models/query2label.py", line 78, in forward
    hs = self.transformer(self.input_proj(src), query_input, pos)[0] # B,K,d
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/bpfsrw3/makaili/project/query2labels/lib/models/transformer.py", line 108, in forward
    memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/bpfsrw3/makaili/project/query2labels/lib/models/transformer.py", line 135, in forward
    src_key_padding_mask=src_key_padding_mask, pos=pos)
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/bpfsrw3/makaili/project/query2labels/lib/models/transformer.py", line 247, in forward
    return self.forward_post(src, src_mask, src_key_padding_mask, pos)
  File "/home/bpfsrw3/makaili/project/query2labels/lib/models/transformer.py", line 214, in forward_post
    q = k = self.with_pos_embed(src, pos)
  File "/home/bpfsrw3/makaili/project/query2labels/lib/models/transformer.py", line 207, in with_pos_embed
    return tensor if pos is None else tensor + pos
RuntimeError: The size of tensor a (1536) must match the size of tensor b (2048) at non-singleton dimension 2
Killing subprocess 4198
Traceback (most recent call last):
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/bpfsrw3/makaili/software/py37n/bin/python3', '-u', 'main_mlc.py', '--local_rank=0', '--backbone', 'swin_L_384_22k', '--dataname', 'coco14', '--batch-size', '8', '--print-freq', '100', '--output', '/home/bpfsrw3/makaili/models/querry2', '--world-size', '1', '--rank', '0', '--dist-url', 'tcp://127.0.0.1:3717', '--gamma_pos', '0', '--gamma_neg', '2', '--dtgfl', '--epochs', '80', '--lr', '1e-4', '--optim', 'AdamW', '--num_class', '80', '--img_size', '384', '--weight-decay', '1e-2', '--cutout', '--n_holes', '1', '--cut_fact', '0.5', '--hidden_dim', '2048', '--dim_feedforward', '8192', '--enc_layers', '1', '--dec_layers', '2', '--nheads', '4', '--early-stop', '--amp']' returned non-zero exit status 1.

The key part is this:

  File "/home/bpfsrw3/makaili/project/query2labels/lib/models/transformer.py", line 214, in forward_post
    q = k = self.with_pos_embed(src, pos)
  File "/home/bpfsrw3/makaili/project/query2labels/lib/models/transformer.py", line 207, in with_pos_embed
    return tensor if pos is None else tensor + pos
RuntimeError: The size of tensor a (1536) must match the size of tensor b (2048) at non-singleton dimension 2
Killing subprocess 4198

zugofn commented 2 years ago

Try setting hidden_dim to 1536: --hidden_dim 1536
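Before launching a long run, a suggested value like this can be sanity-checked against two constraints: the features and the position embedding need the same channel count, and multi-head attention needs the embedding size to be divisible by the number of heads. The helper below is a hypothetical pre-flight check, not part of the repository; it assumes the position embedding has hidden_dim channels and that the features reach the transformer with the channel count shown as tensor a in the error.

```python
def check_transformer_dims(feature_channels, hidden_dim, nheads):
    """Return a list of problems; an empty list means the sizes agree."""
    problems = []
    if feature_channels != hidden_dim:
        problems.append(
            f"feature channels ({feature_channels}) != hidden_dim ({hidden_dim}): "
            "features and position embedding cannot be added"
        )
    if hidden_dim % nheads != 0:
        problems.append(
            f"hidden_dim ({hidden_dim}) is not divisible by nheads ({nheads})"
        )
    return problems


# 1536 is tensor a in the swin_L_384_22k error; --nheads 4 is from the command above.
print(check_transformer_dims(feature_channels=1536, hidden_dim=2048, nheads=4))
print(check_transformer_dims(feature_channels=1536, hidden_dim=1536, nheads=4))
```

With the numbers from this thread, the first call reports the mismatch and the second returns an empty list, since 1536 is also divisible by 4.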

verazuo commented 1 year ago

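You can refer to this config file provided by the author. You need to change hidden dim to 1024, dim_feedforward to 4096, and img_size to 384. The other detailed settings should not affect whether the code runs, but if you want to reproduce the results, they should probably match the author's settings as well.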

saishkomalla commented 1 year ago

Maybe this is outside the context of the original issue, but is there any concrete reason for suggesting that dim_feedforward be 4 * hidden_dim? Can they be the same, as in the paper, which uses d = d0 = 2432, and for every other model, say resnet50, d = d0 = 2048?!
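On the shape side of this question: a standard transformer feed-forward block maps d_model to dim_feedforward and back to d_model, so dim_feedforward does not need to match any other size, and setting it equal to hidden_dim is valid. The 4x ratio is a common convention that goes back to the original Transformer (512 and 2048). The sketch below only counts parameters for a few settings, assuming two linear layers with biases; it says nothing about which ratio gives better accuracy.

```python
def ffn_parameter_count(d_model, dim_feedforward):
    """Parameters of a two-layer feed-forward block: d_model -> dim_feedforward -> d_model."""
    first = d_model * dim_feedforward + dim_feedforward  # weights + bias
    second = dim_feedforward * d_model + d_model         # weights + bias
    return first + second


# (hidden_dim, dim_feedforward) pairs: two from this thread, one with equal sizes.
for d_model, dim_ff in [(2048, 8192), (1024, 4096), (2048, 2048)]:
    print(d_model, dim_ff, ffn_parameter_count(d_model, dim_ff))
```

The parameter count grows linearly with dim_feedforward, so the choice is mainly a capacity and memory trade-off and not a shape constraint.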

Jianghold commented 1 month ago

May I ask how the original poster solved this problem? I have run into the same problem.

SlongLiu / query2labels

In the transformer, the pos embedding and vis embedding sizes do not match and cannot be added #16