Open wennyHou opened 2 years ago
RuntimeError: The size of tensor a (1024) must match the size of tensor b (2048) at non-singleton dimension 2
What could cause the image embedding size and the pos size to differ, so that the two cannot be added?
Instantiate pos_embedding multiple times.
Or try this:
PositionEmbeddingLearned(nn.Module):
https://github.com/facebookresearch/detr/blob/main/models/position_encoding.py
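Try again with the --keep_input_proj argument added.
Same error. I am reproducing model training with unmodified code, and single-GPU training fails. swin_L_384_22k raises the error; the resnet backbone does not. Could something have been changed in the last few commits without being tested?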
python3 -m torch.distributed.launch --nproc_per_node=1 \
  main_mlc.py \
  --backbone swin_L_384_22k --dataname coco14 --batch-size 8 --print-freq 100 \
  --output "/home/bpfs/querry2" \
  --world-size 1 --rank 0 --dist-url tcp://127.0.0.1:3717 \
  --gamma_pos 0 --gamma_neg 2 --dtgfl \
  --epochs 80 --lr 1e-4 --optim AdamW \
  --num_class 80 --img_size 384 --weight-decay 1e-2 \
  --cutout --n_holes 1 --cut_fact 0.5 \
  --hidden_dim 2048 --dim_feedforward 8192 \
  --enc_layers 1 --dec_layers 2 --nheads 4 \
  --early-stop --amp
Error:
No inplace_abn found, please make sure you won't use TResNet as backbone!
No inplace_abn found, please make sure you won't use TResNet as backbone!
single GPU train
| distributed init (local_rank 0): tcp://127.0.0.1:3717
[05/29 23:54:08.581]: Command: main_mlc.py --local_rank=0 --backbone swin_L_384_22k --dataname coco14 --batch-size 8 --print-freq 100 --output /home/bpfsrw3/makaili/models/querry2 --world-size 1 --rank 0 --dist-url tcp://127.0.0.1:3717 --gamma_pos 0 --gamma_neg 2 --dtgfl --epochs 80 --lr 1e-4 --optim AdamW --num_class 80 --img_size 384 --weight-decay 1e-2 --cutout --n_holes 1 --cut_fact 0.5 --hidden_dim 2048 --dim_feedforward 8192 --enc_layers 1 --dec_layers 2 --nheads 4 --early-stop --amp
[05/29 23:54:08.583]: Full config saved to /home/bpfsrw3/makaili/models/querry2/config.json
[05/29 23:54:08.583]: world size: 1
[05/29 23:54:08.584]: dist.get_rank(): 0
[05/29 23:54:08.584]: local_rank: 0
[05/29 23:54:08.584]: build model
build_q2l 1
build_backbone 2 swin_L_384_22k
00pretrained model
11pretrained model
22pretrained model
backbone done
build_backbone success 2
set model.input_proj to Indentify!
[05/29 23:54:27.831]: build model success
[05/29 23:54:33.135]: make criterion
Using Cutout!!!
loading annotations into memory...
Done (t=16.46s)
creating index...
index created!
loading annotations into memory...
Done (t=9.67s)
creating index...
index created!
len(train_dataset): 82783
len(val_dataset): 40504
/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 32 worker processes in total. Our suggested max number of worker in current system is 28, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
Traceback (most recent call last):
File "main_mlc.py", line 727, in
This is the main part:
  File "/home/bpfsrw3/makaili/project/query2labels/lib/models/transformer.py", line 214, in forward_post
    q = k = self.with_pos_embed(src, pos)
  File "/home/bpfsrw3/makaili/project/query2labels/lib/models/transformer.py", line 207, in with_pos_embed
    return tensor if pos is None else tensor + pos
RuntimeError: The size of tensor a (1536) must match the size of tensor b (2048) at non-singleton dimension 2
Killing subprocess 4198
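The failing line is `tensor + pos`, an elementwise add that only works when the two shapes are broadcast-compatible. A rough plain-Python sketch of that check follows. The shapes are assumptions for illustration (a sequence-first layout with 144 tokens and batch size 8); only the 1536 and 2048 come from the traceback. `first_broadcast_mismatch` is a hypothetical helper, not part of PyTorch or this repo.

```python
from typing import Optional, Sequence


def first_broadcast_mismatch(shape_a: Sequence[int], shape_b: Sequence[int]) -> Optional[int]:
    """Return the index of a dimension that cannot be broadcast, or None.

    Shapes are aligned from the right and checked starting at the trailing
    dimension. Two sizes are compatible if they are equal or one of them is 1.
    """
    ndim = max(len(shape_a), len(shape_b))
    # Left-pad the shorter shape with 1s so both have the same rank.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    for dim in reversed(range(ndim)):
        if a[dim] != b[dim] and a[dim] != 1 and b[dim] != 1:
            return dim
    return None


# Assumed shapes: (tokens, batch, channels). Only the channel sizes are from the log.
src_shape = (144, 8, 1536)  # backbone features, left unprojected
pos_shape = (144, 8, 2048)  # positional encoding built from --hidden_dim 2048
print(first_broadcast_mismatch(src_shape, pos_shape))
```

Under these assumptions the mismatch is reported at dimension 2, the channel dimension, which matches the error message.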
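Try setting hidden_dim to 1536: --hidden_dim 1536
You can refer to this config file provided by the author. You need to change hidden_dim to 1024, dim_feedforward to 4096, and img_size to 384. The other detailed settings should not affect whether the code runs, but if you want to reproduce the results they should probably match the author's settings as well.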
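Both suggestions in this thread (--hidden_dim 1536 and --keep_input_proj) aim at the same condition: the feature dimension entering the transformer has to equal hidden_dim. Here is a minimal sketch of that check, assuming (as the log line "set model.input_proj to Indentify!" suggests) that without --keep_input_proj the backbone features are passed through unprojected. `check_hidden_dim` is a hypothetical helper, not a function in the repo.

```python
from typing import Optional


def check_hidden_dim(backbone_channels: int, hidden_dim: int, keep_input_proj: bool) -> Optional[str]:
    """Return a description of the mismatch, or None if the dims are consistent."""
    # Assumption: with keep_input_proj the features are projected to hidden_dim;
    # without it, input_proj is an identity and the backbone channels pass through.
    feature_dim = hidden_dim if keep_input_proj else backbone_channels
    if feature_dim != hidden_dim:
        return (
            f"feature dim {feature_dim} != hidden_dim {hidden_dim}: "
            f"pass --hidden_dim {feature_dim} or --keep_input_proj"
        )
    return None


# 1536 is the feature size reported in the traceback for swin_L_384_22k.
print(check_hidden_dim(1536, 2048, keep_input_proj=False))
print(check_hidden_dim(1536, 1536, keep_input_proj=False))
print(check_hidden_dim(1536, 2048, keep_input_proj=True))
```

Only the first call reports a problem; the other two correspond to the two fixes proposed above.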
This may be off topic for the original issue, but is there a concrete reason for suggesting that dim_feedforward be 4 * hidden_dim?
Can they be the same, as in the paper, which uses d = d0 = 2432, and d = d0 = 2048 for every other model, say resnet50?
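As far as shapes go, a standard transformer feed-forward block is two linear layers, hidden_dim -> dim_feedforward -> hidden_dim, so any dim_feedforward is shape-compatible, including one equal to hidden_dim. What changes is the parameter count. The arithmetic below is a sketch assuming that standard two-layer block with biases; `ffn_param_count` is a hypothetical helper, and this does not say which setting trains better.

```python
def ffn_param_count(hidden_dim: int, dim_feedforward: int) -> int:
    """Parameters of Linear(hidden_dim, dim_feedforward) + Linear(dim_feedforward, hidden_dim)."""
    first = hidden_dim * dim_feedforward + dim_feedforward   # weights + bias
    second = dim_feedforward * hidden_dim + hidden_dim       # weights + bias
    return first + second


# Settings mentioned in this thread, plus the d = d0 case from the question.
for d, d_ff in [(2048, 8192), (1024, 4096), (2048, 2048)]:
    print(f"hidden_dim={d}, dim_feedforward={d_ff}: {ffn_param_count(d, d_ff):,} params per FFN")
```

With hidden_dim 2048, moving from dim_feedforward 8192 to 2048 cuts the per-layer feed-forward parameters to roughly a quarter.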
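How did the original poster solve this problem? I ran into the same issue.
I first built a model with the build_q2l function, then passed a randn tensor as the model input, and found that this problem occurs during the model's forward pass.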