miriamrebekah opened this issue 2 years ago
I am also seeing this error on any training I try with `simclr_info_nce_loss` or `multicrop_simclr_info_nce_loss`. It happens on the first epoch, likely when it is close to finishing.
```
File "/opt/conda/lib/python3.8/site-packages/vissl/losses/simclr_info_nce_loss.py", line 144, in forward
    pos = torch.sum(similarity * self.pos_mask, 1)
RuntimeError: The size of tensor a (992) must match the size of tensor b (1024) at non-singleton dimension 1
```
After looking at how the loss is implemented, I believe I have found the reason. Tensor "a" is simply an unfinished final batch, so you may have to remove the extra images or change the batch size.
I have 386032 images, 8 GPUs, and batch size = 64. Thus, in my case you can see that

(386032 % (8 * 64)) * 2 == 992
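The arithmetic above can be checked directly: with two SimCLR views per image, the leftover partial batch yields exactly the mismatched tensor sizes in the traceback (a minimal sketch; the numbers come from the setup described above):

```python
# Size of the last, partial global batch when the dataset size is not
# a multiple of (num_gpus * batch_size_per_gpu).
num_images = 386032
num_gpus = 8
batch_size = 64  # per GPU
views = 2        # SimCLR produces two augmented views of each image

global_batch = num_gpus * batch_size   # 512 images per step
leftover = num_images % global_batch   # 496 images in the final batch

print(leftover * views)       # size of tensor "a": 992
print(global_batch * views)   # size of the pos_mask, tensor "b": 1024
```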
@miriamrebekah It might be that when using 2 datasets, your total number of images is no longer a multiple of the global batch size.
P.S.: I have not tested this yet, so it is just a hunch for now.
@iseessel , will you be able to take a look at this ? :)
Hi @miriamrebekah It should be supported. While I investigate, as a temporary solution, would you be able to create one filelist and entry in your dataset_catalog that has both datasets? It should be as simple as concatenating the two filelists, saving it as a .npy file and creating a new entry in your dataset_catalog.
@Pedrexus Are you using config.DATA.TRAIN.DROP_LAST=True like in simclr_8node_resnet.yaml?
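The suggested workaround of merging the filelists can be sketched as follows (the paths and filenames here are hypothetical, and this assumes each filelist is a 1-D NumPy array of image paths saved as a .npy file):

```python
import numpy as np

# Toy filelists standing in for the two real datasets (hypothetical paths).
np.save("dataset_a_filelist.npy", np.array(["/data/a/img0.jpg", "/data/a/img1.jpg"]))
np.save("dataset_b_filelist.npy", np.array(["/data/b/img0.jpg"]))

# Concatenate the two filelists and save them as a single .npy file,
# which can then be registered as one new entry in the dataset_catalog.
list_a = np.load("dataset_a_filelist.npy")
list_b = np.load("dataset_b_filelist.npy")
combined = np.concatenate([list_a, list_b])
np.save("combined_filelist.npy", combined)

assert len(combined) == len(list_a) + len(list_b)
```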
Hello @iseessel.
Yes, I have DROP_LAST set to true, but it keeps giving me the same error. For now, I manually removed the extra images from the .npy file, and that solved it.
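The manual workaround of removing the extra images amounts to trimming the filelist to a multiple of the global batch size (a sketch with stand-in paths, using the 8-GPU, batch-size-64 setup from earlier in the thread):

```python
import numpy as np

global_batch = 8 * 64  # num_gpus * batch_size per GPU

# Stand-in filelist with the dataset size reported above.
filelist = np.array([f"/data/img{i}.jpg" for i in range(386032)])

# Drop the trailing images so the length divides evenly into global batches.
trimmed = filelist[: len(filelist) - len(filelist) % global_batch]
np.save("trimmed_filelist.npy", trimmed)

assert len(trimmed) % global_batch == 0
```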
@Pedrexus Is this the same config you had those errors with -- as this has DATA.TRAIN.DROP_LAST=False and DATA.TEST.DROP_LAST=False
Indeed I tested with a different config and forgot to update the right one. I just retried with DATA.TRAIN.DROP_LAST=True and DATA.TEST.DROP_LAST=True and it worked fine! No error!
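For reference, the relevant config fragment mirrors the keys discussed above (a sketch assuming the standard VISSL YAML layout, not a complete config):

```yaml
# Drop the final partial batch so every global batch has the same size,
# keeping the InfoNCE pos_mask shape consistent across steps.
DATA:
  TRAIN:
    DROP_LAST: True
  TEST:
    DROP_LAST: True
```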
Thanks for all your help!
Yes, I am doing this as a workaround! Thanks!
I'm trying to train on two datasets at once, using .npy filelist files for my datasets. Is training on multiple datasets at once supported? I put both in my config file, but I just keep getting this tensor error:
My config looks like this: