Nanne / pytorch-NetVlad

Pytorch implementation of NetVlad including training on Pittsburgh.

What is the version of these package? #15

Closed ShilangChen1011 closed 5 years ago

ShilangChen1011 commented 5 years ago

What a remarkable work it is! But when I run the code, it always fails with the error "Out of Memory".

RuntimeError: CUDA error: out of memory

I guess it is a version problem. It also prints the following warning.

UserWarning: Loky-backed parallel loops cannot be called in a multiprocessing, setting n_jobs=1
  n_jobs = effective_n_jobs(self.n_jobs)

Can you tell me the versions of the following packages, or give me some advice to solve the problem?

PyTorch
Faiss
scipy
numpy
sklearn
h5py
tensorboardX
Nanne commented 5 years ago

How much memory does your graphics card have? You might need to reduce the batchSize if you don't have ~11GB of memory.
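For example, reducing both batch sizes on the command line might look like this (assuming the argparse flags match the option names used in this thread; the exact defaults may differ):

python main.py --mode=train --batchSize=2 --cacheBatchSize=12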


Faiss 1.5.0
scipy 1.3.1
numpy 1.16.4
sklearn 0.19.1 (I think?)
h5py 2.8.0
tensorboardX (shouldn't matter)
ShilangChen1011 commented 5 years ago

Our graphics cards are a GTX 1080Ti and an RTX 2080Ti. I tested each of them separately with the default settings, but the error above occurs on both.

ShilangChen1011 commented 5 years ago

In addition, what is the relationship between 'batchSize' and 'cacheBatchSize'?

Nanne commented 5 years ago

The entire dataset is passed through the model and stored for the nearest neighbour retrieval; this is the caching step. Caching doesn't require any triplets, so there the cacheBatchSize is exactly how many images get passed through at once.

Training does require triplets, so if the batchSize is 4, it actually passes 12 images through the network (each triplet being 1 anchor, 1 positive, and 1 negative). At which stage do you get the memory error, training or caching?
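For concreteness, a minimal sketch of that arithmetic (illustrative only, not the repo's actual collate_fn; the image shape is hypothetical):

import torch

# Each batch element is an (anchor, positive, negative) triplet, and all
# images are concatenated into one tensor before the encoder forward pass,
# so a "batch" of 4 really pushes 12 images through the network.
batchSize = 4
triplets = [tuple(torch.randn(3, 480, 640) for _ in range(3))
            for _ in range(batchSize)]

# Flatten the 4 triplets into a single batch of 12 images.
batch = torch.stack([img for triplet in triplets for img in triplet])
print(batch.shape)  # torch.Size([12, 3, 480, 640]) == 3 * batchSize images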

ShilangChen1011 commented 5 years ago

Thanks for your patience. The error occurs at the training stage.

ShilangChen1011 commented 5 years ago
/media/hdd/yike/anaconda3/envs/pytorch-netvlad/lib/python3.7/site-packages/sklearn/neighbors/base.py:421: UserWarning: Loky-backed parallel loops cannot be called in a multiprocessing, setting n_jobs=1
  n_jobs = effective_n_jobs(self.n_jobs)
/media/hdd/yike/anaconda3/envs/pytorch-netvlad/lib/python3.7/site-packages/sklearn/neighbors/base.py:421: UserWarning: Loky-backed parallel loops cannot be called in a multiprocessing, setting n_jobs=1
  n_jobs = effective_n_jobs(self.n_jobs)
/media/hdd/yike/anaconda3/envs/pytorch-netvlad/lib/python3.7/site-packages/sklearn/neighbors/base.py:421: UserWarning: Loky-backed parallel loops cannot be called in a multiprocessing, setting n_jobs=1
  n_jobs = effective_n_jobs(self.n_jobs)
Traceback (most recent call last):
  File "main.py", line 518, in <module>
    train(epoch)
  File "main.py", line 129, in train
    image_encoding = model.encoder(input)
  File "/media/hdd/yike/anaconda3/envs/pytorch-netvlad/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/hdd/yike/anaconda3/envs/pytorch-netvlad/lib/python3.7/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/media/hdd/yike/anaconda3/envs/pytorch-netvlad/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/hdd/yike/anaconda3/envs/pytorch-netvlad/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: out of memory

As above.

Nanne commented 5 years ago

What happens if you set batchSize to 1?

And what is the output from lines 109 and 110?

ShilangChen1011 commented 5 years ago

With batchSize set to 1 it works correctly, so the error was with my graphics card. Thanks for your patience.

JeonHyeongJunKW commented 3 years ago

The out-of-memory problem was eliminated when pin_memory was set to False in some of the DataLoader calls, for example in train(epoch):

training_data_loader = DataLoader(dataset=sub_train_set, num_workers=opt.threads,
                                  batch_size=opt.batchSize, shuffle=True,
                                  collate_fn=dataset.collate_fn, pin_memory=False)
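For context: pin_memory=True makes the DataLoader allocate page-locked host buffers so host-to-GPU copies are faster; setting it to False trades some transfer speed for lower pinned-memory pressure, which can avoid these allocation failures on memory-constrained machines.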