Closed ShilangChen1011 closed 5 years ago
How much memory does your graphics card have? You might need to reduce the batchsize if you don't have ~11GB memory.
Faiss 1.5.0
scipy 1.3.1
numpy 1.16.4
sklearn 0.19.1 (I think?)
h5py 2.8.0
tensorboardX (shouldn't matter)
Our graphics cards are a GTX 1080Ti and an RTX 2080Ti. I tested each of them separately with the default settings, but the error above occurs.
In addition, what is the relationship between 'batchSize' and 'cacheBatchSize'?
The entire dataset is passed through the model and stored for the nearest-neighbour retrieval; this is the caching step. Caching doesn't require any triplets, so cacheBatchSize is exactly how many images get passed at once.
Training requires triplets, so if batchSize is 4 it actually passes 12 images through the network (each triplet is 1 anchor, 1 positive, and 1 negative). At which stage do you get the memory error, training or caching?
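As an illustrative sketch (not the repo's code), the number of images a training batch actually sends through the network can be computed like this, assuming one negative per triplet as described above:

```python
# Illustrative sketch, not the repo's code: how many images a training
# batch actually sends through the network when each sample is a triplet.
def images_per_batch(batch_size, n_neg=1):
    # each triplet contributes 1 anchor + 1 positive + n_neg negatives
    return batch_size * (2 + n_neg)

print(images_per_batch(4))  # batchSize 4 -> 12 images
```

This is why training runs out of memory before caching does: the effective image count is a multiple of batchSize, while cacheBatchSize is used as-is.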
Thanks for your patience. The error occurs at the training stage.
/media/hdd/yike/anaconda3/envs/pytorch-netvlad/lib/python3.7/site-packages/sklearn/neighbors/base.py:421: UserWarning: Loky-backed parallel loops cannot be called in a multiprocessing, setting n_jobs=1
n_jobs = effective_n_jobs(self.n_jobs)
Traceback (most recent call last):
File "main.py", line 518, in <module>
train(epoch)
File "main.py", line 129, in train
image_encoding = model.encoder(input)
File "/media/hdd/yike/anaconda3/envs/pytorch-netvlad/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/media/hdd/yike/anaconda3/envs/pytorch-netvlad/lib/python3.7/site-packages/torch/nn/modules/container.py", line 91, in forward
input = module(input)
File "/media/hdd/yike/anaconda3/envs/pytorch-netvlad/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/media/hdd/yike/anaconda3/envs/pytorch-netvlad/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 301, in forward
self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: out of memory
As above.
What happens if you set batchSize to 1?
And what is the output from lines 109 and 110?
It works. The error was with my graphics card. Thanks for your patience.
The out-of-memory problem was eliminated when pin_memory was set to False in some of the DataLoader calls.
Example, in train(epoch):
training_data_loader = DataLoader(dataset=sub_train_set, num_workers=opt.threads, batch_size=opt.batchSize, shuffle=True, collate_fn=dataset.collate_fn, pin_memory=False)
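A minimal self-contained sketch of the same change, with a dummy dataset standing in for sub_train_set and arbitrary sizes; the only relevant part is pin_memory=False:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for sub_train_set; pin_memory=False avoids
# pinning page-locked host memory for host-to-GPU transfers.
dataset = TensorDataset(torch.zeros(8, 3, 4, 4))
loader = DataLoader(dataset, batch_size=4, shuffle=True, pin_memory=False)
for (batch,) in loader:
    print(batch.shape)  # torch.Size([4, 3, 4, 4])
```

Note that pinned memory speeds up host-to-GPU copies but consumes non-pageable RAM, which can trigger out-of-memory failures on constrained systems; disabling it trades some transfer speed for stability.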
What remarkable work this is! But when I run the code, it always fails with an "Out of Memory" error.
I guess it is a version problem; it also prints the warning shown above.
Can you tell me the versions of the following packages, or give me some advice to solve the problem?