Nanne / pytorch-NetVlad

Pytorch implementation of NetVlad including training on Pittsburgh.

Question about the error “TypeError: Caught TypeError in DataLoader worker process 0.” #42

Closed sqw475sqw closed 3 years ago

sqw475sqw commented 3 years ago

Thanks for your great work!

I get errors after running the command: `python main.py --mode=train --arch=vgg16 --pooling=netvlad --num_clusters=64`

The errors are as follows:

```
====> Building Cache
Allocated: 60039168 Cached: 9596567552
/home/sqw/anaconda3/envs/pytorch1.4.0/lib/python3.7/site-packages/sklearn/neighbors/_base.py:622: UserWarning: Loky-backed parallel loops cannot be called in a multiprocessing, setting n_jobs=1
  n_jobs = effective_n_jobs(self.n_jobs)
(this warning is repeated several times, once per DataLoader worker, including after the traceback; repeats omitted)
Traceback (most recent call last):
  File "main.py", line 515, in <module>
    train(epoch)
  File "main.py", line 116, in train
    negCounts, indices) in enumerate(training_data_loader, startIter):
  File "/home/sqw/anaconda3/envs/pytorch1.4.0/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/sqw/anaconda3/envs/pytorch1.4.0/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/home/sqw/anaconda3/envs/pytorch1.4.0/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/home/sqw/anaconda3/envs/pytorch1.4.0/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/sqw/anaconda3/envs/pytorch1.4.0/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/sqw/anaconda3/envs/pytorch1.4.0/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/sqw/anaconda3/envs/pytorch1.4.0/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/sqw/anaconda3/envs/pytorch1.4.0/lib/python3.7/site-packages/torch/utils/data/dataset.py", line 257, in __getitem__
    return self.dataset[self.indices[idx]]
  File "/home/sqw/Desktop/pytorch-NetVlad-master/pittsburgh.py", line 230, in __getitem__
    negFeat = h5feat[negSample.tolist()]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/sqw/anaconda3/envs/pytorch1.4.0/lib/python3.7/site-packages/h5py/_hl/dataset.py", line 777, in __getitem__
    selection = sel.select(self.shape, args, dataset=self)
  File "/home/sqw/anaconda3/envs/pytorch1.4.0/lib/python3.7/site-packages/h5py/_hl/selections.py", line 82, in select
    return selector.make_selection(args)
  File "h5py/_selector.pyx", line 272, in h5py._selector.Selector.make_selection
  File "h5py/_selector.pyx", line 183, in h5py._selector.Selector.apply_args
TypeError: Indexing arrays must have integer dtypes
```

I suspect the Pittsburgh dataset may be misplaced.

Nanne commented 3 years ago

Your line numbers don't match up with the repo. I assume whatever change you made around here: https://github.com/Nanne/pytorch-NetVlad/blob/master/main.py#L85-L101 makes it so the training triplet gets stuck on trying to index the hdf5.
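For reference, the failing line in the traceback is `negFeat = h5feat[negSample.tolist()]` in pittsburgh.py, and h5py's fancy indexing only accepts integer-typed indices. A toy reproduction (not from the repo; the file name and shapes are made up) shows the same TypeError when the index array ends up with a float dtype:

```python
# Toy reproduction (not the repo's code): h5py fancy indexing rejects
# non-integer index arrays, matching the TypeError in the traceback above.
import h5py
import numpy as np

with h5py.File("toy_cache.hdf5", "w") as f:        # hypothetical stand-in for the feature cache
    h5feat = f.create_dataset("features", data=np.zeros((100, 512), dtype=np.float32))
    negSample = np.array([3.0, 7.0, 42.0])         # float dtype, e.g. from a miscast index array
    try:
        negFeat = h5feat[negSample.tolist()]       # same indexing pattern as pittsburgh.py
    except TypeError as err:
        print(err)                                 # "Indexing arrays must have integer dtypes" on h5py 3.x
```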

sqw475sqw commented 3 years ago

Thanks for your reply.

I don't think I modified the code you mention above. I suspect the cachePath is wrong, because I changed the command from

`python main.py --mode=train --arch=vgg16 --pooling=netvlad --num_clusters=64`

to

`python main.py --mode=train --arch=vgg16 --pooling=netvlad --num_clusters=64 --cachePath=/home/sqw/Desktop/pytorch-NetVlad-master`

and I also changed this line in main.py

`parser.add_argument('--cachePath', type=str, default=environ['TMPDIR'], help='Path to save cache to.')`

to

`parser.add_argument('--cachePath', type=str, help='Path to save cache to.')`

because I was getting the error `KeyError: 'TMPDIR'`.

sqw475sqw commented 3 years ago

To give more detail: as soon as I started running the command `python main.py --mode=cluster --arch=vgg16 --pooling=netvlad --num_clusters=64`, an error was reported as follows:

```
parser.add_argument('--cachePath', type=str, default=environ['TMPDIR'], help='Path to save cache to.')
  File "/home/sqw/anaconda3/envs/pytorch1.4/lib/python3.7/os.py", line 681, in __getitem__
    raise KeyError(key) from None
KeyError: 'TMPDIR'
```

What should I do? Could you give me some hints?

Nanne commented 3 years ago

The errors you're getting are quite descriptive. What's happening is that you need to specify a path where the feature cache can be stored; this is the `--cachePath` argument. By default it tries to use the location of the tmp dir, as found in `environ['TMPDIR']`. If this environment variable isn't set, then you get that KeyError.

To circumvent this error you can call main.py with a valid path for `--cachePath`; for example, you can overwrite the default by calling main.py with `--cachePath=.` to use the current directory for caching.
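If you'd rather keep a default, one option (a sketch, not the repo's current code) is to have argparse fall back to the system temp directory when `TMPDIR` isn't set, instead of indexing `os.environ` directly:

```python
# Sketch of a more forgiving default for --cachePath: fall back to the system
# temp directory when the TMPDIR environment variable is not set.
import argparse
import tempfile
from os import environ

parser = argparse.ArgumentParser()
parser.add_argument('--cachePath', type=str,
                    default=environ.get('TMPDIR', tempfile.gettempdir()),
                    help='Path to save cache to.')
```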

Anyway, this is not a support forum for how to use this code base (especially not when you've started modifying it); please only open issues to report bugs, etc.

Fleurrr commented 3 years ago

I think the h5py version caused the problem. When I use h5py 3.5.1, the same error occurs; downgrading to h5py 2.8.0 solved it for me!
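If downgrading isn't convenient, casting the index array to an integer dtype before indexing should also sidestep the stricter check in newer h5py. A small sketch against a toy cache (the file and dataset names here are made up, not the repo's):

```python
# Workaround sketch: cast the (possibly float-typed) indices to int before
# fancy-indexing the HDF5 feature cache, so newer h5py accepts them as well.
import h5py
import numpy as np

with h5py.File("toy_cache.hdf5", "w") as f:
    h5feat = f.create_dataset("features", data=np.zeros((100, 512), dtype=np.float32))
    negSample = np.array([3.0, 7.0, 42.0])            # float-typed indices
    negFeat = h5feat[negSample.astype(int).tolist()]  # integer indices work on old and new h5py
    print(negFeat.shape)                              # (3, 512)
```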