VICO-UoE / URL

Universal Representation Learning from Multiple Domains for Few-shot Classification - ICCV 2021, Cross-domain Few-shot Learning with Task-specific Adapters - CVPR 2022

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1) #9

Closed lourencobt closed 2 years ago

lourencobt commented 2 years ago

Hello,

I've trained the sdl networks from scratch, and then I tried to train the URL model from scratch. The program starts fine, but it suddenly breaks with this error:

2022-06-07 11:38:17.964494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29879 MB memory: -> device: 0, name: Tesla V100S-PCIE-32GB, pci bus id: 0000:00:05.0, compute capability: 7.0
  0%|          | 0/240000 [00:00<?, ?it/s]
2022-06-07 11:38:19.286408: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2022-06-07 11:38:47.998087: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:390] Filling up shuffle buffer (this may take a while): 169 of 1000
2022-06-07 11:38:52.983647: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:415] Shuffle buffer filled.
2022-06-07 11:39:26.450546: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:390] Filling up shuffle buffer (this may take a while): 116 of 1000
2022-06-07 11:39:30.279009: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:415] Shuffle buffer filled.
  0%|▍         | 1007/240000 [17:43<70:06:39, 1.06s/it]
Traceback (most recent call last):
  File "/home/guests/lbt/URL/train_net_url.py", line 224, in <module>
    train()
  File "/home/guests/lbt/URL/train_net_url.py", line 129, in train
    ft, fs = torch.nn.functional.normalize(stl_features[t_indx], p=2, dim=1, eps=1e-12), torch.nn.functional.normalize(mtl_features[t_indx], p=2, dim=1, eps=1e-12)
  File "/home/guests/lbt/.local/bin/.virtualenvs/few-shot/lib/python3.9/site-packages/torch/nn/functional.py", line 4637, in normalize
    denom = input.norm(p, dim, keepdim=True).clamp_min(eps).expand_as(input)
  File "/home/guests/lbt/.local/bin/.virtualenvs/few-shot/lib/python3.9/site-packages/torch/_tensor.py", line 498, in norm
    return torch.norm(self, p, dim, keepdim, dtype=dtype)
  File "/home/guests/lbt/.local/bin/.virtualenvs/few-shot/lib/python3.9/site-packages/torch/functional.py", line 1590, in norm
    return _VF.norm(input, p, _dim, keepdim=keepdim)  # type: ignore[attr-defined]
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

I wasn't able to find the problem. Can you help me?

Also, the estimated program execution time is between 150 and 200 hours. Is that normal, or is something wrong? I am training on a single Tesla V100S-PCIE-32GB GPU.

WeiHongLee commented 2 years ago

This is probably caused by having only one sample in a batch for some datasets. Can you check the number of samples in each batch? Alternatively, you can edit ./meta-dataset/data/reader.py in the meta-dataset repository and change dataset = dataset.batch(batch_size, drop_remainder=False) to dataset = dataset.batch(batch_size, drop_remainder=True). (In our work we drop the remainder so that we never use a very small batch for some domains.)
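
For illustration, here is a minimal sketch of this failure mode; the feature shapes and the squeeze are illustrative assumptions, not the exact code path in the repository:

    import tensorflow as tf
    import torch
    import torch.nn.functional as F

    # With drop_remainder=False, a domain whose sample count is not a
    # multiple of the batch size ends with a batch of a single sample:
    ds = tf.data.Dataset.range(33).batch(16, drop_remainder=False)
    print([int(batch.shape[0]) for batch in ds])  # [16, 16, 1]

    # If that lone sample's features are later squeezed down to a 1-D
    # tensor, normalizing over dim=1 raises exactly the reported error:
    lone_features = torch.randn(1, 512).squeeze(0)  # shape: (512,)
    F.normalize(lone_features, p=2, dim=1, eps=1e-12)
    # IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)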

Hope this can help! Wei-Hong

lourencobt commented 2 years ago

Well, I did as you said, and dropping the remainder solved the problem. However, shouldn't the code run correctly regardless of the drop_remainder flag?

The URL training is still very slow, though. Is that normal? Can you give an estimate of how long it took you?

WeiHongLee commented 2 years ago

Hi, I've updated the code, and the problem should now be fixed even when you don't drop the remainder. As mentioned in the README, we drop the remainder in our work, and I recommend using the same setup to reproduce the URL results.
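
One way to make the normalization robust when the remainder is kept is to restore the batch dimension before normalizing. A minimal sketch of that idea (the helper name safe_l2_normalize is illustrative, and this is not necessarily the exact change in the repository):

    import torch
    import torch.nn.functional as F

    def safe_l2_normalize(features: torch.Tensor) -> torch.Tensor:
        # Restore the batch dimension if a remainder batch with a single
        # sample was squeezed down to a 1-D feature vector.
        if features.dim() == 1:
            features = features.unsqueeze(0)
        return F.normalize(features, p=2, dim=1, eps=1e-12)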

In our experiment, it took around 48 hours to learn the URL model. The time cost depends on the hardware you use, and you should be able to see the estimated time in the progress bar. You can also download our pre-trained model.

lourencobt commented 2 years ago

Thanks a lot! I will experiment later, but I will reproduce the URL results using the same setup.

Regarding the time cost, I just needed an estimate for reference. Thank you.

If I may give a suggestion, you could add the time cost of each training stage and the hardware used to the README for reference.

WeiHongLee commented 2 years ago

Many thanks for the suggestions!

WeiHongLee commented 2 years ago

BTW, the training time details can be found in the supplementary material of our paper (page 17 of the arXiv version).