Closed CCYChongyanChen closed 4 years ago
We are verifying this on our side.
Hi @vedanuj, just checking in to see if there is any update. Thanks.
@CCYChongyanChen The fix has been merged to master. Can you verify if it solves your issue?
Thank you for working on that! Now I get the following error:
Traceback (most recent call last):
  File "/home/cc67459/MMF2/bin/mmf_run", line 33, in <module>

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf_cli/run.py", line 66, in distributed_main
    main(configuration, init_distributed=True, predict=predict)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf_cli/run.py", line 56, in main
    trainer.train()
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/trainers/mmf_trainer.py", line 108, in train
    self.training_loop()
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/trainers/core/training_loop.py", line 31, in training_loop
    self.run_training_epoch()
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/trainers/core/training_loop.py", line 59, in run_training_epoch
    for batch in self.train_loader:
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
What is the size of your max_features in config?
I hadn't set max_features in projects/movie_mcan/configs/vizwiz/defaults.yaml before your reply. I have set it to 608 now. (I will update later on whether it works, since other team members are using the GPUs at the moment and it currently complains that CUDA is out of memory.) Thanks a lot for your help!
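For reference, here is a minimal sketch (not MMF's actual code) of what a max_features-style setting has to accomplish, assuming it bounds the number of flattened grid locations kept per image; the helper name pad_grid_features is made up for illustration:

```python
import torch

def pad_grid_features(feat: torch.Tensor, max_features: int) -> torch.Tensor:
    """Hypothetical helper: flatten a [1, C, H, W] grid-feature tensor to
    [max_features, C], zero-padding or truncating so every sample in a
    batch ends up with the same first dimension."""
    c, h, w = feat.shape[-3:]
    flat = feat.reshape(c, h * w).t()          # [H*W, C]: one row per grid location
    if flat.size(0) >= max_features:
        return flat[:max_features]             # truncate extra locations
    pad = flat.new_zeros(max_features - flat.size(0), c)
    return torch.cat([flat, pad], dim=0)       # zero-pad up to max_features rows

# A 26x19 grid has 26 * 19 = 494 locations, so max_features must be at least 494.
padded = pad_grid_features(torch.randn(1, 2048, 26, 19), max_features=608)
print(padded.shape)  # torch.Size([608, 2048])
```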
@CCYChongyanChen How many GPUs are you using and what's their memory?
Sorry for the late reply. I am using 4 GPUs with 11178 MB of memory each, and I am still running out of memory. I have tried reducing the batch size, but it still doesn't work.
❓ Questions and Help
Hi, I am trying to run grid+MCAN. I extracted the grid features following https://github.com/facebookresearch/grid-feats-vqa and stored them as .pth files. Each .pth has a size of [1, 2048, 26, 19]. When I run the code, I get a RuntimeError: The expanded size of the tensor (25) must match the existing size (26) at non-singleton dimension 1. Target sizes: [2048, 25, 19]. Tensor sizes: [2048, 26, 19]
The full traceback is attached.
Traceback (most recent call last):
  File "/home/cc67459/MMF2/bin/mmf_run", line 33, in <module>
    sys.exit(load_entry_point('mmf', 'console_scripts', 'mmf_run')())
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf_cli/run.py", line 118, in run
    nprocs=config.distributed.world_size,
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf_cli/run.py", line 66, in distributed_main
    main(configuration, init_distributed=True, predict=predict)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf_cli/run.py", line 56, in main
    trainer.train()
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/trainers/mmf_trainer.py", line 108, in train
    self.training_loop()
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/trainers/core/training_loop.py", line 36, in training_loop
    self.run_training_epoch()
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/trainers/core/training_loop.py", line 67, in run_training_epoch
    for batch in self.train_loader:
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/common/batch_collator.py", line 24, in __call__
    sample_list = SampleList(batch)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/common/sample.py", line 129, in __init__
    self[field][idx] = self._get_data_copy(sample[field])
RuntimeError: The expanded size of the tensor (25) must match the existing size (26) at non-singleton dimension 1. Target sizes: [2048, 25, 19]. Tensor sizes: [2048, 26, 19]
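The traceback points at batch collation (sample.py line 129), where each sample's feature tensor is copied into a tensor shaped after the first sample in the batch. The snippet below is a standalone illustration of that failure mode, not MMF's code: grid features from images with different aspect ratios have different spatial sizes (here 25x19 vs 26x19), so the copy cannot broadcast and raises exactly this "expanded size" RuntimeError.

```python
import torch

# Two grid-feature tensors whose spatial grids differ (25x19 vs 26x19),
# as grid-feats-vqa produces for images with different aspect ratios.
features = [torch.randn(2048, 25, 19), torch.randn(2048, 26, 19)]

# Collation preallocates a batch tensor shaped like the first sample,
# then copies every sample into its slot.
batch = features[0].new_empty(len(features), *features[0].shape)
for idx, feat in enumerate(features):
    batch[idx] = feat
    # The second assignment fails:
    # RuntimeError: The expanded size of the tensor (25) must match the
    # existing size (26) at non-singleton dimension 1.
    # Target sizes: [2048, 25, 19]. Tensor sizes: [2048, 26, 19]
```

This is why the features need to be flattened and padded/truncated to a common length (the max_features discussion above) before they can be batched.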