facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

Movie+MCAN: grid feature size? (2048, 26, 19) vs (2048, 25, 19) #453

Closed CCYChongyanChen closed 4 years ago

CCYChongyanChen commented 4 years ago

❓ Questions and Help

Hi, I am trying to run grid+MCAN. I extracted the grid features following https://github.com/facebookresearch/grid-feats-vqa and stored them as .pth files; each .pth has size [1, 2048, 26, 19]. When I run the code, I get a RuntimeError: The expanded size of the tensor (25) must match the existing size (26) at non-singleton dimension 1. Target sizes: [2048, 25, 19]. Tensor sizes: [2048, 26, 19]
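A minimal sketch for inspecting one of the stored feature files (the file name below is hypothetical):

```python
import torch

# Load one extracted grid feature; the path is illustrative only.
feat = torch.load("features/VizWiz_train_000000001.pth", map_location="cpu")
print(feat.shape)  # e.g. torch.Size([1, 2048, 26, 19])

# The spatial grid (H, W) depends on the input image size, so H and W
# differ from file to file; only the channel dimension (2048) is fixed.
_, c, h, w = feat.shape
print(c, h, w, h * w)
```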

The full traceback is attached.

```
Traceback (most recent call last):
  File "/home/cc67459/MMF2/bin/mmf_run", line 33, in <module>
    sys.exit(load_entry_point('mmf', 'console_scripts', 'mmf_run')())
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf_cli/run.py", line 118, in run
    nprocs=config.distributed.world_size,
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf_cli/run.py", line 66, in distributed_main
    main(configuration, init_distributed=True, predict=predict)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf_cli/run.py", line 56, in main
    trainer.train()
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/trainers/mmf_trainer.py", line 108, in train
    self.training_loop()
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/trainers/core/training_loop.py", line 36, in training_loop
    self.run_training_epoch()
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/trainers/core/training_loop.py", line 67, in run_training_epoch
    for batch in self.train_loader:
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/common/batch_collator.py", line 24, in __call__
    sample_list = SampleList(batch)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/common/sample.py", line 129, in __init__
    self[field][idx] = self._get_data_copy(sample[field])
RuntimeError: The expanded size of the tensor (25) must match the existing size (26) at non-singleton dimension 1. Target sizes: [2048, 25, 19]. Tensor sizes: [2048, 26, 19]
```
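The mismatch is easy to reproduce outside MMF: when samples in a batch have different grid sizes, copying one sample's features into a buffer shaped after another sample's features raises exactly this error. A standalone sketch (not MMF code):

```python
import torch

# Two grid features with different spatial sizes, as produced per image.
first = torch.zeros(2048, 25, 19)
second = torch.zeros(2048, 26, 19)

# A batch buffer sized after the first sample cannot hold the second one.
batch = first.new_zeros(2, *first.shape)
batch[0] = first
batch[1] = second  # RuntimeError: expanded size (25) must match existing size (26)
```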

vedanuj commented 4 years ago

We are verifying this on our side.

CCYChongyanChen commented 4 years ago

> We are verifying this on our side.

Hi @vedanuj, just checking in to see if there is any update. Thanks.

vedanuj commented 4 years ago

@CCYChongyanChen The fix has been merged to master. Can you verify if it solves your issue?

CCYChongyanChen commented 4 years ago

> @CCYChongyanChen The fix has been merged to master. Can you verify if it solves your issue?

Thank you for working on that! Now I get this error:

```
Traceback (most recent call last):
  File "/home/cc67459/MMF2/bin/mmf_run", line 33, in <module>
    sys.exit(load_entry_point('mmf', 'console_scripts', 'mmf_run')())
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf_cli/run.py", line 118, in run
    nprocs=config.distributed.world_size,
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf_cli/run.py", line 66, in distributed_main
    main(configuration, init_distributed=True, predict=predict)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf_cli/run.py", line 56, in main
    trainer.train()
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/trainers/mmf_trainer.py", line 108, in train
    self.training_loop()
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/trainers/core/training_loop.py", line 31, in training_loop
    self.run_training_epoch()
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/trainers/core/training_loop.py", line 59, in run_training_epoch
    for batch in self.train_loader:
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/dataset.py", line 207, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/datasets/builders/vqa2/dataset.py", line 49, in __getitem__
    return self.load_item(idx)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/datasets/builders/vizwiz/dataset.py", line 19, in load_item
    sample = super().load_item(idx)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/datasets/builders/vqa2/dataset.py", line 85, in load_item
    features = self.features_db[idx]
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/datasets/databases/features_database.py", line 91, in __getitem__
    return self.get(image_info)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/datasets/databases/features_database.py", line 99, in get
    return self.from_path(feature_path)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/datasets/databases/features_database.py", line 107, in from_path
    features, infos = self._get_image_features_and_info(path)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/datasets/databases/features_database.py", line 80, in _get_image_features_and_info
    image_feats, infos = self._read_features_and_info(feat_file)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/datasets/databases/features_database.py", line 65, in _read_features_and_info
    feature, info = feature_reader.read(feat_file)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/datasets/databases/readers/feature_readers.py", line 81, in read
    return self.feat_reader.read(image_feat_path)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/datasets/databases/readers/feature_readers.py", line 103, in read
    padded_feat[:, :, : h * w] = feat
RuntimeError: The expanded size of the tensor (100) must match the existing size (494) at non-singleton dimension 2. Target sizes: [1, 2048, 100]. Tensor sizes: [2048, 494]
```
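The shape logic implied by this traceback (a sketch, not MMF's actual reader code) is that the (C, H, W) grid is flattened to (C, H*W) and copied into a buffer of length max_features, so max_features has to cover the largest H*W. Here 26 * 19 = 494, while the configured value was 100:

```python
import torch

# Sketch of the padding step, not MMF's actual implementation.
feat = torch.zeros(2048, 26, 19)
c, h, w = feat.shape                      # h * w == 494 grid positions

max_features = 494                        # must be >= h * w; 100 reproduces the error
padded_feat = torch.zeros(1, c, max_features)
padded_feat[:, :, : h * w] = feat.reshape(c, h * w)
print(padded_feat.shape)                  # torch.Size([1, 2048, 494])
```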

vedanuj commented 4 years ago

What is the size of your max_features in config?

CCYChongyanChen commented 4 years ago

> What is the size of your max_features in config?

I hadn't set max_features in projects/movie_mcan/configs/vizwiz/defaults.yaml before your reply. I have set it to 608 now. (I will report back on whether it works; other team members are using the GPUs right now and it complains that CUDA is out of memory.) Thanks a lot for your help!
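One way to pick a safe value is to scan the extracted features for the largest H * W (the directory and the [1, 2048, H, W] layout below are assumptions based on this thread):

```python
import glob
import torch

# Find the largest flattened grid size across the extracted .pth files;
# max_features needs to be at least this large. The directory is hypothetical.
max_hw = 0
for path in glob.glob("features/vizwiz/**/*.pth", recursive=True):
    feat = torch.load(path, map_location="cpu")
    _, _, h, w = feat.shape               # assumed layout: [1, 2048, H, W]
    max_hw = max(max_hw, h * w)

print("minimum safe max_features:", max_hw)
```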

vedanuj commented 4 years ago

@CCYChongyanChen How many GPUs are you using and what's their memory?

CCYChongyanChen commented 4 years ago

> @CCYChongyanChen How many GPUs are you using and what's their memory?

Sorry for the late reply. I am using 4 GPUs with 11178 MB each, and I am still hitting out-of-memory errors. I have tried reducing the batch size, but it still doesn't work.