❓ Questions and Help
Training works when I use a single GPU, but when I try to train with multiple GPUs I get an error (batch size and num_workers are unchanged).
Here is the command line:
CUDA_VISIBLE_DEVICES=0,1,2,3 mmf_run config=projects/m4c/configs/textvqa/defaults.yaml datasets=textvqa model=m4c run_type=train_val
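For context, with `CUDA_VISIBLE_DEVICES=0,1,2,3` this command launches one training process per visible GPU (the traceback further down shows `mmf_cli/run.py` calling `torch.multiprocessing.spawn` with `nprocs=config.distributed.world_size`), and each of those ranks then builds its own dataloaders. A simplified sketch of that launch pattern, not the actual MMF code:

```python
# Simplified sketch of the multi-GPU launch (illustration only, not mmf_cli/run.py):
# one training process is spawned per visible GPU, and each of those ranks later
# creates its own DataLoader worker processes, so dataset memory is paid per rank
# and again per worker.
import torch
import torch.multiprocessing as mp


def distributed_main(device_id, world_size):
    torch.cuda.set_device(device_id)
    # ... each rank builds its datasets, model and trainer here, then trains ...


if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # 4 with CUDA_VISIBLE_DEVICES=0,1,2,3
    mp.spawn(distributed_main, args=(world_size,), nprocs=world_size)
```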
Here is the log:
2021-07-15T21:17:40 | INFO | mmf_cli.run : Namespace(config_override=None, local_rank=None, opts=['config=projects/m4c/configs/textvqa/defaults.yaml', 'datasets=textvqa', 'model=m4c', 'env.save_dir=/data/zhujj/projects/mmf_2080TI/save/test', 'run_type=train_val'])
2021-07-15T21:17:40 | INFO | mmf_cli.run : Torch version: 1.6.0
2021-07-15T21:17:40 | INFO | mmf.utils.general : CUDA Device 0 is: GeForce RTX 2080 Ti
2021-07-15T21:17:40 | INFO | mmf_cli.run : Using seed 40937610
2021-07-15T21:17:40 | INFO | mmf.trainers.mmf_trainer : Loading datasets
2021-07-15T21:18:53 | INFO | mmf.datasets.multi_datamodule : Multitasking disabled by default for single dataset training
2021-07-15T21:18:53 | INFO | mmf.datasets.multi_datamodule : Multitasking disabled by default for single dataset training
2021-07-15T21:18:53 | INFO | mmf.datasets.multi_datamodule : Multitasking disabled by default for single dataset training
2021-07-15T21:18:53 | INFO | mmf.trainers.mmf_trainer : Loading model
2021-07-15T21:18:57 | INFO | mmf.trainers.mmf_trainer : Loading optimizer
2021-07-15T21:18:57 | INFO | mmf.trainers.mmf_trainer : Loading metrics
2021-07-15T21:18:57 | WARNING | py.warnings : /data/zhujj/projects/mmf_2080TI/mmf/utils/distributed.py:396: UserWarning: No type for scheduler specified even though lr_scheduler is True, setting default to 'Pythia'
builtin_warn(*args, **kwargs)
2021-07-15T21:18:57 | WARNING | py.warnings : /data/zhujj/projects/mmf_2080TI/mmf/utils/distributed.py:396: UserWarning: No type for scheduler specified even though lr_scheduler is True, setting default to 'Pythia'
builtin_warn(*args, **kwargs)
2021-07-15T21:18:57 | WARNING | py.warnings : /data/zhujj/projects/mmf_2080TI/mmf/utils/distributed.py:396: UserWarning: scheduler attributes has no params defined, defaulting to {}.
builtin_warn(*args, **kwargs)
2021-07-15T21:18:57 | WARNING | py.warnings : /data/zhujj/projects/mmf_2080TI/mmf/utils/distributed.py:396: UserWarning: scheduler attributes has no params defined, defaulting to {}.
builtin_warn(*args, **kwargs)
2021-07-15T21:18:57 | INFO | mmf.trainers.core.device : Using PyTorch DistributedDataParallel
2021-07-15T21:18:57 | WARNING | py.warnings : /data/zhujj/projects/mmf_2080TI/mmf/utils/distributed.py:396: UserWarning: You can enable ZeRO and Sharded DDP, by installing fairscale and setting optimizer.enable_state_sharding=True.
builtin_warn(*args, **kwargs)
2021-07-15T21:18:57 | WARNING | py.warnings : /data/zhujj/projects/mmf_2080TI/mmf/utils/distributed.py:396: UserWarning: You can enable ZeRO and Sharded DDP, by installing fairscale and setting optimizer.enable_state_sharding=True.
builtin_warn(*args, **kwargs)
2021-07-15T21:19:02 | INFO | mmf.trainers.mmf_trainer : ===== Model =====
2021-07-15T21:19:02 | INFO | mmf.trainers.mmf_trainer : DistributedDataParallel( (module): M4C( ... full per-layer printout omitted: TextBert text encoder, object/OCR feature projections, MMT multimodal transformer, OcrPtrNet, and a 5000-way ClassifierLayer with M4CDecodingBCEWithMaskLoss ... ) )
2021-07-15T21:19:02 | INFO | mmf.utils.general : Total Parameters: 90850184. Trained Parameters: 90850184
2021-07-15T21:19:02 | INFO | mmf.trainers.core.training_loop : Starting training...
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/home/zhujj/.pycharm_helpers/pydev/pydevd.py", line 1452, in stoptrace
get_frame(), also_add_to_passed_frame=True, overwrite_prev_trace=True, dispatch_func=lambda *args:None)
File "/home/zhujj/.pycharm_helpers/pydev/pydevd.py", line 1170, in exiting
sys.stdout.flush()
ValueError: I/O operation on closed file.
Traceback (most recent call last):
File "/home/zhujj/.pycharm_helpers/pydev/pydevd.py", line 1758, in <module>
main()
File "/home/zhujj/.pycharm_helpers/pydev/pydevd.py", line 1752, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/home/zhujj/.pycharm_helpers/pydev/pydevd.py", line 1147, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/home/zhujj/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/data/zhujj/projects/mmf_2080TI/mmf_cli/run.py", line 137, in
run()
File "/data/zhujj/projects/mmf_2080TI/mmf_cli/run.py", line 129, in run
nprocs=config.distributed.world_size,
File "/home/zhujj/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/zhujj/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/home/zhujj/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 3 terminated with the following error:
Traceback (most recent call last):
File "/home/zhujj/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/data/zhujj/projects/mmf_2080TI/mmf_cli/run.py", line 66, in distributed_main
main(configuration, init_distributed=True, predict=predict)
File "/data/zhujj/projects/mmf_2080TI/mmf_cli/run.py", line 56, in main
trainer.train()
File "/data/zhujj/projects/mmf_2080TI/mmf/trainers/mmf_trainer.py", line 142, in train
self.training_loop()
File "/data/zhujj/projects/mmf_2080TI/mmf/trainers/core/training_loop.py", line 33, in training_loop
self.run_training_epoch()
File "/data/zhujj/projects/mmf_2080TI/mmf/trainers/core/training_loop.py", line 77, in run_training_epoch
for idx, batch in enumerate(self.train_loader):
File "/data/zhujj/projects/mmf_2080TI/mmf/datasets/multi_dataset_loader.py", line 161, in __iter__
self.iterators[key] = iter(loader)
File "/home/zhujj/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 291, in __iter__
return _MultiProcessingDataLoaderIter(self)
File "/home/zhujj/anaconda3/envs/mmf/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 737, in __init__
w.start()
File "/home/zhujj/anaconda3/envs/mmf/lib/python3.7/multiprocessing/process.py", line 112, in start
self._popen = self._Popen(self)
File "/home/zhujj/anaconda3/envs/mmf/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/home/zhujj/anaconda3/envs/mmf/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/home/zhujj/anaconda3/envs/mmf/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/home/zhujj/anaconda3/envs/mmf/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/home/zhujj/anaconda3/envs/mmf/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/home/zhujj/anaconda3/envs/mmf/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
MemoryError
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/zhujj/anaconda3/envs/mmf/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/home/zhujj/anaconda3/envs/mmf/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Process finished with exit code 1
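In case it helps narrow things down: the crash happens while a DataLoader worker process is being created — `ForkingPickler(file, protocol).dump(obj)` in `multiprocessing/reduction.py` raises `MemoryError` in the parent, and the half-started worker then reports `pickle data was truncated`. With the spawn start method each worker receives a pickled copy of the dataset object, so whatever the dataset keeps in host RAM is duplicated roughly once per worker per rank (4 ranks here). The snippet below is only my own minimal illustration of that code path, not MMF code:

```python
# Minimal illustration of the failing code path (my assumption, not MMF code):
# with a "spawn" multiprocessing context, every DataLoader worker is started by
# pickling the dataset object, so anything the dataset holds in host memory is
# copied once per worker (and once more per DDP rank). num_workers=0 skips
# worker processes entirely and never hits this pickling step.
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class BigInMemoryDataset(Dataset):
    def __init__(self, n=10000, dim=2048):
        # stand-in for annotations/features preloaded into RAM
        self.features = np.zeros((n, dim), dtype=np.float32)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return torch.from_numpy(self.features[idx])


if __name__ == "__main__":
    dataset = BigInMemoryDataset()
    # each worker here is created through the same reduction.dump() path as in the log
    loader = DataLoader(dataset, batch_size=32, num_workers=4,
                        multiprocessing_context="spawn")
    next(iter(loader))
```

If that is indeed the cause, lowering the dataloader worker count for the multi-GPU run (for example appending `training.num_workers=0` to the `mmf_run` command, assuming the standard MMF `training.num_workers` config key) or freeing host memory before launch should let training start, but I have not confirmed this on this setup.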