Closed: seongwook-ham closed this issue 4 years ago
Deduplicating to https://github.com/NVIDIA/apex/issues/227
In my case there is little difference: O0, O1, O3, and O4 all fail to work.
Understood, sorry. I'll keep these both open to track DataParallel issues. It might be a couple weeks before we can take a detailed look at this though (https://github.com/NVIDIA/apex/issues/227#issuecomment-486045751)
I also tried torch.nn.parallel.DistributedDataParallel (not apex.parallel.DistributedDataParallel) with 1 node and 8 GPUs. When used without apex it works normally, but with apex the following error occurs (almost the same case as DataParallel with apex):
Traceback (most recent call last): | 0/6747 [00:00<?, ?it/s]
File "run_pretrain_amp.py", line 1265, in
I find that the old API (FP16_Optimizer) works well with nn.DataParallel. I hope the new API (amp) will also work with nn.DataParallel.
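For reference, this is roughly how I use the old API with nn.DataParallel (a minimal sketch with a placeholder model instead of my BERT model; the FP16_Optimizer argument names are from memory and may differ slightly):

import torch
from apex.fp16_utils import FP16_Optimizer

# Placeholder model for illustration; any module with an nn.Embedding will do.
model = torch.nn.Sequential(torch.nn.Embedding(1000, 64), torch.nn.Linear(64, 2)).cuda().half()
model = torch.nn.DataParallel(model)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)

input_ids = torch.randint(0, 1000, (8, 16)).cuda()
loss = model(input_ids).float().mean()
optimizer.backward(loss)   # FP16_Optimizer replaces loss.backward() and handles loss scaling
optimizer.step()

Since FP16_Optimizer only wraps the optimizer and does not touch the model's forward, DataParallel replication is unaffected, which seems to be why this path works for me.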
I can confirm this issue. Running Python 3.7, PyTorch 1.2, CUDA 10.1. Hardware is 4x V100. The issue arises for O2 but not for O1. Perhaps interestingly, I also get the error for torch.embedding, just like OP. Full trace:
Traceback (most recent call last):
File "predict.py", line 279, in <module>
predictor.predict()
File "predict.py", line 140, in predict
best_model_f, fig = trainer.train(epochs=opts['training']['epochs'])
File "/home/bram/Python/projects/transformer-classifiers/transformer_classifiers/trainer.py", line 249, in train
train_loss, train_results = self._process('train', epoch)
File "/home/bram/Python/projects/transformer-classifiers/transformer_classifiers/trainer.py", line 333, in _process
preds = self.model(input_ids, attention_mask=input_mask)
File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/_utils.py", line 369, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/apex/amp/_initialize.py", line 194, in new_fwd
**applier(kwargs, input_caster))
File "/home/bram/Python/projects/transformer-classifiers/transformer_classifiers/transformer_models.py", line 110, in forward
out = self.base_model(input_ids=input_ids, attention_mask=attention_mask)
File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/transformers/modeling_openai.py", line 418, in forward
inputs_embeds = self.tokens_embed(input_ids)
File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 114, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/functional.py", line 1467, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:397
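For completeness, a stripped-down sketch of my setup (placeholder model here instead of the transformers model, but the structure is the same); with opt_level='O1' this runs fine for me, while 'O2' dies in the embedding lookup exactly as in the trace above:

import torch
from apex import amp

# Placeholder model; my real model wraps a transformers model, but any nn.Embedding triggers it.
model = torch.nn.Sequential(torch.nn.Embedding(1000, 64), torch.nn.Linear(64, 2)).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model, optimizer = amp.initialize(model, optimizer, opt_level='O2')   # 'O1' works, 'O2' does not
model = torch.nn.DataParallel(model)

input_ids = torch.randint(0, 1000, (8, 16)).cuda()
preds = model(input_ids)   # RuntimeError: arguments are located on different GPUs (replica 1)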
Same issue
I found the root cause: forward must be patched after the DataParallel(...) call (otherwise the patched method refers to the old model object and not the dynamically created replica). Maybe some other way of patching exists that would work fine with DP, but definitely not the straightforward way in https://github.com/NVIDIA/apex/blob/master/apex/amp/_initialize.py#L201
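Roughly, what the patching amounts to (a paraphrase, not the exact apex code) and why it breaks under DP:

# Paraphrase of the forward patching in apex/amp/_initialize.py (not the exact code):
def patch_forward(model, input_caster, output_caster, applier):
    old_fwd = model.forward                  # bound method of the original module (lives on cuda:0)
    def new_fwd(*args, **kwargs):
        output = old_fwd(*applier(args, input_caster),
                         **applier(kwargs, input_caster))
        return applier(output, output_caster)
    model.forward = new_fwd                  # instance attribute, copied onto every DataParallel replica

# Each replica's copied new_fwd still calls old_fwd, i.e. the original cuda:0 module,
# with inputs that were scattered to cuda:1, cuda:2, ...
# -> "RuntimeError: arguments are located on different GPUs"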
The workaround I found:
model = apex.amp.initialize(torch.nn.Sequential(model), opt_level = 'O2')[0]
model = torch.nn.DataParallel(model, device_ids = args.devices)
model.forward = lambda *args, old_fwd = model.forward, input_caster = lambda tensor: tensor.to(apex.amp._amp_state.opt_properties.options['cast_model_type']), output_caster = lambda tensor: tensor.to(apex.amp._amp_state.opt_properties.options['cast_model_outputs'] if apex.amp._amp_state.opt_properties.options.get('cast_model_outputs') is not None else torch.float32), **kwargs: apex.amp._initialize.applier(old_fwd(*apex.amp._initialize.applier(args, input_caster), **apex.amp._initialize.applier(kwargs, input_caster)), output_caster)
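The same thing unpacked into a named helper, in case the one-liner is hard to read (functionally equivalent and using exactly the same apex internals; the helper name is arbitrary):

import torch
import apex.amp

def input_caster(tensor):
    return tensor.to(apex.amp._amp_state.opt_properties.options['cast_model_type'])

def output_caster(tensor):
    out_type = apex.amp._amp_state.opt_properties.options.get('cast_model_outputs')
    return tensor.to(out_type if out_type is not None else torch.float32)

def patch_dataparallel_forward(dp_model):
    # Patch forward on the DataParallel wrapper itself, after wrapping,
    # so the replicas keep their own unpatched forward methods.
    old_fwd = dp_model.forward
    def new_fwd(*args, **kwargs):
        return apex.amp._initialize.applier(
            old_fwd(*apex.amp._initialize.applier(args, input_caster),
                    **apex.amp._initialize.applier(kwargs, input_caster)),
            output_caster)
    dp_model.forward = new_fwd
    return dp_model

# Usage, same as the one-liner above:
#   model = apex.amp.initialize(torch.nn.Sequential(model), opt_level='O2')[0]
#   model = torch.nn.DataParallel(model, device_ids=args.devices)
#   model = patch_dataparallel_forward(model)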
@mcarilli
This is very useful information and I haven't been ignoring it, but to be honest I'm probably not going to implement a fix in Apex soon. My absolute top priority right now is getting automatic mixed precision into Pytorch natively, which will eliminate all extension building/version matching issues. I'm taking care to ensure the native integration will support DistributedDataParallel, DataParallel, and model parallel usage. We are targeting the 1.5 release: pytorch/pytorch#25081
Gradient scaling and autocasting will be independently-usable components. The gradient scaling PR is mature, awaiting final documentation review: pytorch/pytorch#26512
The autocasting PR is about 3/4 done in terms of op coverage: pytorch/pytorch#29552
Autocasting will likely be exposed via a context manager that can be used to locally enable/disable mixed precision for any desired regions of the model.
If you are having problems with the current incarnation of Apex, my best advice is to wait for the PRs to be merged. Getting native mixed precision support as soon as possible is the best path forward for everyone IMO.
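To give a rough idea of the intended shape of the native API (a sketch only; the names below follow the in-flight torch.cuda.amp PRs and could still change before release):

import torch

device = torch.device("cuda")
model = torch.nn.Linear(16, 4).to(device)            # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()                  # gradient scaling, usable on its own

inputs = torch.randn(8, 16, device=device)
targets = torch.randint(0, 4, (8,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast():                       # the context manager mentioned above
    loss = loss_fn(model(inputs), targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

Because nothing here patches the model object, the same pattern is intended to compose with DataParallel, DistributedDataParallel, and model parallel usage.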
I've seen these PRs and hope they are merged soon :)
If you are doing some forward patching in those PRs, the same issue may bite you there too...
The upstream integration does not use patching. It does not directly alter methods or attributes of model objects. But it's crucial to be aware of that as a potential issue, so I appreciate the help!
Similar to #227. I already checked that DistributedDataParallel works well, but in my case the dataset is large (>200GB) and with multi-GPU (8 GPUs) DistributedDataParallel needs at least 1.6TB of RAM (one copy of the dataset per process), am I right? I only have 512GB of RAM, so I need to use DataParallel. In the same code, DataParallel without apex works normally, and DistributedDataParallel with apex also works normally, but DataParallel with apex throws the following error:
Traceback (most recent call last): | 0/333 [00:00<?, ?it/s]
File "run_pretrain_amp.py", line 1140, in <module>
main()
File "run_pretrain_amp.py", line 1038, in main
loss = model(masked_input_ids, segment_ids, input_mask, masked_lm_labels, label_ids)
File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, kwargs)
File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
output = module(*input, **kwargs)
File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/apex/amp/_initialize.py", line 193, in new_fwd
**applier(kwargs, input_caster))
File "/home/kizunasunhy/bert_seongwook_v1/modeling.py", line 713, in forward
output_all_encoded_layers=False)
File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, *kwargs)
File "/home/kizunasunhy/bert_seongwook_v1/modeling.py", line 641, in forward
embedding_output = self.embeddings(input_ids, token_type_ids)
File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(input, kwargs)
File "/home/kizunasunhy/bert_seongwook_v1/modeling.py", line 208, in forward
words_embeddings = self.word_embeddings(input_ids)
File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 118, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/functional.py", line 1454, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1549630534704/work/aten/src/THC/generic/THCTensorIndex.cu:519