NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

DataParallel needs to work with apex #269

Closed seongwook-ham closed 4 years ago

seongwook-ham commented 5 years ago

Similar to #227. I have already checked that distributed data parallel works well. But in my case the dataset is large (>200 GB) and I use 8 GPUs, so distributed data parallel would need at least 1.6 TB of RAM (am I right?), and I only have 512 GB. So I need to use DataParallel. With the same code, DataParallel without apex works normally, and distributed data parallel with apex works normally, but DataParallel with apex throws the following error:

Traceback (most recent call last): | 0/333 [00:00<?, ?it/s]
  File "run_pretrain_amp.py", line 1140, in <module>
    main()
  File "run_pretrain_amp.py", line 1038, in main
    loss = model(masked_input_ids, segment_ids, input_mask, masked_lm_labels, label_ids)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/apex/amp/_initialize.py", line 193, in new_fwd
    **applier(kwargs, input_caster))
  File "/home/kizunasunhy/bert_seongwook_v1/modeling.py", line 713, in forward
    output_all_encoded_layers=False)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kizunasunhy/bert_seongwook_v1/modeling.py", line 641, in forward
    embedding_output = self.embeddings(input_ids, token_type_ids)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kizunasunhy/bert_seongwook_v1/modeling.py", line 208, in forward
    words_embeddings = self.word_embeddings(input_ids)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 118, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/functional.py", line 1454, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1549630534704/work/aten/src/THC/generic/THCTensorIndex.cu:519
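
(For context, the failing pattern boils down to the following; a minimal sketch with an illustrative model and inputs, not the exact pretraining script from the report:)

import torch
from apex import amp

model = BertForPreTraining(config).cuda()              # illustrative model/config
optimizer = torch.optim.Adam(model.parameters())

model, optimizer = amp.initialize(model, optimizer, opt_level='O2')
model = torch.nn.DataParallel(model)                   # replicates the already-patched model

# The patched forward still refers to the original (GPU 0) model, while the
# inputs have been scattered to the other GPUs, hence the
# "arguments are located on different GPUs" error in the embedding lookup:
loss = model(input_ids, segment_ids, input_mask, masked_lm_labels, label_ids)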

mcarilli commented 5 years ago

Deduplicating to https://github.com/NVIDIA/apex/issues/227

seongwook-ham commented 5 years ago

In my case there is a slight difference from #227: O0, O1, O3, and O4 do not work.

mcarilli commented 5 years ago

Understood, sorry. I'll keep these both open to track DataParallel issues. It might be a couple weeks before we can take a detailed look at this though (https://github.com/NVIDIA/apex/issues/227#issuecomment-486045751)

seongwook-ham commented 5 years ago

I also tried torch.nn.parallel.DistributedDataParallel (not apex.parallel.DistributedDataParallel) with 1 node and 8 GPUs. Used without apex it works normally, but with apex I get the following error (almost the same case as DataParallel with apex):

Traceback (most recent call last): | 0/6747 [00:00<?, ?it/s]
  File "run_pretrain_amp.py", line 1265, in <module>
    main()
  File "run_pretrain_amp.py", line 1127, in main
    loss = model(masked_input_ids, segment_ids, input_mask, masked_lm_labels, label_ids)
  File "/home/seong-wook/anaconda3/envs/pytorch110/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/seong-wook/anaconda3/envs/pytorch110/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 378, in forward
    outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
  File "/home/seong-wook/anaconda3/envs/pytorch110/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 399, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/seong-wook/anaconda3/envs/pytorch110/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/seong-wook/anaconda3/envs/pytorch110/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/seong-wook/anaconda3/envs/pytorch110/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/seong-wook/anaconda3/envs/pytorch110/lib/python3.6/site-packages/apex/amp/_initialize.py", line 194, in new_fwd
    **applier(kwargs, input_caster))
  File "/mnt/d/nlp_temp/bert_seongwook_v2/modeling.py", line 713, in forward
    output_all_encoded_layers=False)
  File "/home/seong-wook/anaconda3/envs/pytorch110/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/nlp_temp/bert_seongwook_v2/modeling.py", line 641, in forward
    embedding_output = self.embeddings(input_ids, token_type_ids)
  File "/home/seong-wook/anaconda3/envs/pytorch110/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/nlp_temp/bert_seongwook_v2/modeling.py", line 208, in forward
    words_embeddings = self.word_embeddings(input_ids)
  File "/home/seong-wook/anaconda3/envs/pytorch110/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/seong-wook/anaconda3/envs/pytorch110/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 117, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/seong-wook/anaconda3/envs/pytorch110/lib/python3.6/site-packages/torch/nn/functional.py", line 1506, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/generic/THCTensorIndex.cu:521
Traceback (most recent call last):
  File "/home/seong-wook/anaconda3/envs/pytorch110/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/seong-wook/anaconda3/envs/pytorch110/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/seong-wook/anaconda3/envs/pytorch110/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/home/seong-wook/anaconda3/envs/pytorch110/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/home/seong-wook/anaconda3/envs/pytorch110/bin/python', '-u', 'run_pretrain_amp.py', '--local_rank=0', '--gradient_accumulation_steps', '8', '--fp16']' returned non-zero exit status 1.
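
(For reference, this is the single-process, multi-GPU form of torch.nn.parallel.DistributedDataParallel, which internally replicates the module and goes through the same parallel_apply path as DataParallel. A rough sketch with illustrative names, launched via torch.distributed.launch:)

import torch
import torch.distributed as dist
from apex import amp

dist.init_process_group(backend='nccl')

model = BertForPreTraining(config).cuda()                  # illustrative model/config
optimizer = torch.optim.Adam(model.parameters())
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')

# One process drives all 8 GPUs, so DDP replicates the (already patched) model
# per device, just like DataParallel does, and hits the same error:
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=list(range(8)))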

seongwook-ham commented 5 years ago

I find that the old API (FP16_Optimizer) works well with nn.DataParallel. I hope the new API (amp) will also work well with nn.DataParallel.
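
(For reference, the older FP16_Optimizer path mentioned here looks roughly like this; a minimal sketch with an illustrative model cast to fp16 by hand, not the amp API:)

import torch
from apex.fp16_utils import FP16_Optimizer

model = BertForPreTraining(config).cuda().half()   # illustrative model, manually cast to fp16
optimizer = FP16_Optimizer(torch.optim.Adam(model.parameters()), dynamic_loss_scale=True)
model = torch.nn.DataParallel(model)               # no forward patching is involved here

loss = model(input_ids, segment_ids, input_mask, masked_lm_labels, label_ids)  # as in the report
optimizer.backward(loss)                           # FP16_Optimizer applies loss scaling itself
optimizer.step()
optimizer.zero_grad()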

BramVanroy commented 5 years ago

I can confirm this issue. Running Python 3.7, PyTorch 1.2, CUDA 10.1. Hardware is 4x V100. The issue arises for O2 but not for O1. Perhaps interestingly, I also get the error for torch.embedding, just like OP. Full trace:

Traceback (most recent call last):
  File "predict.py", line 279, in <module>
    predictor.predict()
  File "predict.py", line 140, in predict
    best_model_f, fig = trainer.train(epochs=opts['training']['epochs'])
  File "/home/bram/Python/projects/transformer-classifiers/transformer_classifiers/trainer.py", line 249, in train
    train_loss, train_results = self._process('train', epoch)
  File "/home/bram/Python/projects/transformer-classifiers/transformer_classifiers/trainer.py", line 333, in _process
    preds = self.model(input_ids, attention_mask=input_mask)
  File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/apex/amp/_initialize.py", line 194, in new_fwd
    **applier(kwargs, input_caster))
  File "/home/bram/Python/projects/transformer-classifiers/transformer_classifiers/transformer_models.py", line 110, in forward
    out = self.base_model(input_ids=input_ids, attention_mask=attention_mask)
  File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/transformers/modeling_openai.py", line 418, in forward
    inputs_embeds = self.tokens_embed(input_ids)
  File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/bram/.local/share/virtualenvs/transformer-classifiers-x27iJBv7/lib/python3.7/site-packages/torch/nn/functional.py", line 1467, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:397

vadimkantorov commented 4 years ago

Same issue

vadimkantorov commented 4 years ago

I found the root cause: forward must be patched after the DataParallel(...) call, because otherwise the patched method refers to the old model object and not the dynamically created replica. Maybe some other patching approach exists that would work fine with DP, but definitely not the straightforward one in https://github.com/NVIDIA/apex/blob/master/apex/amp/_initialize.py#L201

The workaround I found:

model = apex.amp.initialize(torch.nn.Sequential(model), opt_level='O2')[0]
model = torch.nn.DataParallel(model, device_ids=args.devices)
model.forward = (lambda *args,
                 old_fwd=model.forward,
                 input_caster=lambda tensor: tensor.to(apex.amp._amp_state.opt_properties.options['cast_model_type']),
                 output_caster=lambda tensor: tensor.to(apex.amp._amp_state.opt_properties.options['cast_model_outputs']
                                                        if apex.amp._amp_state.opt_properties.options.get('cast_model_outputs') is not None
                                                        else torch.float32),
                 **kwargs: apex.amp._initialize.applier(
                     old_fwd(*apex.amp._initialize.applier(args, input_caster),
                             **apex.amp._initialize.applier(kwargs, input_caster)),
                     output_caster))
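
(A named-helper variant of the same idea may be easier to read; patch_forward is just an illustrative name, and apex's floating-point check is added so integer inputs such as input_ids are left uncast. A sketch, not part of apex:)

import torch
import apex

def patch_forward(wrapper):
    # Re-apply apex's input/output casting around the DataParallel wrapper's
    # forward, so it is the wrapper (not the original model) that gets patched.
    options = apex.amp._amp_state.opt_properties.options
    applier = apex.amp._initialize.applier
    cast_in = options['cast_model_type']
    cast_out = options.get('cast_model_outputs') or torch.float32
    old_fwd = wrapper.forward

    def cast(t, dtype):
        # Only cast floating-point tensors; leave e.g. integer input_ids alone.
        return t.to(dtype) if t.is_floating_point() else t

    def new_fwd(*args, **kwargs):
        out = old_fwd(*applier(args, lambda t: cast(t, cast_in)),
                      **applier(kwargs, lambda t: cast(t, cast_in)))
        return applier(out, lambda t: cast(t, cast_out))

    wrapper.forward = new_fwd

model = apex.amp.initialize(torch.nn.Sequential(model), opt_level='O2')[0]
model = torch.nn.DataParallel(model, device_ids=args.devices)
patch_forward(model)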

@mcarilli

mcarilli commented 4 years ago

This is very useful information and I haven't been ignoring it, but to be honest I'm probably not going to implement a fix in Apex soon. My absolute top priority right now is getting automatic mixed precision into PyTorch natively, which will eliminate all extension building/version matching issues. I'm taking care to ensure the native integration will support DistributedDataParallel, DataParallel, and model parallel usage. We are targeting the 1.5 release: pytorch/pytorch#25081.

Gradient scaling and autocasting will be independently usable components. The gradient scaling PR is mature and awaiting final documentation review: pytorch/pytorch#26512. The autocasting PR is about 3/4 done in terms of op coverage: pytorch/pytorch#29552. Autocasting will likely be exposed via a context manager that can be used to locally enable/disable mixed precision for any desired regions of the model.

If you are having problems with the current incarnation of Apex, my best advice is to wait for the PRs to be merged. Getting native mixed precision support as soon as possible is the best path forward for everyone IMO.
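
(For readers landing here later: the native integration described above eventually shipped as torch.cuda.amp, and the context-manager-plus-scaler pattern looks roughly like this; a minimal single-GPU sketch, with MyModel and loader as placeholders:)

import torch

model = MyModel().cuda()                          # placeholder model
optimizer = torch.optim.Adam(model.parameters())
scaler = torch.cuda.amp.GradScaler()              # the independently usable gradient scaler

for input, target in loader:                      # placeholder data loader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # the context manager mentioned above
        output = model(input)
        loss = torch.nn.functional.cross_entropy(output, target)
    scaler.scale(loss).backward()                 # scale to avoid fp16 gradient underflow
    scaler.step(optimizer)                        # unscales gradients, then steps
    scaler.update()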

vadimkantorov commented 4 years ago

I've seen these PRs and hope they are merged soon :)

vadimkantorov commented 4 years ago

If you are doing some forward patching in those PRs, the same issue may bite you there too...

mcarilli commented 4 years ago

The upstream integration does not use patching. It does not directly alter methods or attributes of model objects. But it's crucial to be aware of that as a potential issue, so I appreciate the help!