BaguaSys / bagua

Bagua Speeds up PyTorch
https://tutorials-8ro.pages.dev/
MIT License

RuntimeError: TensorError("duplicated tensor detected, name transformer.wte.weight, ptr 140681426914304") #137

Closed: elricwan closed this issue 3 years ago

elricwan commented 3 years ago

I followed the code instructions and ran my GPT-2 model with Bagua on 1 node with 2 GPUs, but I got this error. My code works in a plain PyTorch distributed environment. Here is the source code:

self.model = self.model.cuda()
param_optimizer = list(self.model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
  {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': self.args.weight_decay},
  {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}]
optimizer = Adam_GC(optimizer_grouped_parameters, lr=self.args.lr)

ddp_model = self.model.with_bagua([optimizer], algorithm)

Here is the full error message:

Traceback (most recent call last):
  File "agent_gpu_advanced.py", line 315, in <module>
    main()
  File "agent_gpu_advanced.py", line 310, in main
    dialo.train()
  File "agent_gpu_advanced.py", line 218, in train
    self.model = self.model.with_bagua([optimizer], algorithm)
  File "/home/protago/miniconda3/envs/bagua/lib/python3.8/site-packages/bagua/torch_api/distributed.py", line 288, in with_bagua
    self._bagua_init_algorithm()
  File "/home/protago/miniconda3/envs/bagua/lib/python3.8/site-packages/bagua/torch_api/distributed.py", line 338, in _bagua_init_algorithm
    self._bagua_reset_algorithm_buckets()
  File "/home/protago/miniconda3/envs/bagua/lib/python3.8/site-packages/bagua/torch_api/distributed.py", line 403, in _bagua_reset_algorithm_buckets
    self._bagua_backend.register_ordered_buckets(
RuntimeError: TensorError("duplicated tensor detected, name transformer.wte.weight, ptr 139610570771456")
Traceback (most recent call last):
  File "agent_gpu_advanced.py", line 315, in <module>
    main()
  File "agent_gpu_advanced.py", line 310, in main
    dialo.train()
  File "agent_gpu_advanced.py", line 218, in train
    self.model = self.model.with_bagua([optimizer], algorithm)
  File "/home/protago/miniconda3/envs/bagua/lib/python3.8/site-packages/bagua/torch_api/distributed.py", line 288, in with_bagua
    self._bagua_init_algorithm()
  File "/home/protago/miniconda3/envs/bagua/lib/python3.8/site-packages/bagua/torch_api/distributed.py", line 338, in _bagua_init_algorithm
    self._bagua_reset_algorithm_buckets()
  File "/home/protago/miniconda3/envs/bagua/lib/python3.8/site-packages/bagua/torch_api/distributed.py", line 403, in _bagua_reset_algorithm_buckets
    self._bagua_backend.register_ordered_buckets(
RuntimeError: TensorError("duplicated tensor detected, name transformer.wte.weight, ptr 140024129145856")
Killing subprocess 67546
Killing subprocess 67547

Can anyone help? Thank you!

elricwan commented 3 years ago

The problem comes from these two lines:

self.transformer = GPT2Model(config)
self.lm_head = GPT2LMHead(self.transformer.wte.weight, config)

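In GPT-2, self.transformer.wte.weight is used twice: once as the token embedding inside the transformer and once more in the LM head, so the same tensor is registered under two modules.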
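To make the sharing concrete, here is a minimal sketch in plain PyTorch (no Bagua involved). The class `TinyTiedLM` and the helper `find_shared_tensors` are hypothetical names made up for illustration; they are not part of GPT-2 or Bagua, and this is not how Bagua's internal check is implemented. It only shows that a tied weight appears under two names that point at the same memory, which matches the "duplicated tensor ... ptr" wording of the error.

```python
from collections import defaultdict

import torch.nn as nn


class TinyTiedLM(nn.Module):
    """Hypothetical toy model that ties embedding and output weights, as GPT-2 does."""

    def __init__(self, vocab_size=10, hidden=4):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, hidden)
        self.lm_head = nn.Linear(hidden, vocab_size, bias=False)
        # Weight tying: both modules now hold the very same Parameter object.
        self.lm_head.weight = self.wte.weight


def find_shared_tensors(model):
    """Group state_dict entries by storage address and return the shared groups."""
    by_ptr = defaultdict(list)
    for name, tensor in model.state_dict().items():
        by_ptr[tensor.data_ptr()].append(name)
    return [names for names in by_ptr.values() if len(names) > 1]


model = TinyTiedLM()

# named_parameters() skips repeated Parameter objects by default,
# so the tied weight is listed only once.
unique_names = [name for name, _ in model.named_parameters()]
print(unique_names)  # ['wte.weight']

# state_dict() lists every name, so the tied weight shows up twice
# with an identical data pointer.
shared = find_shared_tensors(model)
print(shared)  # [['wte.weight', 'lm_head.weight']]
```

Running a check like this on the real model before wrapping it should list transformer.wte.weight together with whatever name the LM head uses for the same tensor.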

How can I use Bagua with GPT-2?

NOBLES5E commented 3 years ago

@liuhatry please help take a look

liuhatry commented 3 years ago

Currently, Bagua does not support duplicated tensors. We will add this feature as soon as possible.

liuhatry commented 3 years ago

@elricwan We have fixed the problem. Please try the master branch and let us know if there are any other issues :)

python3 -m pip install git+https://github.com/BaguaSys/bagua.git

The fix will be available in the next release (0.7).

elricwan commented 3 years ago

Thanks