HPDL-Group / Merak

Apache License 2.0

RuntimeError: Tried to erase Node attention_mask_1 but it still had 1 users in the graph: {_assert_is_none: None}! #1

Closed: QiaolingChen00 closed this issue 1 year ago

QiaolingChen00 commented 2 years ago

RuntimeError: Tried to erase Node attention_mask_1 but it still had 1 users in the graph: {_assert_is_none: None}!

I followed Merak/examples/language-modeling/README.md and used the command below, without changing any code in Merak:

python -m torch.distributed.launch --nproc_per_node=4  run_bert.py \
                --model-name bert-large-uncased  \
                --data-files ./train_context.csv \
                --cache-dir ./bert_cache \
                --output_dir ./output \
                --remove_unused_columns false \
                --per_device_train_batch_size 4 --gradient_accumulation_steps 4

but got the following error:

[2022-09-03 11:35:52,489] [INFO] [checkpointing.py:207:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2760 and data parallel seed: 42
Traceback (most recent call last):
  File "/users/hqh/Merak/examples/language-modeling/run_bert.py", line 99, in <module>
    main()
  File "/users/hqh/Merak/examples/language-modeling/run_bert.py", line 93, in main
    train_result = trainer.train()
  File "/users/hqh/miniconda3/lib/python3.9/site-packages/Merak/train_func.py", line 131, in train
    self.create_optimizer_and_scheduler(num_training_steps=max_steps)
  File "/users/hqh/miniconda3/lib/python3.9/site-packages/Merak/merak_trainer.py", line 202, in create_optimizer_and_scheduler
    model, model_layers, input_to_shard_dic = convert_to_sequential(self.model, self.args, self.leaf_modules)
  File "/users/hqh/miniconda3/lib/python3.9/site-packages/Merak/autoshard/convert.py", line 102, in convert_to_sequential
    traced, dummy_inputs = symbolic_trace(
  File "/users/hqh/miniconda3/lib/python3.9/site-packages/Merak/autoshard/convert.py", line 342, in symbolic_trace
    traced_graph = tracer.trace(model, concrete_args=concrete_args)
  File "/users/hqh/miniconda3/lib/python3.9/site-packages/transformers/utils/fx.py", line 387, in trace
    graph.erase_node(node)
  File "/users/hqh/miniconda3/lib/python3.9/site-packages/torch/fx/graph.py", line 761, in erase_node
    raise RuntimeError(f'Tried to erase Node {to_erase} but it still had {len(to_erase.users)} '
RuntimeError: Tried to erase Node attention_mask_1 but it still had 1 users in the graph: {_assert_is_none: None}!
(the same traceback is repeated by each of the other three ranks)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 25830) of binary: /users/hqh/miniconda3/bin/python
Traceback (most recent call last):
  File "/users/hqh/miniconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/users/hqh/miniconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/users/hqh/miniconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/users/hqh/miniconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/users/hqh/miniconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/users/hqh/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/users/hqh/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/users/hqh/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_bert.py FAILED
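For context on what the error means: `torch.fx` refuses to erase a graph node while other nodes still reference it. Here the traced `attention_mask_1` placeholder is still consumed by a `_assert_is_none` check, so the tracer's cleanup in `transformers/utils/fx.py` fails. The invariant can be sketched without torch at all; the `Node`/`erase_node` names below are illustrative stand-ins mirroring the check in `torch.fx.Graph.erase_node`, not Merak or torch code:

```python
# Minimal sketch (not Merak/torch code) of the users-check that
# torch.fx.Graph.erase_node performs before deleting a node.
class Node:
    def __init__(self, name):
        self.name = name
        self.users = {}   # nodes that consume this node's output

def erase_node(node):
    # fx refuses to erase a node that still has consumers
    if node.users:
        raise RuntimeError(
            f"Tried to erase Node {node.name} but it still had "
            f"{len(node.users)} users in the graph: {node.users}!"
        )

mask = Node("attention_mask_1")        # the traced placeholder input
check = Node("_assert_is_none")        # leftover sanity-check op
mask.users[check.name] = None          # the check still consumes the mask
try:
    erase_node(mask)
except RuntimeError as e:
    print(e)
```

So the real fix is to make sure the dangling `_assert_is_none` consumer is removed (or never created) before the placeholder is erased, which is why this is sensitive to the torch/transformers version combination.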
lucasleesw commented 2 years ago

@Chenqll Could you add the parameter --wall_clock_breakdown True to your command for debugging, then attach the full running log and your environment information (e.g., the versions of transformers and PyTorch)? Thanks.
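A convenient, standard-library-only way to collect the requested version information (the helper name `env_report` is just for illustration):

```python
import platform
from importlib import metadata

def env_report(packages=("torch", "transformers")):
    """Collect version strings for the packages being asked about."""
    report = {"python": platform.python_version()}
    for pkg in packages:
        try:
            report[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            report[pkg] = "not installed"
    return report

print(env_report())
```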

insujang commented 1 year ago

Hi. I hit the same error when trying to run GPT. --wall_clock_breakdown True doesn't make a meaningful difference: it prints the structure of the model and then returns the same error.

I am using PyTorch 1.12.1 + CUDA 11.6, and Transformers 4.15.0.

lucasleesw commented 1 year ago

@insujang Hi, thanks for using Merak. We think it might be caused by this. Please try downgrading PyTorch to 1.10; we will try torch 1.12 very soon and fix this in the next PR.

insujang commented 1 year ago

@lucasleesw Thanks for the quick response! I confirm that it runs successfully on older versions of PyTorch (tested 1.10.0 and 1.11.0) on my system.
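For anyone landing here later: per this thread, 1.10.x and 1.11.x work while 1.12.x fails, so a small guard can fail fast before tracing. This is a sketch of the compatibility range reported above only; the helper name `torch_version_ok` is hypothetical:

```python
# Fail fast if the installed torch is outside the range this thread
# reports as working with Merak's fx tracing: 1.10.x and 1.11.x work,
# 1.12.x triggers the erase_node error. Helper name is illustrative.
def torch_version_ok(version_str):
    major, minor = (int(p) for p in version_str.split(".")[:2])
    return (major, minor) <= (1, 11)

# e.g. at startup:
# import torch
# if not torch_version_ok(torch.__version__):
#     raise RuntimeError("Downgrade PyTorch to <= 1.11 (see Merak issue #1)")
```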