steps: 0%| | 0/234 [00:00<?, ?it/s]
epoch 1/3
rank1: Traceback (most recent call last):
rank1: File "/workspace/lora-scripts/./scripts/stable/train_db.py", line 529, in <module>
rank1: File "/workspace/lora-scripts/./scripts/stable/train_db.py", line 343, in train
rank1: encoder_hidden_states = train_util.get_hidden_states(
rank1: File "/workspace/lora-scripts/scripts/stable/library/train_util.py", line 4427, in get_hidden_states
rank1: encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
rank1: File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
rank1: raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
rank0: Traceback (most recent call last):
rank0: File "/workspace/lora-scripts/./scripts/stable/train_db.py", line 529, in <module>
rank0: File "/workspace/lora-scripts/./scripts/stable/train_db.py", line 343, in train
rank0: encoder_hidden_states = train_util.get_hidden_states(
rank0: File "/workspace/lora-scripts/scripts/stable/library/train_util.py", line 4427, in get_hidden_states
rank0: encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
rank0: File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
rank0: raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
steps: 0%| | 0/234 [00:01<?, ?it/s]
W1010 02:41:57.596000 135260147754816 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3147 closing signal SIGTERM
E1010 02:41:57.711000 135260147754816 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 3148) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/root/miniconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1116, in <module>
main()
File "/root/miniconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1112, in main
launch_command(args)
File "/root/miniconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
multi_gpu_launcher(args)
File "/root/miniconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
distrib_run.run(args)
File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./scripts/stable/train_db.py FAILED
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-10-10_02:41:57
host : ubuntu-Super-Server
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3148)
error_file:
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
02:41:58-146594 ERROR Training failed / 训练失败