Closed: SKT-T1-Thecai closed this issue 1 year ago
Do you see an error if you run single-process training? For testing, you can remove `torchrun --no_python --nproc_per_node 2` from your command.
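For reference, a minimal sketch of that single-process test, built from the full command further down in this thread. The `--dist` flag is also dropped here on the assumption that it enables distributed mode, and the output path ends in `_single_test` (a hypothetical fresh directory) so the existing run is not overwritten:

```bash
# Single-process test: the torchrun wrapper is removed as suggested above.
# Assumptions: --dist is dropped because it enables distributed training,
# and the output directory is a new one chosen only for this test.
CUDA_VISIBLE_DEVICES=0 sockeye-train \
    --prepared-data /root/autodl-tmp/multilingual_models/en-zh-v1/en-zh.prepared/prepared_data \
    --validation-source /root/autodl-tmp/multilingual_models/en-zh-v1/en-zh.prepared/valid_src.src \
    --validation-target /root/autodl-tmp/multilingual_models/en-zh-v1/en-zh.prepared/valid_zh.zh \
    --output /root/autodl-tmp/multilingual_models/en-zh-v1/en-zh.prepared/sockeye3_enzh_single_test
```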
Thanks, I have solved it by running the command inside tmux and removing `nohup` and the redirection suffix from my command.
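For anyone hitting the same `SignalException: ... got signal: 1` (signal 1 is SIGHUP, sent when the controlling terminal goes away), here is a minimal sketch of the tmux workflow described above. The session name and the optional `tee` logging are my own choices, not from the original report:

```bash
# Start a named tmux session; anything launched inside it keeps running when
# the SSH connection drops, so torchrun never receives SIGHUP.
tmux new-session -s sockeye

# Inside the tmux session, run the training command in the foreground
# (no nohup, no trailing &). tee optionally keeps a log copy on disk.
# Replace <full sockeye-train arguments> with the argument list from the command below.
CUDA_VISIBLE_DEVICES=0,1 torchrun --no_python --nproc_per_node 2 \
    sockeye-train <full sockeye-train arguments> 2>&1 | tee sockeye3.log

# Detach with Ctrl-b d; reattach later with:
tmux attach -t sockeye
```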
Hi, I want to train an EN-ZH translation model using Sockeye 3, but the program always terminates after a while. The log output is:
```
[INFO:sockeye.training] E=242 B=23800 s/sec=1530.55 tok/sec=37447.98 u/sec=3.16 ppl=4.787584
[INFO:sockeye.training] E=243 B=23850 s/sec=1357.29 tok/sec=37276.66 u/sec=3.15 ppl=4.787056
[INFO:sockeye.training] E=243 B=23900 s/sec=1308.87 tok/sec=37354.04 u/sec=3.16 ppl=4.786665
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 412448 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 412449 closing signal SIGHUP
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:{
  "message": {
    "message": "SignalException: Process 412416 got signal: 1",
    "extraInfo": {
      "py_callstack": "Traceback (most recent call last):\n File \"/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 345, in wrapper\n return f(*args, **kwargs)\n File \"/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py\", line 724, in main\n run(args)\n File \"/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py\", line 715, in run\n elastic_launch(\n File \"/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py\", line 131, in __call__\n return launch_agent(self._config, self._entrypoint, list(args))\n File \"/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py\", line 236, in launch_agent\n result = agent.run()\n File \"/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 709, in run\n result = self._invoke_run(role)\n File \"/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 850, in _invoke_run\n time.sleep(monitor_interval)\n File \"/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py\", line 60, in _terminate_process_handler\n raise SignalException(f\"Process {os.getpid()} got signal: {sigval}\", sigval=sigval)\ntorch.distributed.elastic.multiprocessing.api.SignalException: Process 412416 got signal: 1\n",
      "timestamp": "1689589558"
    }
  }
}
Traceback (most recent call last):
  File "/root/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 412416 got signal: 1
```
My training command is:

```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun \
    --no_python \
    --nproc_per_node 2 \
    sockeye-train \
    --prepared-data /root/autodl-tmp/multilingual_models/en-zh-v1/en-zh.prepared/prepared_data \
    --validation-source /root/autodl-tmp/multilingual_models/en-zh-v1/en-zh.prepared/valid_src.src \
    --validation-target /root/autodl-tmp/multilingual_models/en-zh-v1/en-zh.prepared/valid_zh.zh \
    --output /root/autodl-tmp/multilingual_models/en-zh-v1/en-zh.prepared/sockeye3_enzh \
    --num-layers 6 \
    --transformer-model-size 512 \
    --transformer-attention-heads 8 \
    --transformer-feed-forward-num-hidden 2048 \
    --batch-type max-word \
    --batch-size 12000 \
    --update-interval 1 \
    --decode-and-evaluate -1 \
    --checkpoint-interval 10000 \
    --dist \
    --optimizer-betas 0.9:0.98 \
    --initial-learning-rate 0.06325 \
    --learning-rate-scheduler-type inv-sqrt-decay \
    --learning-rate-warmup 4000 \
    --max-num-checkpoint-not-improved 8 \
    --seed 20230715 \
    --quiet-secondary-workers > /root/autodl-tmp/multilingual_models/en-zh-v1/en-zh.prepared/sockeye3.log 2>&1 &
```
The PyTorch version is 1.11.0+cu113. Is there any advice? Thank you.