Closed · azuryl closed this issue 3 years ago
Hi, Actually, the error in your case is "local variable 'gt_img' referenced before assignment".
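For context, this kind of `UnboundLocalError` typically comes from a pattern like the sketch below, where the variable is assigned only inside a conditional branch (e.g. when a validation batch actually carries a ground-truth image) but is referenced unconditionally by the metric call afterwards. The function and key names here are illustrative, not the exact LPTN code:

```python
def metric_stub(result_img, gt_img):
    # Stand-in for a real metric such as PSNR; it just pairs its inputs.
    return (result_img, gt_img)

def nondist_validation(batches):
    """Sketch of the failure mode: gt_img is bound only on one branch."""
    scores = []
    for data in batches:
        result_img = data["result"]
        if "gt" in data:
            gt_img = data["gt"]  # gt_img is assigned only when 'gt' exists
        # If the batch carried no 'gt' key, the next line raises
        # UnboundLocalError: local variable 'gt_img' referenced before assignment
        scores.append(metric_stub(result_img, gt_img))
    return scores
```

So on the user side the usual fix is to make sure the validation dataset is configured with paired ground-truth (`gt`) images, as the repository's instructions describe.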
Dear csjliang,
I did not change your code. I just used your code to train on my own dataset with multiple GPUs.
The code has been validated. Please make sure to follow our instructions in ReadMe. Thanks.
Dear csjliang,
When I use distributed training as described in the README, I get the following:
2021-07-25 20:04:34,446 INFO: [LPTN_..][epoch: 16, iter: 19,800, lr:(1.000e-04,)] [eta: 2 days, 18:43:45, time (data): 0.178 (0.001)] l_g_pix: 3.1238e+01 l_g_gan: 8.7738e+01 l_d_real: 7.0186e+01 out_d_real: -7.0186e+01 l_d_fake: -8.7537e+01 out_d_fake: -8.7537e+01
2021-07-25 20:06:00,337 INFO: [LPTN..][epoch: 17, iter: 19,900, lr:(1.000e-04,)] [eta: 2 days, 18:42:21, time (data): 0.504 (0.001)] l_g_pix: 2.0734e+01 l_g_gan: 9.3872e+01 l_d_real: 6.7697e+01 out_d_real: -6.7697e+01 l_d_fake: -9.4580e+01 out_d_fake: -9.4580e+01
2021-07-25 20:07:30,459 INFO: [LPTN..][epoch: 17, iter: 20,000, lr:(1.000e-04,)] [eta: 2 days, 18:41:57, time (data): 0.202 (0.001)] l_g_pix: 3.0153e+01 l_g_gan: 9.9768e+01 l_d_real: 7.4591e+01 out_d_real: -7.4591e+01 l_d_fake: -9.9862e+01 out_d_fake: -9.9862e+01
2021-07-25 20:07:30,460 INFO: Saving models and training states.
0%| | 0/998 [00:00<?, ?image/s]
2021-07-25 20:07:30,515 INFO: Only support single GPU validation.
0%| | 0/998 [00:00<?, ?image/s]
Traceback (most recent call last):
  File "codes/train.py", line 249, in <module>
    main()
  File "codes/train.py", line 226, in main
    model.validation(val_loader, current_iter, tb_logger,
  File "/home/delight-gpu/project/LPTN/codes/models/base_model.py", line 45, in validation
    self.dist_validation(dataloader, current_iter, tb_logger, save_img)
  File "/home/delight-gpu/project/LPTN/codes/models/lptn_model.py", line 169, in dist_validation
    self.nondist_validation(dataloader, current_iter, tb_logger, save_img)
  File "/home/delight-gpu/project/LPTN/codes/models/lptn_model.py", line 225, in nondist_validation
    metric_module, metric_type)(result_img, gt_img, **opt)
UnboundLocalError: local variable 'gt_img' referenced before assignment
0%| | 0/998 [00:02<?, ?image/s]
0%| | 0/998 [00:02<?, ?image/s]
0%| | 0/998 [00:03<?, ?image/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11662) of binary: /home/delight-gpu/anaconda3/envs/lptn/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=4321
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2]
role_ranks=[0, 1, 2]
global_ranks=[0, 1, 2]
role_world_sizes=[3, 3, 3]
global_world_sizes=[3, 3, 3]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_l8eumjpm/none_twj5_557/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_l8eumjpm/none_twj5_557/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_l8eumjpm/none_twj5_557/attempt_1/2/error.json
Traceback (most recent call last):
  File "codes/train.py", line 249, in <module>
    main()
  File "codes/train.py", line 128, in main
    opt = parse_options(is_train=True)
  File "codes/train.py", line 43, in parse_options
    init_dist(args.launcher)
  File "/home/delight-gpu/project/LPTN/codes/utils/dist_util.py", line 14, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "/home/delight-gpu/project/LPTN/codes/utils/dist_util.py", line 25, in _init_dist_pytorch
    dist.init_process_group(backend=backend, **kwargs)
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=3, worker_count=6, timeout=0:30:00)
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=3, worker_count=6, timeout=0:30:00)
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=3, worker_count=6, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 28662) of binary: /home/delight-gpu/anaconda3/envs/lptn/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=2
master_addr=127.0.0.1
master_port=4321
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2]
role_ranks=[0, 1, 2]
global_ranks=[0, 1, 2]
role_world_sizes=[3, 3, 3]
global_world_sizes=[3, 3, 3]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_l8eumjpm/none_twj5_557/attempt_2/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_l8eumjpm/none_twj5_557/attempt_2/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_l8eumjpm/none_twj5_557/attempt_2/2/error.json
Traceback (most recent call last):
  File "codes/train.py", line 249, in <module>
    main()
  File "codes/train.py", line 128, in main
    opt = parse_options(is_train=True)
  File "codes/train.py", line 43, in parse_options
    init_dist(args.launcher)
  File "/home/delight-gpu/project/LPTN/codes/utils/dist_util.py", line 14, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "/home/delight-gpu/project/LPTN/codes/utils/dist_util.py", line 25, in _init_dist_pytorch
    dist.init_process_group(backend=backend, **kwargs)
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=3, worker_count=9, timeout=0:30:00)
Traceback (most recent call last):
  File "codes/train.py", line 249, in <module>
    main()
  File "codes/train.py", line 128, in main
    opt = parse_options(is_train=True)
  File "codes/train.py", line 43, in parse_options
    init_dist(args.launcher)
  File "/home/delight-gpu/project/LPTN/codes/utils/dist_util.py", line 14, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "/home/delight-gpu/project/LPTN/codes/utils/dist_util.py", line 25, in _init_dist_pytorch
    dist.init_process_group(backend=backend, **kwargs)
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=3, worker_count=9, timeout=0:30:00)
Traceback (most recent call last):
  File "codes/train.py", line 249, in <module>
    main()
  File "codes/train.py", line 128, in main
    opt = parse_options(is_train=True)
  File "codes/train.py", line 43, in parse_options
    init_dist(args.launcher)
  File "/home/delight-gpu/project/LPTN/codes/utils/dist_util.py", line 14, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "/home/delight-gpu/project/LPTN/codes/utils/dist_util.py", line 25, in _init_dist_pytorch
    dist.init_process_group(backend=backend, **kwargs)
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=3, worker_count=9, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 30527) of binary: /home/delight-gpu/anaconda3/envs/lptn/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=3
master_addr=127.0.0.1
master_port=4321
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2]
role_ranks=[0, 1, 2]
global_ranks=[0, 1, 2]
role_world_sizes=[3, 3, 3]
global_world_sizes=[3, 3, 3]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_l8eumjpm/none_twj5_557/attempt_3/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_l8eumjpm/none_twj5_557/attempt_3/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_l8eumjpm/none_twj5_557/attempt_3/2/error.json
Traceback (most recent call last):
  File "codes/train.py", line 249, in <module>
    main()
  File "codes/train.py", line 128, in main
    opt = parse_options(is_train=True)
  File "codes/train.py", line 43, in parse_options
    init_dist(args.launcher)
  File "/home/delight-gpu/project/LPTN/codes/utils/dist_util.py", line 14, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "/home/delight-gpu/project/LPTN/codes/utils/dist_util.py", line 25, in _init_dist_pytorch
    dist.init_process_group(backend=backend, **kwargs)
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=3, worker_count=12, timeout=0:30:00)
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=3, worker_count=12, timeout=0:30:00)
Traceback (most recent call last):
  File "codes/train.py", line 249, in <module>
    main()
  File "codes/train.py", line 128, in main
    opt = parse_options(is_train=True)
  File "codes/train.py", line 43, in parse_options
    init_dist(args.launcher)
  File "/home/delight-gpu/project/LPTN/codes/utils/dist_util.py", line 14, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "/home/delight-gpu/project/LPTN/codes/utils/dist_util.py", line 25, in _init_dist_pytorch
    dist.init_process_group(backend=backend, **kwargs)
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=3, worker_count=12, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 32378) of binary: /home/delight-gpu/anaconda3/envs/lptn/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0010943412780761719 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "32378", "role": "default", "hostname": "LIGHT-24B.PC.CS.CMU.EDU", "state": "FAILED", "total_run_time": 22569, "rdzv_backend": "static", "raw_error": "{\"message\": \"\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [3]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "32379", "role": "default", "hostname": "LIGHT-24B.PC.CS.CMU.EDU", "state": "FAILED", "total_run_time": 22569, "rdzv_backend": "static", "raw_error": "{\"message\": \"\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [3]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 2, "group_rank": 0, "worker_id": "32380", "role": "default", "hostname": "LIGHT-24B.PC.CS.CMU.EDU", "state": "FAILED", "total_run_time": 22569, "rdzv_backend": "static", "raw_error": "{\"message\": \"\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [2], \"role_rank\": [2], \"role_world_size\": [3]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "LIGHT-24B.PC.CS.CMU.EDU", "state": "SUCCEEDED", "total_run_time": 22569, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 3}}
/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning:
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 32378 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application, no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with torch.distributed.elastic.multiprocessing.errors.record. Example:
from torch.distributed.elastic.multiprocessing.errors import record

@record
def trainer_main(args):
    # do train
warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/launch.py", line 173, in <module>
    main()
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/launch.py", line 169, in main
    run(args)
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/run.py", line 621, in run
    elastic_launch(
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/delight-gpu/anaconda3/envs/lptn/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================
Root Cause:
[0]:
  time: 2021-07-25_21:37:41
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 32378)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
[1]:
  time: 2021-07-25_21:37:41
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 32379)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[2]:
  time: 2021-07-25_21:37:41
  rank: 2 (local_rank: 2)
  exitcode: 1 (pid: 32380)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"