NVlabs / imaginaire

NVIDIA's Deep Imagination Team's PyTorch Library

Error: vid2vid_street.yaml >> /tmp/unit_test.log [Failure] when running bash scripts/test_training.sh #131

Closed. D-Mad closed this issue 2 years ago.

D-Mad commented 2 years ago

root@4460709f4b11:/workspace# bash scripts/test_training.sh /opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead logger.warn( The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run WARNING:torch.distributed.run:torch.distributed.launch is Deprecated. Use torch.distributed.run INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs: entrypoint : train.py min_nodes : 1 max_nodes : 1 nproc_per_node : 1 run_id : none rdzv_backend : static rdzv_endpoint : 127.0.0.1:29500 rdzv_configs : {'rank': 0, 'timeout': 900} max_restarts : 3 monitor_interval : 5 log_dir : None metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_oe27bhot/none_vqivpge9 INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_oe27bhot/none_vqivpge9/attempt_0/0/error.json Downloading: "https://download.pytorch.org/models/vgg19-dcbb9e9d.pth" to /root/.cache/torch/hub/checkpoints/vgg19-dcbb9e9d.pth 100%|█████████████████████████████████████████████████████████████████████| 548M/548M [01:40<00:00, 5.71MB/s] /opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:3590: UserWarning: Default upsampling behavior when mode=bicubic is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details. warnings.warn( /opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:3638: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details. warnings.warn( /opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at ../c10/core/TensorImpl.h:1153.) return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish. INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. 
Elapsed: 0.0005445480346679688 seconds {"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "386", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 125, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 0}} {"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 125, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}} python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/spade.yaml >> /tmp/unit_test.log [Success] /opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead logger.warn( The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run WARNING:torch.distributed.run:torch.distributed.launch is Deprecated. Use torch.distributed.run INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs: entrypoint : train.py min_nodes : 1 max_nodes : 1 nproc_per_node : 1 run_id : none rdzv_backend : static rdzv_endpoint : 127.0.0.1:29500 rdzv_configs : {'rank': 0, 'timeout': 900} max_restarts : 3 monitor_interval : 5 log_dir : None metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_4bbnta_g/none_vmrfh3_4 INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_4bbnta_g/none_vmrfh3_4/attempt_0/0/error.json /opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at ../c10/core/TensorImpl.h:1153.) return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish. INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0006163120269775391 seconds {"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "936", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 0}} {"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}} python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/pix2pixHD.yaml >> /tmp/unit_test.log [Success] /opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead logger.warn( The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run WARNING:torch.distributed.run:torch.distributed.launch is Deprecated. Use torch.distributed.run INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs: entrypoint : train.py min_nodes : 1 max_nodes : 1 nproc_per_node : 1 run_id : none rdzv_backend : static rdzv_endpoint : 127.0.0.1:29500 rdzv_configs : {'rank': 0, 'timeout': 900} max_restarts : 3 monitor_interval : 5 log_dir : None metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_62ca57fw/none_cbejg5ie INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_62ca57fw/none_cbejg5ie/attempt_0/0/error.json [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish. INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0006213188171386719 seconds {"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "1313", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 0}} {"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}} python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/munit.yaml >> /tmp/unit_test.log [Success] /opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead logger.warn( The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run WARNING:torch.distributed.run:torch.distributed.launch is Deprecated. Use torch.distributed.run INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs: entrypoint : train.py min_nodes : 1 max_nodes : 1 nproc_per_node : 1 run_id : none rdzv_backend : static rdzv_endpoint : 127.0.0.1:29500 rdzv_configs : {'rank': 0, 'timeout': 900} max_restarts : 3 monitor_interval : 5 log_dir : None metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_peoaucrn/none_je_rxzez INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_peoaucrn/none_je_rxzez/attempt_0/0/error.json [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish. INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0005629062652587891 seconds {"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "1561", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 0}} {"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}} python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/munit_patch.yaml >> /tmp/unit_test.log [Success] /opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead logger.warn( The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run WARNING:torch.distributed.run:torch.distributed.launch is Deprecated. Use torch.distributed.run INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs: entrypoint : train.py min_nodes : 1 max_nodes : 1 nproc_per_node : 1 run_id : none rdzv_backend : static rdzv_endpoint : 127.0.0.1:29500 rdzv_configs : {'rank': 0, 'timeout': 900} max_restarts : 3 monitor_interval : 5 log_dir : None metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_gf_rgw_1/none_yd9hagbt INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_gf_rgw_1/none_yd9hagbt/attempt_0/0/error.json /opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:3638: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details. warnings.warn( [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish. INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0006194114685058594 seconds {"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "1809", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 15, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 0}} {"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 15, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}} python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/unit.yaml >> /tmp/unit_test.log [Success] /opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead logger.warn( The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run WARNING:torch.distributed.run:torch.distributed.launch is Deprecated. 
Use torch.distributed.run INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs: entrypoint : train.py min_nodes : 1 max_nodes : 1 nproc_per_node : 1 run_id : none rdzv_backend : static rdzv_endpoint : 127.0.0.1:29500 rdzv_configs : {'rank': 0, 'timeout': 900} max_restarts : 3 monitor_interval : 5 log_dir : None metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastick4k83su/nonedoq690x INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastick4k83su/nonedoq690x/attempt_0/0/error.json INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish. INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0004730224609375 seconds {"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "2358", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 15, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 0}} {"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 15, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}} python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/funit.yaml >> /tmp/unit_test.log [Success] /opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead logger.warn( The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run WARNING:torch.distributed.run:torch.distributed.launch is Deprecated. Use torch.distributed.run INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs: entrypoint : train.py min_nodes : 1 max_nodes : 1 nproc_per_node : 1 run_id : none rdzv_backend : static rdzv_endpoint : 127.0.0.1:29500 rdzv_configs : {'rank': 0, 'timeout': 900} max_restarts : 3 monitor_interval : 5 log_dir : None metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_qmiyb0yp/none_nw6jcn_x INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_qmiyb0yp/none_nw6jcn_x/attempt_0/0/error.json
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future. warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0007641315460205078 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "2600", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 15, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 15, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}}
python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/coco_funit.yaml >> /tmp/unit_test.log [Success]
Traceback (most recent call last):
  File "train.py", line 168, in <module>
    main()
  File "train.py", line 92, in main
    trainer = get_trainer(cfg, net_G, net_D,
  File "/workspace/imaginaire/utils/trainer.py", line 59, in get_trainer
    trainer = trainer_lib.Trainer(cfg, net_G, net_D,
  File "/workspace/imaginaire/trainers/vid2vid.py", line 44, in __init__
    super(Trainer, self).__init__(cfg, net_G, net_D, opt_G,
  File "/workspace/imaginaire/trainers/base.py", line 99, in __init__
    self._init_loss(cfg)
  File "/workspace/imaginaire/trainers/vid2vid.py", line 145, in _init_loss
    self.criteria['Flow'] = FlowLoss(cfg)
  File "/workspace/imaginaire/losses/flow.py", line 59, in __init__
    self.flowNet = flow_module.FlowNet(pretrained=True)
  File "/workspace/imaginaire/third_party/flow_net/flow_net.py", line 30, in __init__
    checkpoint = torch.load(flownet2_path,
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 777, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.
python train.py --single_gpu --config configs/unit_test/vid2vid_street.yaml >> /tmp/unit_test.log [Failure]
root@4460709f4b11:/workspace#


SLyra21 commented 2 years ago

Hi, I just had the same error. It seems there is a problem with the download of the FlowNet2 checkpoint. I fixed it by downloading it manually here and changing line 30 in imaginaire/third_party/flow_net/flow_net.py to load the file from the new path. The same error occurred for the gancraft unit test: download that checkpoint here and change line 36 in imaginaire/trainers/gancraft.py. A small sketch of the check and the edit is below.
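For reference, a minimal sketch of what that edit amounts to, assuming the checkpoint was saved to a local path of your choosing (the path below is only a placeholder; the '<' in the unpickling error is typically what you get when the automatic download stored an HTML page instead of the actual file):

```python
import torch

# Placeholder: wherever you saved the manually downloaded checkpoint.
flownet2_path = "/workspace/checkpoints/flownet2.pth.tar"

# Sanity check: a broken download (e.g. a Google Drive warning page) is HTML and
# starts with "<", which is exactly what later raises "invalid load key, '<'".
with open(flownet2_path, "rb") as f:
    if f.read(1) == b"<":
        raise RuntimeError("This file is HTML, not a checkpoint; re-download it manually.")

# The edit in imaginaire/third_party/flow_net/flow_net.py (line 30) and
# imaginaire/trainers/gancraft.py (line 36) then simply points torch.load at this file:
checkpoint = torch.load(flownet2_path, map_location="cpu")
```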

Best, Simon

Skyrelixa commented 2 years ago

@SLyra21 Hello! Can you elaborate more on how you changed the code in gancraft.py for the checkpoint?

SLyra21 commented 2 years ago

@Skyrelixa Hi! I did not really change anything, except the path to the manually downloaded checkpoint: from ckpt = torch.load(checkpoint) to ckpt = torch.load("your_path_to_the_checkpoint_file")

Skyrelixa commented 2 years ago

@SLyra21 Brilliant, thank you! :D

Feather06 commented 2 years ago

I have the same problem. I modified the code according to the solution above, but I still get the same error message.

Any advice is greatly appreciated!

SLyra21 commented 2 years ago

@Feather06 Did you download the new checkpoints and use the correct path? If the path is correct, the error should be gone! Did you try to use the global (absolute) path?

Feather06 commented 2 years ago

I will also modify the path in base.py, and then it should be fine. Thank you!!!

Feather06 commented 2 years ago

I got a new error message after also changing the path at line 280 of base.py.

Traceback (most recent call last): File "inference.py", line 94, in main() File "inference.py", line 85, in main trainer.load_checkpoint(cfg, args.checkpoint) File "C:\Users\ETMProject\Desktop\Few_vid2vid\imaginaire\imaginaire\trainers\base.py", line 319, in load_checkpoint net_G_module.load_pretrained_network(self.net_G, checkpoint['net_G']) File "C:\Users\ETMProject\Desktop\Few_vid2vid\imaginaire\imaginaire\generators\fs_vid2vid.py", line 293, in load_pretrained_network kp = prefix + k TypeError: unsupported operand type(s) for +: 'collections.OrderedDict' and 'str'

SLyra21 commented 2 years ago

You don't have to change base.py; the errors described above are only due to corrupt checkpoint files for flow_net and gancraft...

Feather06 commented 2 years ago

I changed base.py back, but... @@

Checkpoint downloaded to C:\Users\ETMProject\Desktop\Few_vid2vid\imaginaire\checkpoints\configs/projects/fs_vid2vid/face_forensics/ampO1-1F_22ctFmo553nRHy1d_BX7aorc9zk9cF.pt
Traceback (most recent call last):
  File "inference.py", line 94, in <module>
    main()
  File "inference.py", line 85, in main
    trainer.load_checkpoint(cfg, args.checkpoint)
  File "C:\Users\ETMProject\Desktop\Few_vid2vid\imaginaire\imaginaire\trainers\base.py", line 280, in load_checkpoint
    checkpoint = torch.load(
  File "C:\Users\ETMProject.conda\envs\imaginaire\lib\site-packages\torch\serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "C:\Users\ETMProject.conda\envs\imaginaire\lib\site-packages\torch\serialization.py", line 777, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.

SLyra21 commented 2 years ago

On Windows you need to be careful with the slashes! Try using this path in torch.load:

"C:/Users/ETMProject/Desktop/Few_vid2vid/imaginaire/checkpoints/configs/projects/fs_vid2vid/face_forensics/ampO1-1F_22ctFmo553nRHy1d_BX7aorc9zk9cF.pt"

Here you can find some information:

Referencing a File in Windows
In Windows, there are a couple additional ways of referencing a file. That is because natively, Windows file path employs the backslash instead of the slash. Python allows using both in a Windows system, but there are a couple of pitfalls to watch out for. To sum them up:

- Python lets you use OS-X/Linux style slashes "/" even in Windows. Therefore, you can refer to the file as 'C:/Users/narae/Desktop/alice.txt'. RECOMMENDED.
- If using the backslash, because it is a special character in Python, you must remember to escape every instance: 'C:\\Users\\narae\\Desktop\\alice.txt'
- Alternatively, you can prefix the entire file name string with the rawstring marker "r": r'C:\Users\narae\Desktop\alice.txt'. That way, everything in the string is interpreted as a literal character, and you don't have to escape every backslash.

Source: https://sites.pitt.edu/~naraehan/python3/file_path_cwd.html
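To make those three spellings concrete, here is a minimal sketch using the checkpoint path from the comment above (purely illustrative; substitute your own location):

```python
import torch

# The same Windows path written three equivalent ways:
p1 = "C:/Users/ETMProject/Desktop/Few_vid2vid/imaginaire/checkpoints/configs/projects/fs_vid2vid/face_forensics/ampO1-1F_22ctFmo553nRHy1d_BX7aorc9zk9cF.pt"
p2 = "C:\\Users\\ETMProject\\Desktop\\Few_vid2vid\\imaginaire\\checkpoints\\configs\\projects\\fs_vid2vid\\face_forensics\\ampO1-1F_22ctFmo553nRHy1d_BX7aorc9zk9cF.pt"
p3 = r"C:\Users\ETMProject\Desktop\Few_vid2vid\imaginaire\checkpoints\configs\projects\fs_vid2vid\face_forensics\ampO1-1F_22ctFmo553nRHy1d_BX7aorc9zk9cF.pt"

assert p1.replace("/", "\\") == p2 == p3  # all three name the same file

# If the file itself is valid, this succeeds with any of the spellings;
# "invalid load key, '<'" means the file content is wrong, not the path syntax.
ckpt = torch.load(p1, map_location="cpu")
```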

Feather06 commented 2 years ago

I have tried both "/" and "\". Here is my command:

python inference.py --single_gpu --num_workers 0 --config configs/projects/fs_vid2vid/face_forensics/ampO1.yaml --output_dir projects/fs_vid2vid/output/face_forensics

Using random seed 0 cudnn benchmark: True cudnn deterministic: False Creating metadata ['images', 'landmarks-dlib68'] Data file extensions: {'images': 'jpg', 'landmarks-dlib68': 'json'} Searching in dir: images Found 1 sequences Found 1 files ['images', 'landmarks-dlib68'] Data file extensions: {'images': 'jpg', 'landmarks-dlib68': 'json'} Searching in dir: images Found 1 sequences Found 30 files Folder at projects/fs_vid2vid/test_data/faceForensics/reference\images opened. Folder at projects/fs_vid2vid/test_data/faceForensics/reference\landmarks-dlib68 opened. Folder at projects/fs_vid2vid/test_data/faceForensics/driving\images opened. Folder at projects/fs_vid2vid/test_data/faceForensics/driving\landmarks-dlib68 opened. Num datasets: 2 Num sequences: 2 Max sequence length: 30 Epoch length: 1 Using random seed 0 Concatenate images: ext: jpg num_channels: 3 normalize: True for input. Num. of channels in the input image: 3 Concatenate images: ext: jpg num_channels: 3 normalize: True for input. Concatenate landmarks-dlib68: ext: json num_channels: 1 interpolator: None normalize: False pre_aug_ops: decode_json post_aug_ops: vis::imaginaire.utils.visualization.face::connect_face_keypoints for input. Num. of channels in the input label: 1 Concatenate images: ext: jpg num_channels: 3 normalize: True for input. Num. of channels in the input image: 3 Concatenate images: ext: jpg num_channels: 3 normalize: True for input. Num. of channels in the input image: 3 Concatenate images: ext: jpg num_channels: 3 normalize: True for input. Num. of channels in the input image: 3 Initialized temporal embedding network with the reference one. Concatenate images: ext: jpg num_channels: 3 normalize: True for input. Concatenate landmarks-dlib68: ext: json num_channels: 1 interpolator: None normalize: False pre_aug_ops: decode_json post_aug_ops: vis::imaginaire.utils.visualization.face::connect_face_keypoints for input. Num. of channels in the input label: 1 Concatenate images: ext: jpg num_channels: 3 normalize: True for input. Num. of channels in the input image: 3 Initialize net_G and net_D weights using type: xavier gain: 0.02 Using random seed 0 net_G parameter count: 91,145,502 net_D parameter count: 5,593,922 Use custom initialization for the generator. Setup trainer. Using automatic mixed precision training. Augmentation policy: GAN mode: hinge Perceptual loss: Mode: vgg19 Loss GAN Weight 1.0 Loss FeatureMatching Weight 10.0 Loss Perceptual Weight 10.0 Loss Flow Weight 10.0 Loss Flow_L1 Weight 10.0 Loss Flow_Warp Weight 10.0 Loss Flow_Mask Weight 10.0 Checkpoint downloaded to C:\Users\ETMProject\Desktop\Few_vid2vid\imaginaire\checkpoints\configs/projects/fs_vid2vid/face_forensics/ampO1-1F_22ctFmo553nRHy1d_BX7aorc9zk9cF.pt Traceback (most recent call last): File "inference.py", line 94, in main() File "inference.py", line 85, in main trainer.load_checkpoint(cfg, args.checkpoint) File "C:\Users\ETMProject\Desktop\Few_vid2vid\imaginaire\imaginaire\trainers\base.py", line 280, in load_checkpoint checkpoint = torch.load( File "C:\Users\ETMProject.conda\envs\imaginaire\lib\site-packages\torch\serialization.py", line 608, in load return _legacy_load(opened_file, map_location, pickle_module, pickle_load_args) File "C:\Users\ETMProject.conda\envs\imaginaire\lib\site-packages\torch\serialization.py", line 777, in _legacy_load magic_number = pickle_module.load(f, pickle_load_args) _pickle.UnpicklingError: invalid load key, '<'.

(imaginaire) C:\Users\ETMProject\Desktop\Few_vid2vid\imaginaire>python inference.py --single_gpu --num_workers 0 --config configs/projects/fs_vid2vid/face_forensics/ampO1.yaml --output_dir projects/fs_vid2vid/output/face_forensics Using random seed 0 cudnn benchmark: True cudnn deterministic: False Creating metadata ['images', 'landmarks-dlib68'] Data file extensions: {'images': 'jpg', 'landmarks-dlib68': 'json'} Searching in dir: images Found 1 sequences Found 1 files ['images', 'landmarks-dlib68'] Data file extensions: {'images': 'jpg', 'landmarks-dlib68': 'json'} Searching in dir: images Found 1 sequences Found 30 files Folder at projects/fs_vid2vid/test_data/faceForensics/reference\images opened. Folder at projects/fs_vid2vid/test_data/faceForensics/reference\landmarks-dlib68 opened. Folder at projects/fs_vid2vid/test_data/faceForensics/driving\images opened. Folder at projects/fs_vid2vid/test_data/faceForensics/driving\landmarks-dlib68 opened. Num datasets: 2 Num sequences: 2 Max sequence length: 30 Epoch length: 1 Using random seed 0 Concatenate images: ext: jpg num_channels: 3 normalize: True for input. Num. of channels in the input image: 3 Concatenate images: ext: jpg num_channels: 3 normalize: True for input. Concatenate landmarks-dlib68: ext: json num_channels: 1 interpolator: None normalize: False pre_aug_ops: decode_json post_aug_ops: vis::imaginaire.utils.visualization.face::connect_face_keypoints for input. Num. of channels in the input label: 1 Concatenate images: ext: jpg num_channels: 3 normalize: True for input. Num. of channels in the input image: 3 Concatenate images: ext: jpg num_channels: 3 normalize: True for input. Num. of channels in the input image: 3 Concatenate images: ext: jpg num_channels: 3 normalize: True for input. Num. of channels in the input image: 3 Initialized temporal embedding network with the reference one. Concatenate images: ext: jpg num_channels: 3 normalize: True for input. Concatenate landmarks-dlib68: ext: json num_channels: 1 interpolator: None normalize: False pre_aug_ops: decode_json post_aug_ops: vis::imaginaire.utils.visualization.face::connect_face_keypoints for input. Num. of channels in the input label: 1 Concatenate images: ext: jpg num_channels: 3 normalize: True for input. Num. of channels in the input image: 3 Initialize net_G and net_D weights using type: xavier gain: 0.02 Using random seed 0 net_G parameter count: 91,145,502 net_D parameter count: 5,593,922 Use custom initialization for the generator. Setup trainer. Using automatic mixed precision training. 
Augmentation policy:
GAN mode: hinge
Perceptual loss: Mode: vgg19
Loss GAN Weight 1.0
Loss FeatureMatching Weight 10.0
Loss Perceptual Weight 10.0
Loss Flow Weight 10.0
Loss Flow_L1 Weight 10.0
Loss Flow_Warp Weight 10.0
Loss Flow_Mask Weight 10.0
Checkpoint downloaded to C:\Users\ETMProject\Desktop\Few_vid2vid\imaginaire\checkpoints\configs/projects/fs_vid2vid/face_forensics/ampO1-1F_22ctFmo553nRHy1d_BX7aorc9zk9cF.pt
Traceback (most recent call last):
  File "inference.py", line 94, in <module>
    main()
  File "inference.py", line 85, in main
    trainer.load_checkpoint(cfg, args.checkpoint)
  File "C:\Users\ETMProject\Desktop\Few_vid2vid\imaginaire\imaginaire\trainers\base.py", line 280, in load_checkpoint
    checkpoint = torch.load(
  File "C:\Users\ETMProject.conda\envs\imaginaire\lib\site-packages\torch\serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "C:\Users\ETMProject.conda\envs\imaginaire\lib\site-packages\torch\serialization.py", line 777, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.

SLyra21 commented 2 years ago

Well, it still seems that there is an error with your paths... In "Checkpoint downloaded to C:\Users\ETMProject\Desktop\Few_vid2vid\imaginaire\checkpoints\configs/projects/fs_vid2vid/face_forensics/ampO1-1F_22ctFmo553nRHy1d_BX7aorc9zk9cF.pt" there is a change from "\" to "/", which seems unusual. Did you try to reinstall the repo after changing both paths in the files described above? The unit tests should pass; otherwise there is still an error.

Feather06 commented 2 years ago

I have not reinstalled it yet.

SLyra21 commented 2 years ago

After the installation, first try to run the unit tests by running the test_training.sh script in the "scripts" folder. This should run without errors.

SaharHusseini commented 2 years ago

Hello, I tried to run the command below:

python inference.py --single_gpu --num_workers 0 --config configs/projects/fs_vid2vid/face_forensics/ampO1.yaml --output_dir projects/fs_vid2vid/output/face_forensics

I got a similar error to the one mentioned in this issue, so I downloaded the two models (flownet and gancraft) manually and placed them in the correct location; however, I got the error below. Any advice? @Feather06 I noticed you got a similar error. Did you manage to solve this issue?

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "inference.py", line 96, in <module>
    main()
  File "inference.py", line 88, in main
    trainer.load_checkpoint(cfg, args.checkpoint)
  File "/medias/db/ImagingSecurity_misc/Sahar/imaginaire/imaginaire/trainers/base.py", line 319, in load_checkpoint
    net_G_module.load_pretrained_network(self.net_G, checkpoint['net_G'])
  File "/medias/db/ImagingSecurity_misc/Sahar/imaginaire/imaginaire/generators/fs_vid2vid.py", line 293, in load_pretrained_network
    kp = prefix + k
TypeError: unsupported operand type(s) for +: 'collections.OrderedDict' and 'str'
(//medias/db/ImagingSecurity_misc/Sahar/env/fs_vid2vid) -blutch$

Feather06 commented 2 years ago

I installed it on another Ubuntu machine and it ran.

SLyra21 commented 2 years ago

@SaharHusseini Did you first try to run the unit tests? You can see that there is a TypeError, so the checkpoint file is probably not being found due to a typo or a corrupt path. If you are using Windows, please be careful with '/' and '\'.

SaharHusseini commented 2 years ago

Hello,

Thank you for your advice. I tried to run the unit tests, but they run forever and never finish. They just pass the first two tests and then do not show anything.

However, I managed to run the code. First I used the model you proposed here: https://docs.google.com/uc?export=download&id=1NIh3_UZ6uqvzS4mJ4JVhfyYQuG9ZMmvA, and it did not work for me; then I manually downloaded the model from the MODEL ZOO and it worked.

I just think it would be good to mention in the documentation that the model can also be downloaded from the MODEL ZOO. It took me a few days to figure that out.