Closed: D-Mad closed this issue 2 years ago
Hi, I just had the same error. It seems there is a problem with the download of the FlowNet2 checkpoint. I fixed it by downloading the checkpoint manually here and changing line 30 in imaginaire/third_party/flow_net/flow_net.py. The same error occurred for the GANcraft unit test: download that checkpoint here and change line 36 in imaginaire/trainers/gancraft.py.
Best, Simon
@SLyra21 Hello! Can you elaborate more on how you changed the code in gancraft.py for the checkpoint?
@Skyrelixa Hi!
I did not really change anything except the path to the manually downloaded checkpoint:
from: ckpt = torch.load(checkpoint)
to: ckpt = torch.load("your_path_to_the_checkpoint_file")
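A minimal sketch of this workaround. The path, constant name, and helper function below are illustrative, not from the repository; the point is simply to fail early with a clear message when the manually downloaded file is missing, instead of letting torch.load raise on a missing or corrupt download:

```python
from pathlib import Path

# Hypothetical local path to the manually downloaded checkpoint.
CHECKPOINT_PATH = Path("checkpoints/flownet2.pth")

def resolve_checkpoint(path):
    """Return a validated checkpoint path as a string, raising a clear
    error if the file does not exist at that location."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(
            f"Checkpoint not found at {p}. "
            "Download it manually and update this path.")
    return str(p)

# In flow_net.py / gancraft.py one would then replace:
#   ckpt = torch.load(checkpoint)
# with:
#   ckpt = torch.load(resolve_checkpoint(CHECKPOINT_PATH))
```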
@SLyra21 Brilliant, thank you! :D
I have the same problem. I modified the code according to the solution above, but I still get the same error message. Any advice is greatly appreciated!
@Feather06 Did you download the new checkpoints and use the correct path? If the path is correct, the error should be gone. Did you try using an absolute path?
I also modified the path in base.py, and it seems fine now. Thank you!!!
I got a new error message after also changing the path at line 280 of base.py.
Traceback (most recent call last):
File "inference.py", line 94, in
You don't have to change base.py; the errors described above are only due to corrupt checkpoint files for flow_net and gancraft.
I changed base.py back, but I still get the error:
Checkpoint downloaded to C:\Users\ETMProject\Desktop\Few_vid2vid\imaginaire\checkpoints\configs/projects/fs_vid2vid/face_forensics/ampO1-1F_22ctFmo553nRHy1d_BX7aorc9zk9cF.pt
Traceback (most recent call last):
File "inference.py", line 94, in
On Windows you need to be careful with the slashes! Try using this path in torch.load:
"C:/Users/ETMProject/Desktop/Few_vid2vid/imaginaire/checkpoints/configs/projects/fs_vid2vid/face_forensics/ampO1-1F_22ctFmo553nRHy1d_BX7aorc9zk9cF.pt"
Here you can find some information, from "Referencing a File in Windows": In Windows, there are a couple of additional ways of referencing a file, because natively the Windows file path employs the backslash instead of the slash. Python allows using both on a Windows system, but there are a couple of pitfalls to watch out for. To sum them up:
Python lets you use OS X/Linux-style slashes "/" even on Windows, so you can refer to the file as 'C:/Users/narae/Desktop/alice.txt'. RECOMMENDED. If using backslashes, because the backslash is a special character in Python, you must remember to escape every instance: 'C:\\Users\\narae\\Desktop\\alice.txt'. Alternatively, you can prefix the entire file name string with the raw-string marker "r": r'C:\Users\narae\Desktop\alice.txt'. That way, everything in the string is interpreted as a literal character, and you don't have to escape every backslash. Source: https://sites.pitt.edu/~naraehan/python3/file_path_cwd.html
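The rules above can be checked directly in Python; all three spellings below denote the same Windows path (the path itself is just the example from the quoted page):

```python
from pathlib import PureWindowsPath

# Forward slashes: recommended, works on Windows too.
p1 = "C:/Users/narae/Desktop/alice.txt"
# Backslashes, each one escaped.
p2 = "C:\\Users\\narae\\Desktop\\alice.txt"
# Raw string: backslashes are taken literally, no escaping needed.
p3 = r"C:\Users\narae\Desktop\alice.txt"

assert p2 == p3                                     # identical strings
assert PureWindowsPath(p1) == PureWindowsPath(p2)   # same path once parsed
```

Note that an unescaped backslash such as '\U' in 'C:\Users\...' is not merely risky: in Python 3 it is a SyntaxError (invalid \U escape), which is why the escaping rule matters.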
I have tried both "/" and "\". Here is my command: python inference.py --single_gpu --num_workers 0 --config configs/projects/fs_vid2vid/face_forensics/ampO1.yaml --output_dir projects/fs_vid2vid/output/face_forensics
Using random seed 0
cudnn benchmark: True
cudnn deterministic: False
Creating metadata
['images', 'landmarks-dlib68']
Data file extensions: {'images': 'jpg', 'landmarks-dlib68': 'json'}
Searching in dir: images
Found 1 sequences
Found 1 files
['images', 'landmarks-dlib68']
Data file extensions: {'images': 'jpg', 'landmarks-dlib68': 'json'}
Searching in dir: images
Found 1 sequences
Found 30 files
Folder at projects/fs_vid2vid/test_data/faceForensics/reference\images opened.
Folder at projects/fs_vid2vid/test_data/faceForensics/reference\landmarks-dlib68 opened.
Folder at projects/fs_vid2vid/test_data/faceForensics/driving\images opened.
Folder at projects/fs_vid2vid/test_data/faceForensics/driving\landmarks-dlib68 opened.
Num datasets: 2
Num sequences: 2
Max sequence length: 30
Epoch length: 1
Using random seed 0
Concatenate images:
ext: jpg
num_channels: 3
normalize: True for input.
Num. of channels in the input image: 3
Concatenate images:
ext: jpg
num_channels: 3
normalize: True for input.
Concatenate landmarks-dlib68:
ext: json
num_channels: 1
interpolator: None
normalize: False
pre_aug_ops: decode_json
post_aug_ops: vis::imaginaire.utils.visualization.face::connect_face_keypoints for input.
Num. of channels in the input label: 1
Concatenate images:
ext: jpg
num_channels: 3
normalize: True for input.
Num. of channels in the input image: 3
Concatenate images:
ext: jpg
num_channels: 3
normalize: True for input.
Num. of channels in the input image: 3
Concatenate images:
ext: jpg
num_channels: 3
normalize: True for input.
Num. of channels in the input image: 3
Initialized temporal embedding network with the reference one.
Concatenate images:
ext: jpg
num_channels: 3
normalize: True for input.
Concatenate landmarks-dlib68:
ext: json
num_channels: 1
interpolator: None
normalize: False
pre_aug_ops: decode_json
post_aug_ops: vis::imaginaire.utils.visualization.face::connect_face_keypoints for input.
Num. of channels in the input label: 1
Concatenate images:
ext: jpg
num_channels: 3
normalize: True for input.
Num. of channels in the input image: 3
Initialize net_G and net_D weights using type: xavier gain: 0.02
Using random seed 0
net_G parameter count: 91,145,502
net_D parameter count: 5,593,922
Use custom initialization for the generator.
Setup trainer.
Using automatic mixed precision training.
Augmentation policy:
GAN mode: hinge
Perceptual loss:
Mode: vgg19
Loss GAN Weight 1.0
Loss FeatureMatching Weight 10.0
Loss Perceptual Weight 10.0
Loss Flow Weight 10.0
Loss Flow_L1 Weight 10.0
Loss Flow_Warp Weight 10.0
Loss Flow_Mask Weight 10.0
Checkpoint downloaded to C:\Users\ETMProject\Desktop\Few_vid2vid\imaginaire\checkpoints\configs/projects/fs_vid2vid/face_forensics/ampO1-1F_22ctFmo553nRHy1d_BX7aorc9zk9cF.pt
Traceback (most recent call last):
File "inference.py", line 94, in
Well, it still seems that there is an error with your paths... In "Checkpoint downloaded to C:\Users\ETMProject\Desktop\Few_vid2vid\imaginaire\checkpoints\configs/projects/fs_vid2vid/face_forensics/ampO1-1F_22ctFmo553nRHy1d_BX7aorc9zk9cF.pt" the separator changes from "\" to "/", which seems unusual. Did you try to reinstall the repo after changing both paths in the files described above? The unit tests should pass; otherwise there is still an error.
I have not reinstalled it yet.
After the installation, first try running the unit tests via the test_training.sh script in the "scripts" folder. It should run without errors.
Hello, I tried to run the command below:
python inference.py --single_gpu --num_workers 0 --config configs/projects/fs_vid2vid/face_forensics/ampO1.yaml --output_dir projects/fs_vid2vid/output/face_forensics
I got a similar error to the one mentioned in this issue, so I downloaded the two models (flownet and gancraft) manually and placed them in the correct location; however, I then got the error below. Any advice? @Feather06 I noticed you got a similar error. Did you manage to solve this issue?
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "inference.py", line 96, in <module>
    main()
  File "inference.py", line 88, in main
    trainer.load_checkpoint(cfg, args.checkpoint)
  File "/medias/db/ImagingSecurity_misc/Sahar/imaginaire/imaginaire/trainers/base.py", line 319, in load_checkpoint
    net_G_module.load_pretrained_network(self.net_G, checkpoint['net_G'])
  File "/medias/db/ImagingSecurity_misc/Sahar/imaginaire/imaginaire/generators/fs_vid2vid.py", line 293, in load_pretrained_network
    kp = prefix + k
TypeError: unsupported operand type(s) for +: 'collections.OrderedDict' and 'str'
(//medias/db/ImagingSecurity_misc/Sahar/env/fs_vid2vid) -blutch$
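The TypeError itself is easy to reproduce. A plausible cause (an assumption on my part, not confirmed in the repo) is that load_pretrained_network expects a flat state dict mapping parameter names (strings) to tensors, but a corrupt or wrongly-nested checkpoint hands it an OrderedDict where a string is expected, so the string concatenation kp = prefix + k fails. A minimal illustration, with floats standing in for tensors:

```python
from collections import OrderedDict

state = OrderedDict([("conv1.weight", 0.0)])

# Intended case: a string prefix concatenated with string keys.
ok = "module." + next(iter(state))   # "module.conv1.weight"

# If an OrderedDict ends up where the prefix string should be,
# the concatenation reproduces the reported error message.
try:
    bad = state + "conv1.weight"
except TypeError as e:
    msg = str(e)  # unsupported operand type(s) for +: 'collections.OrderedDict' and 'str'
```

In other words, the error is a symptom of a malformed checkpoint (or of loading the wrong object from it), not of the concatenation logic itself.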
I installed it on another Ubuntu machine and it ran.
@SaharHusseini Did you first try to run the unit tests? You can see that there is a TypeError, which suggests the checkpoint file is not being loaded correctly due to a typo or a corrupt path. If you use Windows, please be careful with '/' and '\'.
Hello,
Thank you for your advice. I tried to run the unit tests, but they run forever and never finish. They pass the first two tests and then nothing more is shown.
However, I managed to run the code. First I used the model you proposed here: https://docs.google.com/uc?export=download&id=1NIh3_UZ6uqvzS4mJ4JVhfyYQuG9ZMmvA, and it did not work for me; then I manually downloaded the model from the MODEL ZOO and it worked.
I just think it would be good to mention in the documentation that the model can also be downloaded from the MODEL ZOO. It took me a few days to figure that out.
root@4460709f4b11:/workspace# bash scripts/test_training.sh /opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead logger.warn( The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run WARNING:torch.distributed.run:
torch.distributed.launch
is Deprecated. Use torch.distributed.run INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs: entrypoint : train.py min_nodes : 1 max_nodes : 1 nproc_per_node : 1 run_id : none rdzv_backend : static rdzv_endpoint : 127.0.0.1:29500 rdzv_configs : {'rank': 0, 'timeout': 900} max_restarts : 3 monitor_interval : 5 log_dir : None metrics_cfg : {}INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_oe27bhot/none_vqivpge9 INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_oe27bhot/none_vqivpge9/attempt_0/0/error.json Downloading: "https://download.pytorch.org/models/vgg19-dcbb9e9d.pth" to /root/.cache/torch/hub/checkpoints/vgg19-dcbb9e9d.pth 100%|█████████████████████████████████████████████████████████████████████| 548M/548M [01:40<00:00, 5.71MB/s] /opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:3590: UserWarning: Default upsampling behavior when mode=bicubic is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details. warnings.warn( /opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:3638: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details. warnings.warn( /opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at ../c10/core/TensorImpl.h:1153.) return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. 
(function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish. INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0005445480346679688 seconds {"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "386", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 125, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 0}} {"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 125, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}} python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/spade.yaml >> /tmp/unit_test.log [Success] 
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead logger.warn( The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run WARNING:torch.distributed.run:
torch.distributed.launch
is Deprecated. Use torch.distributed.run INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs: entrypoint : train.py min_nodes : 1 max_nodes : 1 nproc_per_node : 1 run_id : none rdzv_backend : static rdzv_endpoint : 127.0.0.1:29500 rdzv_configs : {'rank': 0, 'timeout': 900} max_restarts : 3 monitor_interval : 5 log_dir : None metrics_cfg : {}INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_4bbnta_g/none_vmrfh3_4 INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_4bbnta_g/none_vmrfh3_4/attempt_0/0/error.json /opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at ../c10/core/TensorImpl.h:1153.) return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish. INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. 
Elapsed: 0.0006163120269775391 seconds {"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "936", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 0}} {"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}} python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/pix2pixHD.yaml >> /tmp/unit_test.log [Success] /opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead logger.warn( The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run WARNING:torch.distributed.run:
torch.distributed.launch
is Deprecated. Use torch.distributed.run INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs: entrypoint : train.py min_nodes : 1 max_nodes : 1 nproc_per_node : 1 run_id : none rdzv_backend : static rdzv_endpoint : 127.0.0.1:29500 rdzv_configs : {'rank': 0, 'timeout': 900} max_restarts : 3 monitor_interval : 5 log_dir : None metrics_cfg : {}INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_62ca57fw/none_cbejg5ie INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_62ca57fw/none_cbejg5ie/attempt_0/0/error.json [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish. INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0006213188171386719 seconds {"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "1313", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 0}} {"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}} python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/munit.yaml >> /tmp/unit_test.log [Success] /opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method 
is deprecated, use 'warning' instead logger.warn( The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run WARNING:torch.distributed.run:
torch.distributed.launch
is Deprecated. Use torch.distributed.run INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs: entrypoint : train.py min_nodes : 1 max_nodes : 1 nproc_per_node : 1 run_id : none rdzv_backend : static rdzv_endpoint : 127.0.0.1:29500 rdzv_configs : {'rank': 0, 'timeout': 900} max_restarts : 3 monitor_interval : 5 log_dir : None metrics_cfg : {}INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_peoaucrn/none_je_rxzez INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_peoaucrn/none_je_rxzez/attempt_0/0/error.json [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish. INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0005629062652587891 seconds {"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "1561", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 0}} {"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}} python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/munit_patch.yaml >> /tmp/unit_test.log [Success] /opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' 
method is deprecated, use 'warning' instead logger.warn( The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run WARNING:torch.distributed.run:
torch.distributed.launch
is Deprecated. Use torch.distributed.run INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs: entrypoint : train.py min_nodes : 1 max_nodes : 1 nproc_per_node : 1 run_id : none rdzv_backend : static rdzv_endpoint : 127.0.0.1:29500 rdzv_configs : {'rank': 0, 'timeout': 900} max_restarts : 3 monitor_interval : 5 log_dir : None metrics_cfg : {}INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_gf_rgw_1/none_yd9hagbt INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_gf_rgw_1/none_yd9hagbt/attempt_0/0/error.json /opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:3638: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details. warnings.warn( [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool) INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish. INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish /opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future. warnings.warn( INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. 
Elapsed: 0.0006194114685058594 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "1809", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 15, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 15, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}}
python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/unit.yaml >> /tmp/unit_test.log [Success]
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future. Migrate to torch.distributed.run
WARNING:torch.distributed.run:torch.distributed.launch is Deprecated. Use torch.distributed.run
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : train.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 1
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastick4k83su/nonedoq690x
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[1]
  global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastick4k83su/nonedoq690x/attempt_0/0/error.json
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents.
Elapsed: 0.0004730224609375 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "2358", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 15, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 15, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}}
python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/funit.yaml >> /tmp/unit_test.log [Success]
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future. Migrate to torch.distributed.run
WARNING:torch.distributed.run:torch.distributed.launch is Deprecated. Use torch.distributed.run
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : train.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 1
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_qmiyb0yp/none_nw6jcn_x
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[1]
  global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_qmiyb0yp/none_nw6jcn_x/attempt_0/0/error.json
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents.
Elapsed: 0.0007641315460205078 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "2600", "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 15, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "4460709f4b11", "state": "SUCCEEDED", "total_run_time": 15, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}}
python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/coco_funit.yaml >> /tmp/unit_test.log [Success]
Traceback (most recent call last):
  File "train.py", line 168, in <module>
    main()
  File "train.py", line 92, in main
    trainer = get_trainer(cfg, net_G, net_D,
  File "/workspace/imaginaire/utils/trainer.py", line 59, in get_trainer
    trainer = trainer_lib.Trainer(cfg, net_G, net_D,
  File "/workspace/imaginaire/trainers/vid2vid.py", line 44, in __init__
    super(Trainer, self).__init__(cfg, net_G, net_D, opt_G,
  File "/workspace/imaginaire/trainers/base.py", line 99, in __init__
    self._init_loss(cfg)
  File "/workspace/imaginaire/trainers/vid2vid.py", line 145, in _init_loss
    self.criteria['Flow'] = FlowLoss(cfg)
  File "/workspace/imaginaire/losses/flow.py", line 59, in __init__
    self.flowNet = flow_module.FlowNet(pretrained=True)
  File "/workspace/imaginaire/third_party/flow_net/flow_net.py", line 30, in __init__
    checkpoint = torch.load(flownet2_path,
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, pickle_load_args)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 777, in _legacy_load
    magic_number = pickle_module.load(f, pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.
python train.py --single_gpu --config configs/unit_test/vid2vid_street.yaml >> /tmp/unit_test.log [Failure]
root@4460709f4b11:/workspace#
`
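For anyone hitting the same `_pickle.UnpicklingError: invalid load key, '<'.`: that error means `torch.load` found the character `<` where the pickle magic number should be, which is exactly what happens when the automatic download saved an HTML error page instead of the real checkpoint. A quick sketch of how you could verify this before re-downloading manually (the file path here is a placeholder, not part of the repo):

```python
import os
import tempfile


def looks_like_html(path):
    """Return True if the file starts with '<', i.e. it is most likely an
    HTML error page rather than a pickled PyTorch checkpoint."""
    with open(path, "rb") as f:
        return f.read(1) == b"<"


# Demo with a fake "checkpoint" that is really an HTML error page.
fd, fake_ckpt = tempfile.mkstemp(suffix=".pt")
with os.fdopen(fd, "wb") as f:
    f.write(b"<html><body>403 Forbidden</body></html>")

print(looks_like_html(fake_ckpt))  # True -> corrupt download, fetch it manually
os.remove(fake_ckpt)
```

If the check returns True, delete the file, download the checkpoint manually as described above, and point `torch.load` at the new path.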