GaParmar / img2img-turbo

One-step image-to-image with Stable Diffusion turbo: sketch2image, day2night, and more
MIT License

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory #74

Open LinQiWei-cs opened 1 month ago

LinQiWei-cs commented 1 month ago

I have a question and would appreciate some help. I trained the unpaired CycleGAN-Turbo model on my own dataset, and the training script fails with the following error:

      File "/home/JJ_Group/linqw2405/img2img-turbo/venv/lib/python3.10/site-packages/cleanfid/inception_torchscript.py", line 35, in __init__
        self.base = torch.jit.load(path).eval()
      File "/home/JJ_Group/linqw2405/img2img-turbo/venv/lib/python3.10/site-packages/torch/jit/_serialization.py", line 162, in load
        cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files, _restore_shapes)  # type: ignore[call-arg]
    RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 834576) of binary: /home/JJ_Group/linqw2405/img2img-turbo/venv/bin/python3
    Traceback (most recent call last):
      File "/home/JJ_Group/linqw2405/img2img-turbo/venv/bin/accelerate", line 8, in <module>
        sys.exit(main())
      File "/home/JJ_Group/linqw2405/img2img-turbo/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
        args.func(args)
      File "/home/JJ_Group/linqw2405/img2img-turbo/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1088, in launch_command
        multi_gpu_launcher(args)
      File "/home/JJ_Group/linqw2405/img2img-turbo/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 733, in multi_gpu_launcher
        distrib_run.run(args)
      File "/home/JJ_Group/linqw2405/img2img-turbo/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
        elastic_launch(
      File "/home/JJ_Group/linqw2405/img2img-turbo/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/JJ_Group/linqw2405/img2img-turbo/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


I checked that my VGG and DINO weights are intact and not corrupted. The error is also intermittent.

GaParmar commented 1 month ago

This error is from the clean-fid evaluation code. Can you try deleting this file /tmp/inception-2015-12-05.pt and see if the error persists? It looks like that file was not downloaded correctly. -Gaurav
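One way to check whether a cached file is an incomplete download before deleting it is sketched below. PyTorch archives are zip files, and "failed finding central directory" typically indicates that the end of the archive is missing. The helper names `looks_truncated` and `remove_if_truncated` are hypothetical, and the check uses only the Python standard library rather than any clean-fid or PyTorch API, so treat it as a rough heuristic rather than a definitive test.

```python
import zipfile
from pathlib import Path

# Signature found at the start of a zip archive that contains at least one entry.
ZIP_MAGIC = b"PK\x03\x04"


def looks_truncated(path):
    """Return True if the file starts like a zip archive but its
    end-of-central-directory record cannot be found."""
    path = Path(path)
    with path.open("rb") as f:
        starts_like_zip = f.read(4) == ZIP_MAGIC
    # zipfile.is_zipfile looks for the central directory record near the end
    # of the file, which is the part a cut-off download is missing.
    return starts_like_zip and not zipfile.is_zipfile(path)


def remove_if_truncated(path):
    """Delete the file only when it looks incomplete; return True if deleted."""
    path = Path(path)
    if path.is_file() and looks_truncated(path):
        path.unlink()
        return True
    return False


if __name__ == "__main__":
    # Path taken from the comment above.
    cached = "/tmp/inception-2015-12-05.pt"
    if remove_if_truncated(cached):
        print(f"removed {cached}; it should be downloaded again on the next run")
    else:
        print(f"{cached} is missing or does not look truncated")
```

Empty files and files in a non-zip format are deliberately left alone by this sketch, since they would fail with a different error.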

zhangsngood commented 1 month ago

Has this problem been solved? I have run into the same issue.

GaParmar commented 1 week ago

@zhangsngood Could you try deleting /tmp/inception-2015-12-05.pt and rerunning the code to see if the problem gets fixed?

xinlin-xiao commented 2 days ago

> Could you try deleting /tmp/inception-2015-12-05.pt and rerunning the code to see if the problem gets fixed?

@GaParmar Deleting /tmp/inception-2015-12-05.pt and rerunning the code did not fix it. Do you have any other suggestions?

xinlin-xiao commented 2 days ago

@GaParmar

I trained unpaired CycleGAN-Turbo following the example in [training_cyclegan_turbo.md](https://github.com/GaParmar/img2img-turbo/blob/main/docs/training_cyclegan_turbo.md).

It fails with RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

I tried deleting /tmp/inception-2015-12-05.pt and rerunning the code, but that did not fix it. Do you have any other suggestions?

    (img2img-turbo) root@trans72-train-02:/mnt/data1/download_new/img2img-turbo# accelerate launch --main_process_port 29501 src/train_cyclegan_turbo.py --pretrained_model_name_or_path="stabilityai/sd-turbo" --output_dir="output/cyclegan_turbo/my_horse2zebra" --dataset_folder "data/my_horse2zebra" --train_img_prep "resize_286_randomcrop_256x256_hflip" --val_img_prep "no_resize" --learning_rate="1e-5" --max_train_steps=25000 --train_batch_size=1 --gradient_accumulation_steps=1 --report_to "wandb" --tracker_project_name "gparmar_unpaired_h2z_cycle_debug_v2" --enable_xformers_memory_efficient_attention --validation_steps 250 --lambda_gan 0.5 --lambda_idt 1 --lambda_cycle 1
    Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
    /mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
      warnings.warn(
    /mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
      warnings.warn(
    /mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
      warnings.warn(
    /mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
      warnings.warn(
    /mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/torch/optim/adamw.py:50: UserWarning: optimizer contains a parameter group with duplicate parameters; in future, this will cause an error; see github.com/pytorch/pytorch/issues/40967 for more information
      super().__init__(params, defaults)
    Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off]
    /mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
      warnings.warn(
    /mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=VGG16_Weights.IMAGENET1K_V1`. You can also use `weights=VGG16_Weights.DEFAULT` to get the most up-to-date weights.
      warnings.warn(msg)
    /mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/torch/optim/adamw.py:50: UserWarning: optimizer contains a parameter group with duplicate parameters; in future, this will cause an error; see github.com/pytorch/pytorch/issues/40967 for more information
      super().__init__(params, defaults)
    Loading model from: /mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/lpips/weights/v0.1/vgg.pth
    100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 140/140 [00:00<00:00, 768.36it/s]
    Found 140 images in the folder output/cyclegan_turbo/my_horse2zebra/fid_reference_a2b
    100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:10<00:00, 1.69it/s]
    100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 120/120 [00:00<00:00, 895.92it/s]
    Found 120 images in the folder output/cyclegan_turbo/my_horse2zebra/fid_reference_b2a
    100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:01<00:00, 12.73it/s]
    Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off]
    /mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
      warnings.warn(
    /mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=VGG16_Weights.IMAGENET1K_V1`. You can also use `weights=VGG16_Weights.DEFAULT` to get the most up-to-date weights.
      warnings.warn(msg)
    Loading model from: /mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/lpips/weights/v0.1/vgg.pth
    wandb: Currently logged in as: screw. Use `wandb login --relogin` to force relogin
    wandb: Tracking run with wandb version 0.17.9
    wandb: Run data is saved locally in /mnt/data1/download_new/img2img-turbo/wandb/run-20240910_173030-4qyk19p6
    wandb: Run `wandb offline` to turn off syncing.
    wandb: Syncing run glorious-butterfly-9
    wandb: ⭐️ View project at https://wandb.ai/screw/gparmar_unpaired_h2z_cycle_debug_v2
    wandb: 🚀 View run at https://wandb.ai/screw/gparmar_unpaired_h2z_cycle_debug_v2/runs/4qyk19p6
    Steps: 0%| | 0/25000 [00:00<?, ?it/s]/mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
      return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
    /mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
      return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
    Steps: 0%| | 1/25000 [00:06<46:43:48, 6.73s/it]Downloading: "https://github.com/facebookresearch/dino/zipball/main" to /root/.cache/torch/hub/main.zip
    Traceback (most recent call last):
      File "/mnt/data1/download_new/img2img-turbo/src/train_cyclegan_turbo.py", line 390, in <module>
        main(args)
      File "/mnt/data1/download_new/img2img-turbo/src/train_cyclegan_turbo.py", line 314, in main
        net_dino = DinoStructureLoss()
      File "/mnt/data1/download_new/img2img-turbo/src/my_utils/dino_struct.py", line 171, in __init__
        self.extractor = VitExtractor(model_name="dino_vitb8", device="cuda")
      File "/mnt/data1/download_new/img2img-turbo/src/my_utils/dino_struct.py", line 23, in __init__
        self.model = torch.hub.load('facebookresearch/dino:main', model_name).to(device)
      File "/mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/torch/hub.py", line 558, in load
        model = _load_local(repo_or_dir, model, *args, **kwargs)
      File "/mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/torch/hub.py", line 587, in _load_local
        model = entry(*args, **kwargs)
      File "/root/.cache/torch/hub/facebookresearch_dino_main/hubconf.py", line 74, in dino_vitb8
        state_dict = torch.hub.load_state_dict_from_url(
      File "/mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/torch/hub.py", line 750, in load_state_dict_from_url
        return torch.load(cached_file, map_location=map_location)
      File "/mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/torch/serialization.py", line 797, in load
        with _open_zipfile_reader(opened_file) as opened_zipfile:
      File "/mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/torch/serialization.py", line 283, in __init__
        super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
    RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
    wandb: 🚀 View run glorious-butterfly-9 at: https://wandb.ai/screw/gparmar_unpaired_h2z_cycle_debug_v2/runs/4qyk19p6
    wandb: ⭐️ View project at: https://wandb.ai/screw/gparmar_unpaired_h2z_cycle_debug_v2
    wandb: Synced 6 W&B file(s), 6 media file(s), 0 artifact file(s) and 0 other file(s)
    wandb: Find logs at: ./wandb/run-20240910_173030-4qyk19p6/logs
    wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17710 closing signal SIGTERM
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 17709) of binary: /mnt/data1/conda/envs/img2img-turbo/bin/python
    Traceback (most recent call last):
      File "/mnt/data1/conda/envs/img2img-turbo/bin/accelerate", line 8, in <module>
        sys.exit(main())
      File "/mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
        args.func(args)
      File "/mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1165, in launch_command
        multi_gpu_launcher(args)
      File "/mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
        distrib_run.run(args)
      File "/mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
        elastic_launch(
      File "/mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/mnt/data1/conda/envs/img2img-turbo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

    src/train_cyclegan_turbo.py FAILED
    Failures:
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time : 2024-09-10_17:31:45
      host : trans72-train-02
      rank : 0 (local_rank: 0)
      exitcode : 1 (pid: 17709)
      error_file:
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ============================================================
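
A note on the log above: the traceback fails inside `torch.hub.load_state_dict_from_url` while loading the DINO `dino_vitb8` weights, not in the clean-fid Inception file, so deleting /tmp/inception-2015-12-05.pt would not be expected to affect this particular failure. A minimal sketch for listing checkpoint files that look like incomplete downloads follows. It assumes the checkpoints sit in a `checkpoints` folder under the hub directory seen in the log (/root/.cache/torch/hub); `torch.hub.get_dir()` reports the actual hub directory on a given machine. The function name `find_truncated_checkpoints` is hypothetical, and only the standard library is used.

```python
import zipfile
from pathlib import Path


def find_truncated_checkpoints(cache_dir, suffixes=(".pt", ".pth")):
    """List files in cache_dir that begin with a zip signature but have no
    readable central directory, which is what a cut-off download looks like."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    suspects = []
    for path in sorted(cache_dir.iterdir()):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        with path.open("rb") as f:
            head = f.read(4)
        # Non-zip files are skipped: older PyTorch checkpoints use a non-zip
        # format, so a missing zip signature alone does not imply corruption.
        if head == b"PK\x03\x04" and not zipfile.is_zipfile(path):
            suspects.append(path)
    return suspects


if __name__ == "__main__":
    # Assumed location, based on the hub directory shown in the log above.
    checkpoints = Path.home() / ".cache" / "torch" / "hub" / "checkpoints"
    for path in find_truncated_checkpoints(checkpoints):
        print(f"possibly truncated, consider deleting and rerunning: {path}")
```

Files flagged this way can be removed by hand so they are downloaded again on the next run; the sketch itself does not delete anything.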