LinQiWei-cs commented 1 month ago

I need help with the following problem. I trained unpaired cycle-gan-turbo on my own dataset, and the training script fails with this error:

      File "/home/JJ_Group/linqw2405/img2img-turbo/venv/lib/python3.10/site-packages/cleanfid/inception_torchscript.py", line 35, in __init__
        self.base = torch.jit.load(path).eval()
      File "/home/JJ_Group/linqw2405/img2img-turbo/venv/lib/python3.10/site-packages/torch/jit/_serialization.py", line 162, in load
        cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files, _restore_shapes)  # type: ignore[call-arg]
    RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 834576) of binary: /home/JJ_Group/linqw2405/img2img-turbo/venv/bin/python3
    Traceback (most recent call last):
      File "/home/JJ_Group/linqw2405/img2img-turbo/venv/bin/accelerate", line 8, in <module>
        sys.exit(main())
      File "/home/JJ_Group/linqw2405/img2img-turbo/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
        args.func(args)
      File "/home/JJ_Group/linqw2405/img2img-turbo/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1088, in launch_command
        multi_gpu_launcher(args)
      File "/home/JJ_Group/linqw2405/img2img-turbo/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 733, in multi_gpu_launcher
        distrib_run.run(args)
      File "/home/JJ_Group/linqw2405/img2img-turbo/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
        elastic_launch(
      File "/home/JJ_Group/linqw2405/img2img-turbo/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/JJ_Group/linqw2405/img2img-turbo/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I verified that my VGG/DINO weights are intact and not corrupted. The error is also intermittent: it does not occur on every run.

GaParmar commented 1 month ago

This error is from the clean-fid evaluation code. Can you try deleting this file /tmp/inception-2015-12-05.pt and see if the error persists? It looks like that file was not downloaded correctly. -Gaurav
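The "failed finding central directory" message suggests the loader expects a zip archive and cannot find the index at the end of the file, which is what a truncated download would look like. Below is a minimal sketch for checking the cached file before deleting it, using only the Python standard library. The helper name `remove_if_corrupt` is hypothetical, and `zipfile.is_zipfile` is only a cheap check for the end-of-archive record, not a full validation of the weights.

```python
import os
import zipfile


def remove_if_corrupt(path: str) -> bool:
    """Delete `path` if it exists but does not look like a zip archive.

    Returns True if a file was removed, False otherwise.
    (Hypothetical helper, not part of clean-fid or img2img-turbo.)
    """
    if not os.path.exists(path):
        # Nothing cached yet, so there is nothing to clean up.
        return False
    if zipfile.is_zipfile(path):
        # The end-of-central-directory record was found; keep the file.
        return False
    # The file is present but has no zip index, e.g. a partial download.
    os.remove(path)
    return True
```

Calling `remove_if_corrupt("/tmp/inception-2015-12-05.pt")` before launching training would remove the file only when it fails this check, so the next run can fetch a fresh copy.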

zhangsngood commented 1 month ago

Has this problem been solved? I am running into the same issue.

GaParmar commented 1 week ago

@zhangsngood Could you try deleting /tmp/inception-2015-12-05.pt and rerunning the code to see if the problem gets fixed?

xinlin-xiao commented 2 days ago

Could you try deleting /tmp/inception-2015-12-05.pt and rerunning the code to see if the problem gets fixed?

@GaParmar Deleting /tmp/inception-2015-12-05.pt and rerunning the code did not fix it. Do you have any other suggestions?

xinlin-xiao commented 2 days ago

@GaParmar

I trained unpaired cycle-gan-turbo following the example in [training_cyclegan_turbo.md](https://github.com/GaParmar/img2img-turbo/blob/main/docs/training_cyclegan_turbo.md).

It fails with RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

I tried deleting /tmp/inception-2015-12-05.pt and rerunning the code, but that did not fix it. Do you have any other suggestions?

    api.py", line 250, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    src/train_cyclegan_turbo.py FAILED
    Failures:
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time : 2024-09-10_17:31:45
      host : trans72-train-02
      rank : 0 (local_rank: 0)
      exitcode : 1 (pid: 17709)
      error_file:
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ============================================================
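Deleting the file did not help here, and the first report describes the failure as intermittent. One possibility, not confirmed by anyone in this thread, is that several processes started by the multi-GPU launcher touch the cached file at the same time, so one reads it while another is still writing it. Below is a minimal sketch of a common guard against that, assuming the download can be pointed at a temporary path: write to a temporary file in the same directory, then publish it with an atomic rename. The names `ensure_file_atomically` and `fetch` are hypothetical and are not part of clean-fid or img2img-turbo.

```python
import os
import tempfile
from typing import Callable


def ensure_file_atomically(dest: str, fetch: Callable[[str], None]) -> str:
    """Create `dest` by calling `fetch(tmp_path)` and renaming into place.

    Other processes see either no file or the complete file, never a
    partially written one. `fetch` is a hypothetical downloader that
    writes the full content to the path it is given.
    """
    if os.path.exists(dest):
        # A complete copy has already been published.
        return dest
    dest_dir = os.path.dirname(os.path.abspath(dest))
    # The temporary file must be in the same directory so that the
    # final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    os.close(fd)
    try:
        fetch(tmp_path)
        os.replace(tmp_path, dest)  # atomic rename on the same filesystem
    finally:
        # Clean up the temporary file if fetch failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return dest
```

If two processes race, both may download, but each rename publishes a complete file, so a reader never sees a half-written archive. Whether this matches the actual cause in this issue would need to be verified against the clean-fid download code.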

GaParmar / img2img-turbo

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory #74
