Doubiiu / DynamiCrafter

[ECCV 2024] DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
Apache License 2.0

An error occurs when trying to train: torch.distributed.elastic.multiprocessing.errors.ChildFailedError #81

Closed caslix closed 1 month ago

caslix commented 1 month ago

Hello! I get an error when I try to train the model. What could be the problem? Thank you.


    Configing Model
    LatentVisualDiffusion: Running in v-prediction mode
    AE working on z of shape (1, 4, 32, 32) = 4096 dimensions.
    Load weights from pretrained checkpoint
    INFO:mainlogger:>>> Load weights from pretrained checkpoint
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 57853) of binary: /usr/bin/python3
    Traceback (most recent call last):
      File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
        main()
      File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
        launch(args)
      File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
        run(args)
      File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
        elastic_launch(
      File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

    ./main/trainer.py FAILED

    Failures:

    ------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time : 2024-05-22_13:49:52
      host : Workstation.
      rank : 0 (local_rank: 0)
      exitcode : -9 (pid: 57853)
      error_file:
      traceback : Signal 9 (SIGKILL) received by PID 57853
    ======================================================

caslix commented 1 month ago

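I found the cause of the error: WSL did not have enough RAM. The exit code -9 in the log means the training process received SIGKILL, which is consistent with it being killed when memory ran out. I raised the WSL memory limit in the .wslconfig file, and training now works.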
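
For anyone who hits the same failure, here is a minimal Python sketch for checking the two facts above before relaunching: what a negative exit code means, and how much memory the Linux side of WSL actually sees. It is not part of DynamiCrafter; the function names `describe_exit_code`, `parse_meminfo`, and `memory_gib` are made up for illustration, and it assumes a Linux environment where `/proc/meminfo` exists. The WSL 2 limit itself is set with the `memory` key in the `[wsl2]` section of `.wslconfig` in the Windows user profile folder, followed by `wsl --shutdown` so the new value is picked up; the right value depends on the machine and the training config.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Turn a launcher exit code into a readable description.

    A negative exit code means the child process was terminated by the
    signal with that number (so -9 corresponds to signal 9).
    """
    if exitcode < 0:
        return f"killed by {signal.Signals(-exitcode).name}"
    return f"exited with status {exitcode}"


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Memory fields such as MemTotal and MemAvailable are reported in kB;
    a few other fields (for example HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def memory_gib(meminfo_path: str = "/proc/meminfo") -> tuple:
    """Return (total, available) memory in GiB as seen inside Linux/WSL."""
    with open(meminfo_path) as f:
        info = parse_meminfo(f.read())
    kib_per_gib = 1024 ** 2
    return info["MemTotal"] / kib_per_gib, info["MemAvailable"] / kib_per_gib
```

Calling `describe_exit_code(-9)` names the signal behind the exit code from the log, and `memory_gib()` run inside WSL shows whether a change to `.wslconfig` actually took effect. If the total is still well below the host's physical RAM, the limit is probably still in place.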

caslix commented 1 month ago

Done!