PKU-YuanGroup / Open-Sora-Plan

This project aims to reproduce Sora (OpenAI's T2V model); we hope the open-source community will contribute to this project.
MIT License
11.6k stars, 1.03k forks

torch.distributed.elastic.multiprocessing.errors.ChildFailedError #212

Open Taldhi opened 7 months ago

Taldhi commented 7 months ago

We have encountered the following errors while attempting to execute the train_vidae.sh script.

  1. torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    [2024-04-10 10:23:00,020] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1115) of binary: /home/pritam/anaconda3/envs/opensora/bin/python
    Traceback (most recent call last):
      File "/home/pritam/anaconda3/envs/opensora/bin/accelerate", line 8, in <module>
        sys.exit(main())
      File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
        args.func(args)
      File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1042, in launch_command
        deepspeed_launcher(args)
      File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/accelerate/commands/launch.py", line 754, in deepspeed_launcher
        distrib_run.run(args)
      File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
        elastic_launch(
      File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

  2. AttributeError: 'FieldInfo' object has no attribute 'required'

LinB203 commented 7 months ago

This is due to a version conflict in the installation environment. Please follow the latest requirements.
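A quick way to compare the environment against the project's requirements is to print the installed versions of the packages involved. The sketch below is only an illustration: the package list is an assumption (torch, deepspeed, and accelerate appear in the traceback; pydantic is included because `'FieldInfo' object has no attribute 'required'` is commonly reported when code written for pydantic 1.x runs against pydantic 2.x, which this thread does not confirm), and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed list of packages worth checking; adjust to requirements.txt.
    suspects = ["torch", "deepspeed", "accelerate", "pydantic"]
    for name, version in installed_versions(suspects).items():
        print(f"{name}: {version or 'not installed'}")
```

Comparing this output line by line with the pinned versions in requirements.txt should show which package, if any, is out of step.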

Taldhi commented 7 months ago

> This is due to a version conflict in the installation environment. Please follow the latest requirements.

I set up a fresh environment using your updated requirements.txt. I am using a single GPU, a Quadro GV100 (32 GB), and your scripts/text_condition/train_imageae.sh for training, but I am still facing the same issue, as shown below.

I used a small portion of the Mixkit dataset and adjusted the JSON file accordingly.

Steps: 0%| | 0/1000000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "opensora/train/train_t2v.py", line 807, in <module>
    main(args)
  File "opensora/train/train_t2v.py", line 439, in main
    loss_dict = diffusion.training_losses(model, x, t, model_kwargs)
  File "/home/pritam/workspace/Open-Sora-Plan/opensora/models/diffusion/diffusion/respace.py", line 166, in training_losses
    return super().training_losses(self._wrap_model(model), *args, **kwargs)
  File "/home/pritam/workspace/Open-Sora-Plan/opensora/models/diffusion/diffusion/gaussian_diffusion_t2v.py", line 761, in training_losses
    model_output = model(x_t, t, **model_kwargs)
  File "/home/pritam/workspace/Open-Sora-Plan/opensora/models/diffusion/diffusion/respace.py", line 198, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1833, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pritam/workspace/Open-Sora-Plan/opensora/models/diffusion/latte/modeling_latte.py", line 974, in forward
    hidden_states = torch.utils.checkpoint.checkpoint(
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 458, in checkpoint
    ret = function(*args, **kwargs)
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pritam/workspace/Open-Sora-Plan/opensora/models/diffusion/latte/modules.py", line 1501, in forward
    attn_output = self.attn1(
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pritam/workspace/Open-Sora-Plan/opensora/models/diffusion/latte/modules.py", line 658, in forward
    return self.processor(
  File "/home/pritam/workspace/Open-Sora-Plan/opensora/models/diffusion/latte/modules.py", line 903, in __call__
    hidden_states = F.scaled_dot_product_attention(
RuntimeError: cutlassF: no kernel found to launch!
[2024-04-12 13:18:04,949] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1869) of binary: /home/pritam/anaconda3/envs/opensora/bin/python
Traceback (most recent call last):
  File "/home/pritam/anaconda3/envs/opensora/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1042, in launch_command
    deepspeed_launcher(args)
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/accelerate/commands/launch.py", line 754, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
opensora/train/train_t2v.py FAILED
Failures:

Root Cause (first observed failure):
[0]:
  time      : 2024-04-12_13:18:04
  host      : ubuntu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1869)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Also, when I tried to train using sh scripts/text_condition/train_videoae_17x256x256.sh, it threw an additional error, as follows:

Traceback (most recent call last):
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
    response.raise_for_status()
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/CausalVAEModel_4x8x8/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/diffusers/configuration_utils.py", line 380, in load_config
    config_file = hf_hub_download(
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1403, in hf_hub_download
    raise head_call_error
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1261, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1674, in get_hf_file_metadata
    r = _request_wrapper(
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 369, in _request_wrapper
    response = _request_wrapper(
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 393, in _request_wrapper
    hf_raise_for_status(response)
  File "/home/pritam/anaconda3/envs/opensora/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 352, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-6619398d-556e9b364b021bb1567f2f12;81b954f8-ea61-4275-86e7-fb28d20fbe3b)

Repository Not Found for url: https://huggingface.co/CausalVAEModel_4x8x8/resolve/main/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

Any suggestions would be greatly appreciated. Thank you.

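The URL in the `RepositoryNotFoundError` suggests that the string `CausalVAEModel_4x8x8` was treated as a Hugging Face Hub repo id, which is what happens when the value passed to the loader is not an existing local directory. A likely explanation (an assumption, not confirmed in this thread) is that the script expects a local checkpoint folder that has not been downloaded or is being resolved relative to a different working directory. The sketch below uses a hypothetical helper, `check_local_model_dir`, to verify the folder before training starts.

```python
import os


def check_local_model_dir(path):
    """Return the absolute path if `path` is a directory containing config.json.

    Raises FileNotFoundError otherwise, so the problem surfaces before the
    loader falls back to looking the name up on the Hub.
    """
    abs_path = os.path.abspath(os.path.expanduser(path))
    config_file = os.path.join(abs_path, "config.json")
    if not os.path.isfile(config_file):
        raise FileNotFoundError(
            f"No config.json under {abs_path!r}; download the checkpoint "
            "or pass the full path to the local folder."
        )
    return abs_path


if __name__ == "__main__":
    # The folder name is taken from the error message; its location is assumed.
    try:
        print("Found model at", check_local_model_dir("CausalVAEModel_4x8x8"))
    except FileNotFoundError as err:
        print(err)
```

If the check fails, pointing the script's model path argument at the absolute location of the downloaded checkpoint folder is the first thing to try.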
Taldhi commented 7 months ago

@LinB203, please take another look at this error and suggest a fix. It would really help me proceed. Thank you for your time.
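One possible cause of the `RuntimeError: cutlassF: no kernel found to launch!` reported above is a precision mismatch with the GPU: the Quadro GV100 is a Volta card (compute capability 7.0), while native bf16 needs compute capability 8.0 (Ampere) or newer. Whether the training script actually requests bf16 is an assumption here, not something the thread states. The sketch below, built around a hypothetical helper `pick_mixed_precision`, shows how the choice could be derived from the device's compute capability.

```python
def pick_mixed_precision(compute_capability):
    """Choose a mixed-precision mode from a CUDA compute capability tuple.

    Native bf16 arithmetic needs compute capability 8.0 (Ampere) or newer;
    older GPUs such as Volta (7.0) get fp16 instead.
    """
    major, _minor = compute_capability
    return "bf16" if major >= 8 else "fp16"


if __name__ == "__main__":
    try:
        import torch  # optional: only needed to query the local GPU
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
        print("compute capability:", capability)
        print("suggested mixed precision:", pick_mixed_precision(capability))
    else:
        print("No CUDA device visible; nothing to check.")
```

For a GV100, `pick_mixed_precision((7, 0))` returns `"fp16"`. If the launch script passes `--mixed_precision` to `accelerate launch`, switching that value to fp16 is a low-cost experiment to see whether the attention kernel error goes away.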