run error - Githubissues

tanshuai0219 commented 1 month ago

torchrun --standalone --nproc_per_node 1 scripts/inference.py --config configs/mvdit/inference/16x512x512.py
/root/miniconda3/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/root/miniconda3/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/root/miniconda3/lib/python3.10/site-packages/torch/utils/_pytree.py:254: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
[2024-07-08 11:13:11,432] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/root/miniconda3/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/root/miniconda3/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/root/miniconda3/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/mine_workspace/Mirage/repos/diffusers/src/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
Config (path: configs/mvdit/inference/16x512x512.py): {'num_frames': 16, 'fps': 8, 'image_size': (512, 512), 'model': {'type': 'MVDiT-XL/2', 'space_scale': 1.0, 'time_scale': 1.0, 'enable_flashattn': True, 'enable_layernorm_kernel': True, 'from_pretrained': '/mnt_alipayshnas/youtai.ts/checkpoints/OpenVid/MVDiT-16×512×512.pt'}, 'vae': {'type': 'VideoAutoencoderKL', 'from_pretrained': '/mnt_alipayshnas/youtai.ts/checkpoints/sd-vae-ft-ema/stabilityai__sd-vae-ft-ema', 'micro_batch_size': 2}, 'text_encoder': {'type': 't5', 'from_pretrained': '/mnt_alipayshnas/youtai.ts/checkpoints/t5-v1_1-xxl/t5-v1_1-xxl', 'model_max_length': 120}, 'scheduler': {'type': 'iddpm', 'num_sampling_steps': 100, 'cfg_scale': 7.0}, 'dtype': 'fp16', 'batch_size': 2, 'seed': 42, 'prompt_path': './assets/texts/evalcrafter.txt', 'start_idx': 0, 'end_idx': 700, 'save_dir': './outputs/samples/', 'multi_resolution': False}
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████| 2/2 [04:23<00:00, 131.92s/it]
Loading /mnt_alipayshnas/youtai.ts/checkpoints/OpenVid/MVDiT-16×512×512.pt
Missing keys: ['pos_embed', 'pos_embed_temporal']
Unexpected keys: []
/ossfs/workspace/py310/workspace/OpenVid-1M/openvid/models/text_encoder/t5.py:163: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
  caption = BeautifulSoup(caption, features="html.parser").text
  0%| | 0/100 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/ossfs/workspace/py310/workspace/OpenVid-1M/scripts/inference.py", line 107, in <module>
    main()
  File "/ossfs/workspace/py310/workspace/OpenVid-1M/scripts/inference.py", line 87, in main
    samples = scheduler.sample(
  File "/ossfs/workspace/py310/workspace/OpenVid-1M/openvid/schedulers/iddpm/__init__.py", line 77, in sample
    samples = self.p_sample_loop(
  File "/ossfs/workspace/py310/workspace/OpenVid-1M/openvid/schedulers/iddpm/gaussian_diffusion.py", line 437, in p_sample_loop
    for sample in self.p_sample_loop_progressive(
  File "/ossfs/workspace/py310/workspace/OpenVid-1M/openvid/schedulers/iddpm/gaussian_diffusion.py", line 488, in p_sample_loop_progressive
    out = self.p_sample(
  File "/ossfs/workspace/py310/workspace/OpenVid-1M/openvid/schedulers/iddpm/gaussian_diffusion.py", line 391, in p_sample
    out = self.p_mean_variance(
  File "/ossfs/workspace/py310/workspace/OpenVid-1M/openvid/schedulers/iddpm/respace.py", line 94, in p_mean_variance
    return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
  File "/ossfs/workspace/py310/workspace/OpenVid-1M/openvid/schedulers/iddpm/gaussian_diffusion.py", line 270, in p_mean_variance
    model_output = model(x, t, **model_kwargs)
  File "/ossfs/workspace/py310/workspace/OpenVid-1M/openvid/schedulers/iddpm/respace.py", line 127, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/ossfs/workspace/py310/workspace/OpenVid-1M/openvid/schedulers/iddpm/__init__.py", line 94, in forward_with_cfg
    model_out = model.forward(combined, timestep, y, **kwargs)
  File "/ossfs/workspace/py310/workspace/OpenVid-1M/openvid/models/mvdit/mvdit.py", line 331, in forward
    x, y = auto_grad_checkpoint(block, x, y, t0, t_y, t0_tmep, t_y_tmep, mask, tpe)
  File "/ossfs/workspace/py310/workspace/OpenVid-1M/openvid/acceleration/checkpoint.py", line 24, in auto_grad_checkpoint
    return module(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/ossfs/workspace/py310/workspace/OpenVid-1M/openvid/models/mvdit/mvdit.py", line 162, in forward
    x = x + self.cross_attn(x, y, mask)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/ossfs/workspace/py310/workspace/OpenVid-1M/openvid/models/layers/blocks.py", line 399, in forward
    attn_bias[attn_bias==0] = exp
RuntimeError: value cannot be converted to type at::Half without overflow

I get this error when running mvdit. Could you give some advice? When I run stdit, it generates videos successfully.

CSRuiXie commented 1 month ago

Thank you for your attention to our work. You can set the dtype in the config to fp32, and then it should work.
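For context, the traceback above fails while writing a value into an fp16 (`at::Half`) tensor. A minimal sketch of why fp32 sidesteps this, using only the standard library's half-precision packing instead of PyTorch. The fill value `-1e9` is a hypothetical stand-in, since the thread does not show what `exp` holds in `blocks.py`:

```python
import struct


def fits(value: float, fmt: str) -> bool:
    """Return True if `value` can be stored in the given struct float format."""
    try:
        struct.pack(fmt, value)
        return True
    except (OverflowError, struct.error):
        return False


# Hypothetical mask fill value; NOT taken from the repository's code.
FILL = -1e9

print("fits in fp16:", fits(FILL, "<e"))  # "<e" is IEEE half precision
print("fits in fp32:", fits(FILL, "<f"))  # "<f" is IEEE single precision
```

Under that assumption, the value is far outside the fp16 range (largest finite magnitude 65504) but well inside the fp32 range, which is consistent with the error disappearing once `dtype` is no longer `fp16`.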

madhuvanthp commented 1 month ago

> Thank you for your attention to our work. You can set the dtype in the config to fp32, and then it should work.

inference.py: error: unrecognized arguments: configs/mvdit/inference/16x512x512.py
E0708 19:42:19.365000 139992540640320 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 2)

I still get this error after changing the dtype to fp32 in the config file you specified.
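One way this particular message can arise, sketched with a hypothetical stand-in parser: if the script takes the config path through a `--config` option (as the command in the first post suggests) and the path is passed without that flag, `argparse` rejects it as an unrecognized argument and exits with code 2. Whether that is what happened here is an assumption; the failing command is not shown in the thread.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the script's parser; only --config is modeled.
    parser = argparse.ArgumentParser(prog="inference.py")
    parser.add_argument("--config", default=None, help="path to a config file")
    return parser


def exit_code(argv: list[str]) -> int:
    """Parse argv and return 0 on success, or argparse's exit code on failure."""
    try:
        build_parser().parse_args(argv)
        return 0
    except SystemExit as exc:
        return int(exc.code)


path = "configs/mvdit/inference/16x512x512.py"
print(exit_code(["--config", path]))  # path given through the option
print(exit_code([path]))              # path given positionally, without --config
```

If this is the cause, the dtype change is unrelated to the new error, and rerunning with `--config` in front of the path would be the first thing to try.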

raccooncoder commented 1 month ago

I switched dtype to bf16 and it worked; makes sense, because the model was trained in bf16.
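A short sketch of why bf16 can behave differently from fp16 here: bf16 keeps the 8 exponent bits of fp32 and gives up mantissa bits instead, so its range is close to fp32's. The helper name `max_finite` is made up for illustration; the bit widths are the standard ones for each format.

```python
def max_finite(exp_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style binary float with the given widths."""
    bias = 2 ** (exp_bits - 1) - 1          # also the largest usable exponent
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias


formats = {
    "fp16": (5, 10),   # 1 sign, 5 exponent, 10 mantissa bits
    "bf16": (8, 7),    # 1 sign, 8 exponent, 7 mantissa bits
    "fp32": (8, 23),   # 1 sign, 8 exponent, 23 mantissa bits
}

for name, (e, m) in formats.items():
    print(f"{name}: max finite value = {max_finite(e, m):.6g}")
```

The trade-off is precision: bf16 has fewer mantissa bits than fp16, so it represents large magnitudes without overflow but with coarser steps between values.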

SamitM1 commented 1 month ago

> I switched dtype to bf16 and it worked; makes sense, because the model was trained in bf16.

How did you install apex? The command provided in this repo gives the following error for me:


  subprocess.CalledProcessError: Command '['which', 'g++']' returned non-zero exit status 1.
  error: subprocess-exited-with-error

  × Building wheel for apex (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /home/samit/anaconda3/envs/openvid/bin/python /home/samit/anaconda3/envs/openvid/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py build_wheel /tmp/tmpujlo36a8
  cwd: /tmp/pip-req-build-5l72eosr
  Building wheel for apex (pyproject.toml) ... error
  ERROR: Failed building wheel for apex
Failed to build apex
ERROR: Could not build wheels for apex, which is required to install pyproject.toml-based projects

I would greatly appreciate some help troubleshooting this.
If you used a different command to install apex, could you please provide it? @tanshuai0219 @raccooncoder
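The first line of the log shows `which g++` returning a non-zero exit status, which suggests that no `g++` executable is on the `PATH` of that environment. A minimal sketch for checking this before retrying the build; the helper name `missing_tools` is made up for illustration, and any tool beyond `g++` is an assumption about what the build needs:

```python
import shutil


def missing_tools(tools=("g++",)) -> list[str]:
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required build tools were found.")
```

If `g++` is reported missing, installing a C++ compiler through the system's package manager and rerunning the apex install command is the likely next step.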

NJU-PCALab / OpenVid-1M

run error #4