THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

Cannot Be Installed, Dependencies Missing or Incorrect #453

Open GarbageHaus opened 6 days ago

GarbageHaus commented 6 days ago

System Info

Windows 10, NVIDIA RTX 4090, Python 3.10.7

Information

Reproduction

Steps to reproduce:

  1. Create a new virtual environment with python -m venv venv
  2. Activate it through venv\Scripts\activate.bat
  3. Install the requirements file using pip install -r requirements.txt

The errors change depending on which workarounds are used.

Using the above steps: ❌ Error from DeepSpeed: "Unable to pre-compile ops without torch installed. Please install torch before attempting to pre-compile ops." (This appears to be a problem on DeepSpeed's end, but there is no known solution.)

Using the steps found on https://huggingface.co/THUDM/CogVideoX-5b-I2V:
⚠️ Error: "URLs must start with http://" (this version of diffusers doesn't support local files).
✔️ Resolved by running a localhost server.
❌ New error:

 File "D:\AI\CogVideo\venv\lib\site-packages\transformers\utils\import_utils.py", line 1639, in requires_backends
    raise ImportError("".join(failed))
ImportError:
T5Tokenizer requires the SentencePiece library but it was not found in your environment.

Installing the requirements again with --no-deps:
❌ pip install -r requirements.txt --no-deps results in removal/lack of CUDA.
✔️ Resolved by running: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
⚠️ Red warning: "swissarmytransformer 0.4.12 requires boto3, which is not installed."
✔️ Script begins running, downloads resources, runs generation (~12 minutes).
❌ Error:

Traceback (most recent call last):
  File "D:\AI\CogVideo\my_test.py", line 18, in <module>
    video = pipe(
  File "D:\AI\CogVideo\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\AI\CogVideo\venv\lib\site-packages\diffusers\pipelines\cogvideo\pipeline_cogvideox_image2video.py", line 826, in __call__
    video = self.decode_latents(latents)
  File "D:\AI\CogVideo\venv\lib\site-packages\diffusers\pipelines\cogvideo\pipeline_cogvideox_image2video.py", line 406, in decode_latents
    frames = self.vae.decode(latents).sample
  File "D:\AI\CogVideo\venv\lib\site-packages\diffusers\utils\accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "D:\AI\CogVideo\venv\lib\site-packages\diffusers\models\autoencoders\autoencoder_kl_cogvideox.py", line 1278, in decode
    decoded = self._decode(z).sample
  File "D:\AI\CogVideo\venv\lib\site-packages\diffusers\models\autoencoders\autoencoder_kl_cogvideox.py", line 1235, in _decode
    return self.tiled_decode(z, return_dict=return_dict)
  File "D:\AI\CogVideo\venv\lib\site-packages\diffusers\models\autoencoders\autoencoder_kl_cogvideox.py", line 1431, in tiled_decode
    tile, conv_cache = self.decoder(tile, conv_cache=conv_cache)
  File "D:\AI\CogVideo\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\CogVideo\venv\lib\site-packages\diffusers\models\autoencoders\autoencoder_kl_cogvideox.py", line 963, in forward
    hidden_states, new_conv_cache["mid_block"] = self.mid_block(
  File "D:\AI\CogVideo\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\CogVideo\venv\lib\site-packages\diffusers\models\autoencoders\autoencoder_kl_cogvideox.py", line 529, in forward
    hidden_states, new_conv_cache[conv_cache_key] = resnet(
  File "D:\AI\CogVideo\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\CogVideo\venv\lib\site-packages\diffusers\models\autoencoders\autoencoder_kl_cogvideox.py", line 291, in forward
    hidden_states, new_conv_cache["norm1"] = self.norm1(hidden_states, zq, conv_cache=conv_cache.get("norm1"))
  File "D:\AI\CogVideo\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\CogVideo\venv\lib\site-packages\diffusers\models\autoencoders\autoencoder_kl_cogvideox.py", line 177, in forward
    z_first = F.interpolate(z_first, size=f_first_size)
  File "D:\AI\CogVideo\venv\lib\site-packages\torch\nn\functional.py", line 3933, in interpolate
    return torch._C._nn.upsample_nearest3d(input, output_size, scale_factors)
RuntimeError: "upsample_nearest3d_out_frame" not implemented for 'BFloat16'

❌ No output file.
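The final RuntimeError means this torch build has no BFloat16 kernel for upsample_nearest3d, which F.interpolate dispatches to for 5-D tensors. A minimal sketch of the failure shape and a cast-around-it workaround (the tensor sizes here are made up for illustration; at the pipeline level the equivalent fix is loading or casting the VAE to float16/float32 instead of bfloat16):

```python
import torch
import torch.nn.functional as F

# A 5-D (N, C, D, H, W) tensor makes F.interpolate dispatch to
# upsample_nearest3d, the kernel the traceback says is not implemented
# for BFloat16 in this torch build.
z = torch.randn(1, 3, 2, 8, 8, dtype=torch.bfloat16)

# Workaround: cast to a supported dtype before interpolating, cast back after.
out = F.interpolate(z.float(), size=(4, 16, 16)).to(torch.bfloat16)
print(out.shape, out.dtype)  # torch.Size([1, 3, 4, 16, 16]) torch.bfloat16
```

In the pipeline's context the same idea would look like casting the VAE before calling it, e.g. `pipe.vae.to(torch.float16)` (whether that suffices depends on the diffusers version), or moving to a torch build whose kernels cover BFloat16.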

Expected behavior

In a new virtual environment, pip install -r requirements.txt should install the correct dependencies and allow use of any default script, or any script found on HuggingFace.

GarbageHaus commented 4 days ago

I would like to add that I was able to get this to work after installing CUDA/cu121. Regardless, the dependencies should be looked at if there is no "stable" branch.
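For anyone hitting the same wall, the working setup described here would be (index URL assumed by analogy with the cu117 command earlier in this issue):

```shell
# Install CUDA 12.1 builds of torch/torchvision/torchaudio; a newer build
# appears to cover the BFloat16 kernel the cu117 build lacked.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```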