Vchitect / Vchitect-2.0

Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models
https://vchitect.intern-ai.org.cn/
Apache License 2.0

Some questions and requests for confirmation #1

Open zRzRzRzRzRzRzR opened 1 week ago

zRzRzRzRzRzRzR commented 1 week ago

Dear Development Team,

Hello, I have successfully installed the model and run it according to the requirements in the README, but I have encountered some issues and look forward to your response.

  1. The repository does not seem to provide example prompt requirements, such as the language and how to structure the length properly. By reading the code, I gathered the following information:

     * No negative_prompt
     * The length of the input prompt should be < 77 tokens (CLIP)
     * The input must be in English

In this case, I am unsure how to structure the prompt, so I simply wrote a prompt:

A little girl is riding a bicycle at high speed. Focused, detailed, realistic.

and set the seed to 42:

import os, random
import numpy as np
import torch

def set_seed(seed):
    # Seed every RNG source for reproducibility; use the argument rather than a hard-coded 42.
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)

I set the output to 720x480 according to the README, and configured it as follows:

video = pipe(
    prompt,
    negative_prompt="",
    num_inference_steps=50,
    guidance_scale=7.5,
    width=768,
    height=432,  # other options: 480x288, 624x352, 432x240, 768x432
    frames=8*20,
)

It occupied 67904 MiB of GPU memory. The other parameters remained unchanged, with 50 sampling steps. The final video can be found here:

https://github.com/user-attachments/assets/61e84c11-490e-42ab-9703-3d78f7c1119a

Is this the expected result?

  2. I did not see any relevant details about I2V in the code, nor any place where an image can be used as input. Should I understand that this open-source model is a T2V model?

  3. It seems that there is no parameter to control the frame rate.

However, the video I generated plays at only 8 fps, with a total of 40 frames, as verified using the following command:

ffprobe -v 0 -of csv=p=0 -select_streams v:0 -show_entries stream=r_frame_rate sample_1_seed0.mp4

8/1


Is it because the open-source model only outputs at 8 frames per second?

Additionally, there may be some issues in the code within the repository:

  1. In https://github.com/Vchitect/Vchitect-2.0/blob/0ef47a5d3368a1a7b235d7c3511be24f5febf791/models/pipeline.py#L198, the device should be set to "cuda", or device = "cuda" should be added in https://github.com/Vchitect/Vchitect-2.0/blob/0ef47a5d3368a1a7b235d7c3511be24f5febf791/inference.py#L15. Otherwise, a "tensors are not on the same device" error occurs in the positional-embedding step (see the sketch after this list).

  2. https://github.com/Vchitect/Vchitect-2.0/tree/master/models/__pycache__ Should this be deleted? It seems unnecessary.
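For item 1, a minimal sketch of the inference.py variant of the fix (abridged; the surrounding code and the pipeline's import path are whatever inference.py already uses):

# Abridged sketch of inference.py; only the changed line matters here.
def infer(args):
    # Passing the device explicitly keeps the positional-embedding tensors on the GPU.
    pipe = VchitectXLPipeline(args.ckpt_path, device="cuda")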

Looking forward to your response.

foreverpiano commented 6 days ago

Same question; it is currently hard to reproduce.

WeichenFan commented 6 days ago

Hi ZR,

Thanks for your interest!

The repository does not seem to provide example prompt requirements, such as the language and how to structure the length properly. By reading the code, I gathered the following information: no negative_prompt; the length of the input prompt should be < 77 tokens (CLIP); the input must be in English.

Regarding the video length, we apologize that the current open-source version only supports videos shorter than 10 seconds. The timeline for open-sourcing additional versions, including the I2V model, is still undecided.
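In practice this bounds the frames argument roughly as follows (a sketch, assuming the pipeline produces 8 frames per second of video, as discussed in this thread):

# Rough rule of thumb (assumption: 8 generated frames per second of video)
fps = 8
duration_s = 10                # the current open-source version supports clips under ~10 s
frames = fps * duration_s      # value passed as the pipeline's `frames` argument, i.e. 80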

fahdmirza commented 5 days ago

@zRzRzRzRzRzRzR could you kindly share your full steps, from start to end, of how you created this video? How did you download the checkpoints, what library versions did you use, etc.? I am trying to follow their README but it is incomplete. I am getting the following error:

from models.VchitectXL import VchitectXLTransformerModel
  File "/home/Ubuntu/Vchitect-2.0/models/VchitectXL.py", line 34, in <module>
    from torch.distributed.tensor.parallel import (
ImportError: cannot import name 'PrepareModuleOutput' from 'torch.distributed.tensor.parallel' (/home/Ubuntu/miniconda3/envs/VchitectXL/lib/python3.11/site-packages/torch/distributed/tensor/parallel/__init__.py)
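For reference, a quick check (just a sketch) of whether the installed torch build exposes that symbol; if it does not, the torch version in this environment is most likely too old for the repo's tensor-parallel code:

import torch

try:
    from torch.distributed.tensor.parallel import PrepareModuleOutput  # noqa: F401
    print("PrepareModuleOutput available, torch", torch.__version__)
except ImportError:
    print("PrepareModuleOutput missing, torch", torch.__version__,
          "- this build is likely too old for Vchitect-2.0")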

fakerybakery commented 5 days ago

Regarding the video length, we apologize that the current open-source version only supports videos shorter than 10 seconds. The timeline for open-sourcing additional versions, including the I2V model, is still undecided.

Hi, just to clarify: were the examples on the webpage generated using a different model than the open-source one? Thanks

zRzRzRzRzRzRzR commented 5 days ago

@zRzRzRzRzRzRzR could you kindly share your full steps, from start to end, of how you created this video? How did you download the checkpoints, what library versions did you use, etc.? I am trying to follow their README but it is incomplete. I am getting the following error:

from models.VchitectXL import VchitectXLTransformerModel
  File "/home/Ubuntu/Vchitect-2.0/models/VchitectXL.py", line 34, in <module>
    from torch.distributed.tensor.parallel import (
ImportError: cannot import name 'PrepareModuleOutput' from 'torch.distributed.tensor.parallel' (/home/Ubuntu/miniconda3/envs/VchitectXL/lib/python3.11/site-packages/torch/distributed/tensor/parallel/__init__.py)

I will upload it later.

zRzRzRzRzRzRzR commented 4 days ago

Step 1: test.txt has only one line:

A little girl is riding a bicycle at high speed. Focused, detailed, realistic.

Step 2: modify the code in inference.py:

def infer(args):
    pipe = VchitectXLPipeline(args.ckpt_path)
    idx = 0

Change to:

def infer(args):
    pipe = VchitectXLPipeline(args.ckpt_path, device="cuda")
    idx = 0

Step 3: if you want to change the number of frames (that is, the length of the generated video):

with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    video = pipe(
        prompt,
        negative_prompt="",
        num_inference_steps=50,
        guidance_scale=7.5,
        width=768,
        height=432,  # other options: 480x288, 624x352, 432x240, 768x432
        frames=10*8,  # change here: seconds * 8 (the model generates 8 frames per second)
    )

Step 4: run the program:

CUDA_VISIBLE_DEVICES=8 python inference.py --test_file test.txt --save_dir output --ckpt_path Vchitect-XL-2B

(pass --ckpt_path the absolute path of the model you downloaded)

This will run; I believe it will help you.
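Putting the steps together, here is a minimal sketch of the modified inference loop (the prompt-file handling and the import path are assumptions for illustration; the real inference.py already has its own versions of both, and it also saves the result under --save_dir):

import torch
from models.pipeline import VchitectXLPipeline  # assumed import path

def infer(args):
    pipe = VchitectXLPipeline(args.ckpt_path, device="cuda")   # Step 2: pass the device explicitly
    with open(args.test_file) as f:                            # assumption: one prompt per line
        prompts = [line.strip() for line in f if line.strip()]
    for idx, prompt in enumerate(prompts):
        with torch.cuda.amp.autocast(dtype=torch.bfloat16):
            video = pipe(
                prompt,
                negative_prompt="",
                num_inference_steps=50,
                guidance_scale=7.5,
                width=768,
                height=432,
                frames=10 * 8,   # Step 3: seconds * 8 generated frames per second
            )
        # Saving is repo-specific; inference.py writes each result under args.save_dir.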

zRzRzRzRzRzRzR commented 4 days ago

Hi ZR,

Thanks for your interest!

The repository does not seem to provide example prompt requirements, such as the language and how to structure the length properly. By reading the code, I gathered the following information: no negative_prompt; the length of the input prompt should be < 77 tokens (CLIP); the input must be in English.

* We will add more info to the README ASAP.

  * A negative prompt is supported;
  * The input prompt can be longer than 77 tokens (T5 accepts longer prompts; CLIP cannot, but that is fine);
  * Yes, the input must be in English.

Regarding the video length, we apologize that the current open-source version only supports videos shorter than 10 seconds. The timeline for open-sourcing additional versions, including the I2V model, is still undecided.

How can I change the frame rate? The video currently generates at 8 frames per second.
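Until an fps parameter is exposed, one possible workaround is to re-save an already generated clip at a different playback rate, for example (a sketch, assuming imageio with the ffmpeg backend is installed; file names are just examples):

import imageio.v2 as imageio

# Read the frames of an already generated clip.
reader = imageio.get_reader("sample_1_seed0.mp4")
frames = [frame for frame in reader]
reader.close()

# Write the same frames back out at a different playback frame rate.
writer = imageio.get_writer("sample_1_seed0_24fps.mp4", fps=24)
for frame in frames:
    writer.append_data(frame)
writer.close()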