genmoai / mochi

The best OSS video generation models
Apache License 2.0

diffusers support #41

Open feizc opened 3 weeks ago

feizc commented 3 weeks ago

Hi, thanks for the great work!

I used the diffusers conversion script to convert the checkpoint, and it only takes a few lines of code to generate excellent results.

A simple example:

from diffusers import MochiPipeline
from diffusers.utils import export_to_video
import torch

# Path to the converted checkpoint: a local directory or the Hugging Face repo id below
model_path = "feizhengcong/mochi-1-preview-diffusers"

pipe = MochiPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
frames = pipe(
    prompt,
    num_inference_steps=50,
    guidance_scale=4.5,
    num_frames=61,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(frames, "mochi.mp4")

The Hugging Face checkpoint can be downloaded from: https://huggingface.co/feizhengcong/mochi-1-preview-diffusers
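
To pre-fetch the weights instead of downloading them lazily at load time, a minimal sketch with huggingface_hub:

from huggingface_hub import snapshot_download

# Downloads every shard of the converted checkpoint and returns the local path,
# which can then be passed to MochiPipeline.from_pretrained as model_path.
model_path = snapshot_download("feizhengcong/mochi-1-preview-diffusers")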

Some results:

https://github.com/user-attachments/assets/8b7b6f32-c4e3-4d29-afe8-c4607ec216a7

https://github.com/user-attachments/assets/fc8f828a-5f46-4d9c-b7c7-2ba8a32d5718

johnwick123f commented 3 weeks ago

Thanks! One question: how fast is it, and on what device?

feizc commented 3 weeks ago

Thanks! One question: how fast is it, and on what device?

It takes about 4 minutes to generate a 6 s video on one A100, using about 36 GB of memory.

(screenshots: inference time and GPU memory usage)
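
If 36 GB does not fit your GPU, diffusers exposes generic memory-saving switches that should also work here (a hedged sketch; the exact savings and the speed penalty will vary with your diffusers version):

from diffusers import MochiPipeline
import torch

pipe = MochiPipeline.from_pretrained(
    "feizhengcong/mochi-1-preview-diffusers", torch_dtype=torch.bfloat16
)

# Keep submodules on CPU and move each one to the GPU only for its forward pass
# (trades generation speed for a much lower peak VRAM footprint).
pipe.enable_model_cpu_offload()

# Decode the video latents tile by tile to cap the VAE's peak memory.
pipe.enable_vae_tiling()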

zishen-ucap commented 3 weeks ago

I found that the diffusers inference results differ from genmo's, and with the same number of frames and denoising steps it takes about 26 minutes... I don't know whether my parameters are wrong. Could you take a look?

from diffusers import MochiPipeline
from diffusers.utils import export_to_video
import torch

model_path = "mochi-1-preview-diffusers"

pipe = MochiPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
pipe.to("cuda")
prompt = "A stark white cat with piercing green eyes stealthily creeps along the glossy, dark wooden floorboards of an opulent European-style parlor. Intricate wainscoting and luxurious wallpaper frame the scene. In a heart-racing moment, the cat pounces with precision towards a unsuspecting brown mouse near an ornate, mahogany furniture piece. The high-definition footage captures every nuanced movement, from the flutter of the cat's whiskers to the mouse's flickering tail, as an antique grandfather clock ticks rhythmically in the background. The elegant room's vintage charm contrasts with the primal, instinctive encounter unfolding within it."
frames = pipe(
    prompt,
    num_inference_steps=99,
    guidance_scale=4.5,
    num_frames=163,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(frames, "mochi.mp4")

https://github.com/user-attachments/assets/b9cec603-5ae8-40cd-bec5-9bd9ecfb6479

Above are the diffusers parameters and inference result.

@click.command()
@click.option("--prompt", required=True, help="Prompt for video generation.")
@click.option("--negative_prompt", default="", help="Negative prompt for video generation.")
@click.option("--width", default=848, type=int, help="Width of the video.")
@click.option("--height", default=480, type=int, help="Height of the video.")
@click.option("--num_frames", default=163, type=int, help="Number of frames.")
@click.option("--seed", default=42, type=int, help="Random seed.")
@click.option("--cfg_scale", default=4.5, type=float, help="CFG Scale.")
@click.option("--num_steps", default=100, type=int, help="Number of inference steps.")
@click.option("--model_dir", required=True, help="Path to the model directory.")

https://github.com/user-attachments/assets/3276da72-3bea-493a-8141-a267c2813af5

Above are the original genmo parameters and inference result (from https://github.com/genmoai/models/blob/main/demos/cli.py).
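
For reference, the matching genmo CLI invocation should look roughly like this (a sketch assembled from the click options above; the model directory is a placeholder):

python3 demos/cli.py \
    --prompt "A stark white cat with piercing green eyes ..." \
    --width 848 --height 480 \
    --num_frames 163 --num_steps 100 \
    --cfg_scale 4.5 --seed 42 \
    --model_dir <path_to_genmo_weights>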

feizc commented 3 weeks ago

Hi, you can take a look at the other issues. The author said they "up sample" the prompt, which has a significant impact on the final generated result. A simple approach is to use an LLM to expand the prompt and generate more beautiful videos :)
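
As an illustration only, prompt "up sampling" can be approximated with the transformers text-generation pipeline; the model name and system instruction below are my assumptions, not something from this repo (requires a transformers version that supports chat-format pipeline inputs):

from transformers import pipeline

# Hypothetical model choice; substitute any instruction-tuned chat LLM you have access to.
upsampler = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "Rewrite the user's short video prompt into one richly "
                                  "detailed paragraph covering subject, setting, lighting, "
                                  "and camera movement."},
    {"role": "user", "content": "A white cat chases a mouse in an elegant parlor."},
]

# The assistant's reply replaces the short prompt when calling MochiPipeline.
expanded_prompt = upsampler(messages, max_new_tokens=256)[0]["generated_text"][-1]["content"]
print(expanded_prompt)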

LPengYang commented 3 weeks ago

Hello, I observed that the checkpoint files in the diffusers version are only about half the size of the vanilla version. What causes this difference?

HanLiii commented 3 weeks ago

Hi @feizc, thanks for the great work!

I failed to load the pipeline using the following code:

from diffusers import MochiPipeline
import torch

pipe = MochiPipeline.from_pretrained("feizhengcong/mochi-1-preview-diffusers", torch_dtype=torch.bfloat16)
pipe.to("cuda")

The following is the error message:

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  2.50it/s]
Loading pipeline components...:  60%|██████████████████████████████████████████████████████████▊                                       | 3/5 [00:59<00:39, 19.70s/it]
Traceback (most recent call last):
  File "/data/hli358/envs/mochi/lib/python3.10/site-packages/diffusers/models/model_loading_utils.py", line 140, in load_state_dict
    file_extension = os.path.basename(checkpoint_file).split(".")[-1]
  File "/data/hli358/envs/mochi/lib/python3.10/posixpath.py", line 142, in basename
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/hli358/han/mochi/download.py", line 4, in <module>
    pipe = MochiPipeline.from_pretrained("feizhengcong/mochi-1-preview-diffusers", torch_dtype=torch.bfloat16)
  File "/data/hli358/envs/mochi/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/data/hli358/envs/mochi/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py", line 896, in from_pretrained
    loaded_sub_model = load_sub_model(
  File "/data/hli358/envs/mochi/lib/python3.10/site-packages/diffusers/pipelines/pipeline_loading_utils.py", line 704, in load_sub_model
    loaded_sub_model = load_method(os.path.join(cached_folder, name), **loading_kwargs)
  File "/data/hli358/envs/mochi/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/data/hli358/envs/mochi/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 940, in from_pretrained
    state_dict = load_state_dict(model_file, variant=variant)
  File "/data/hli358/envs/mochi/lib/python3.10/site-packages/diffusers/models/model_loading_utils.py", line 152, in load_state_dict
    with open(checkpoint_file) as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType

zishen-ucap commented 3 weeks ago

Do I understand correctly that you mean prompt optimization? But as far as I can tell, the prompt is fed into T5 without any extra processing:

def get_conditioning(self, prompts, *, zero_last_n_prompts: int):
    B = len(prompts)
    assert (
        0 <= zero_last_n_prompts <= B
    ), f"zero_last_n_prompts should be between 0 and {B}, got {zero_last_n_prompts}"
    tokenize_kwargs = dict(
        prompt=prompts,
        padding="max_length",
        return_tensors="pt",
        truncation=True,
    )
    print(f"Prompt before feeding into T5: {prompts}")  # debug print I added
    t5_toks = self.t5_tokenizer(**tokenize_kwargs, max_length=MAX_T5_TOKEN_LENGTH)

The output is:

(T2VSynthMochiModel pid=616962) Prompt before feeding into T5: ["A stark white cat with piercing green eyes stealthily creeps along the glossy, dark wooden floorboards of an opulent European-style parlor. Intricate wainscoting and luxurious wallpaper frame the scene. In a heart-racing moment, the cat pounces with precision towards a unsuspecting brown mouse near an ornate, mahogany furniture piece. The high-definition footage captures every nuanced movement, from the flutter of the cat's whiskers to the mouse's flickering tail, as an antique grandfather clock ticks rhythmically in the background. The elegant room's vintage charm contrasts with the primal, instinctive encounter unfolding within it.", ''] [repeated 3x across cluster]

Or does the genmo open-source project include this "up sample" method somewhere and I just missed it? Could you tell me where it is?

feizc commented 3 weeks ago

Hi @HanLiii, since the VAE encoder is now provided, the conversion script has been updated and I have re-uploaded the checkpoints. You can try again now :)
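
If the old shards are already in your local cache, you may need to force a fresh download when reloading (a minimal sketch; force_download is a standard Hugging Face Hub loading option accepted by from_pretrained):

from diffusers import MochiPipeline
import torch

# Re-fetch the updated checkpoint shards instead of reusing the stale local cache.
pipe = MochiPipeline.from_pretrained(
    "feizhengcong/mochi-1-preview-diffusers",
    torch_dtype=torch.bfloat16,
    force_download=True,
)
pipe.to("cuda")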
