genmoai / models

The best OSS video generation models
Apache License 2.0

Image to video model? #2

Open Hippogriff opened 4 days ago

Hippogriff commented 4 days ago

Does this approach also work in an image-to-video setting?

adenek commented 4 days ago

Probably, but confirmation would be great 😃

yangjiangeyjg commented 3 days ago

The same question!

Vinventive commented 3 days ago

Does this approach also work in an image-to-video setting?

What is coming next? Today, we are releasing the Mochi 1 preview, showcasing the capabilities of our 480p base model. But this is just the beginning. Before the end of the year, we will release the full version of Mochi 1, which includes Mochi 1 HD. Mochi 1 HD will support 720p video generation with enhanced fidelity and even smoother motion, addressing edge cases such as warping in complex scenes.

Looking beyond this release, we are working on image-to-video capabilities. Additionally, we are focused on improving the controllability and steerability of the models to give our users even more precise control over their outputs.

Original Source: https://www.genmo.ai/blog

samrahimi commented 3 days ago

Will Mochi 1 HD continue to be an open source / open weights model? Because I think it's truly brilliant that you decided to open source Mochi 1 - I can totally imagine how that meeting went... Was it something like:

Alice: Are we sure we want to do this? Mochi 1 is incredibly valuable intellectual property and we spent 100k on electricity to train it

Bob: Relax... The thing needs at least 320GB of VRAM to run, so 4 x H100 GPUs are kinda the minimum, even for a simple demo. Do you know how much these cost??

Alice: I thought we solved the VRAM problem over the summer and now Mochi will run on normal hardware, like the old P40 with 24GB of VRAM that one of the engineers got on eBay for $50!

Bob: Shhhh... that's our special quantized version of the model. Eventually some engineer is gonna try and quantize the weights that we release - but it will take a while because nobody's got the VRAM to load the original thing in the first place lmao

ajayjain commented 2 days ago

My name is Ajay @samrahimi

ajayjain commented 2 days ago

Jokes aside, we would love to make Mochi a nice, welcoming home for open source contributions! Please feel free to share PRs or bug reports to help us make the model run faster and on more accessible hardware. That's one of the big advantages of open source.

ajayjain commented 2 days ago

@Hippogriff While in preview, Mochi 1 only supports text-to-video. As a quick hack, you could describe the image extensively, since the T5-XXL encoder supports long prompts, but that is a suboptimal solution because it will lose most of the visual detail. We know I2V is important for the community, so stay tuned.
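
For illustration, a rough sketch of that hack, assuming an off-the-shelf captioner such as BLIP from Hugging Face transformers to produce the long description (the model name and prompt-joining scheme are illustrative and not part of this repo):

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

def image_conditioned_prompt(image_path, user_prompt):
    # Caption the reference image and prepend the description to the user's prompt;
    # the T5-XXL encoder accepts long prompts, so a verbose description fits.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(image, return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=128)
    caption = processor.decode(out[0], skip_special_tokens=True)
    return f"{caption}. {user_prompt}"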

abrichr commented 2 days ago

claude-3-5-sonnet-20240620:


There are a couple of approaches we could take to implement multimodal support for image-to-video generation:

  1. Direct parameter sharing approach:

This would involve modifying the AsymmDiTJoint model to handle both text and image inputs:

Key changes:

class AsymmDiTJoint(nn.Module):
    def __init__(self, ...):
        # Add image encoder 
        self.image_encoder = VisionTransformer(...)

        # Modify other components as needed

    def forward(self, x, sigma, y_feat, y_mask, image):
        # Encode image
        image_features = self.image_encoder(image)

        # Combine text and image features
        combined_features = torch.cat([y_feat, image_features], dim=1)

        # Pass combined features through model
        ...

class AsymmetricAttention(nn.Module):
    def forward(self, x, y, image_features, ...):
        # Process x, y and image_features together
        ...
  2. Indirect approach using GPT-4V:

This approach keeps the existing model architecture and uses GPT-4V as a preprocessing step:

import base64
import openai

def preprocess_image(image_path, user_prompt):
    # Load and encode image
    with open(image_path, "rb") as image_file:
        encoded_image = base64.b64encode(image_file.read()).decode('utf-8')

    # Call GPT-4V 
    response = openai.ChatCompletion.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in great detail:"},
                    {"type": "image_url", "image_url": f"data:image/jpeg;base64,{encoded_image}"}
                ]
            }
        ]
    )

    image_description = response.choices[0].message.content

    # Combine with user prompt
    full_prompt = f"{user_prompt}\nDetailed image description: {image_description}"

    return full_prompt

# Use in main generation pipeline
def generate_video(prompt, image_path, ...):
    full_prompt = preprocess_image(image_path, prompt)
    # Pass full_prompt to existing text-to-video generation
    ...

The direct approach would likely yield better results as it allows the model to learn joint representations of text and images. However, it requires significant model architecture changes and retraining.

The indirect approach is easier to implement as a quick solution, leveraging GPT-4V's strong image understanding capabilities. But it may not capture the visual details as precisely as directly processing the image would.

abrichr commented 2 days ago

Another option:


There's a way to integrate more directly without requiring full retraining of the model. We can use a pre-trained vision model to extract image features and then inject these features into the existing text-to-video pipeline. This approach is often called "feature injection" or "cross-attention injection". Here's how we could implement this:

  1. Use a pre-trained vision model (e.g., CLIP) to extract image features
  2. Project these image features to match the dimensionality of the text features
  3. Concatenate the image features with the text features
  4. Adjust the attention mechanism to handle the additional image tokens

Here's a more detailed implementation approach:

  1. Add an image encoder and feature projector:
import torch.nn as nn
from torchvision import transforms
from torchvision.models import resnet50

class ImageEncoder(nn.Module):
    def __init__(self, output_dim):
        super().__init__()
        self.resnet = resnet50(pretrained=True)
        self.resnet = nn.Sequential(*list(self.resnet.children())[:-1])  # Remove final FC layer
        self.projector = nn.Linear(2048, output_dim)
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])

    def forward(self, image):
        image = self.transform(image).unsqueeze(0)  # Add batch dimension
        features = self.resnet(image)
        features = features.view(features.size(0), -1)  # Flatten
        return self.projector(features)

# Add to AsymmDiTJoint
self.image_encoder = ImageEncoder(hidden_size_y)
  2. Modify the prepare method to handle image input:
def prepare(self, x, sigma, t5_feat, t5_mask, image):
    # ... existing code ...

    y_feat = self.t5_yproj(t5_feat)  # (B, L, t5_feat_dim) --> (B, L, D)

    # Extract and project image features
    image_features = self.image_encoder(image)  # (B, D)
    image_features = image_features.unsqueeze(1)  # (B, 1, D)

    # Concatenate text and image features
    y_feat = torch.cat([y_feat, image_features], dim=1)  # (B, L+1, D)

    return x, c, y_feat, rope_cos, rope_sin
  3. Adjust the forward method to accept an image input:
def forward(self, x, sigma, y_feat, y_mask, image, packed_indices=None, rope_cos=None, rope_sin=None):
    B, _, T, H, W = x.shape

    with sdpa_kernel(torch.nn.attention.SDPBackend.EFFICIENT_ATTENTION):
        x, c, y_feat, rope_cos, rope_sin = self.prepare(x, sigma, y_feat[0], y_mask[0], image)

    # ... rest of the method remains the same ...
  4. Update the attention mechanism in AsymmetricAttention:

The attention mechanism should already be able to handle the additional image token, since it is simply treated as an extra text token. However, you might want to add a special token-type embedding to differentiate between text and image tokens:

class AsymmetricAttention(nn.Module):
    def __init__(self, ...):
        # ... existing code ...
        self.token_type_embedding = nn.Embedding(2, hidden_size_y)

    def forward(self, x, y, ...):
        # ... existing code ...

        # Add token type embeddings
        token_types = torch.zeros(y.shape[:2], dtype=torch.long, device=y.device)
        token_types[:, -1] = 1  # Set the last token (image token) to type 1
        y = y + self.token_type_embedding(token_types)

        # ... rest of the method remains the same ...

This approach allows us to inject image features into the existing text-to-video pipeline without requiring full retraining. The model can now consider both text and image inputs when generating video.

To use this modified model:

from PIL import Image

def generate_video(prompt, image_path, ...):
    image = Image.open(image_path).convert('RGB')
    # ... existing code to process prompt ...

    output = model(x, sigma, y_feat, y_mask, image, ...)

    # ... rest of video generation process ...

This method provides a more direct integration of image information compared to the GPT-4V approach, while still avoiding the need for full model retraining. It allows the model to learn to use both text and image features in its existing attention mechanisms.

abrichr commented 2 days ago

To ingest multiple images and video, and to include video priors, we'll need to extend the model further. Here's an approach to incorporating these elements:

  1. Multiple Image Handling:
  2. Video Input Processing:
  3. Video Priors Integration:

Let's break down each of these components:

  1. Multiple Image Handling:

Modify the ImageEncoder to handle multiple images:

class MultiImageEncoder(nn.Module):
    def __init__(self, output_dim, max_images=5):
        super().__init__()
        self.resnet = resnet50(pretrained=True)
        self.resnet = nn.Sequential(*list(self.resnet.children())[:-1])
        self.projector = nn.Linear(2048, output_dim)
        self.max_images = max_images
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])

    def forward(self, images):
        # images: a list of PIL images; encode up to max_images of them independently
        features_list = []
        for i in range(min(len(images), self.max_images)):
            img = self.transform(images[i]).unsqueeze(0)
            features = self.resnet(img)
            features = features.view(features.size(0), -1)
            features = self.projector(features)
            features_list.append(features)

        # Pad with zeros if less than max_images
        while len(features_list) < self.max_images:
            features_list.append(torch.zeros_like(features_list[0]))

        return torch.stack(features_list, dim=1)  # (B, max_images, D)

# Add to AsymmDiTJoint
self.image_encoder = MultiImageEncoder(hidden_size_y)
  2. Video Input Processing:

Add a video encoder to process video inputs:

import torchvision.models.video as video_models

class VideoEncoder(nn.Module):
    def __init__(self, output_dim, num_frames=16):
        super().__init__()
        self.r3d = video_models.r3d_18(pretrained=True)
        self.r3d = nn.Sequential(*list(self.r3d.children())[:-1])
        self.projector = nn.Linear(512, output_dim)
        self.num_frames = num_frames
        self.transform = transforms.Compose([
            transforms.Resize((112, 112)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.43216, 0.394666, 0.37645], std=[0.22803, 0.22145, 0.216989])
        ])

    def forward(self, video):
        # Assume video is a tensor of shape (B, C, T, H, W)
        B, C, T, H, W = video.shape
        video = video.transpose(1, 2)  # (B, T, C, H, W)

        # Sample or pad to num_frames
        if T > self.num_frames:
            indices = torch.linspace(0, T-1, self.num_frames).long()
            video = video[:, indices]
        elif T < self.num_frames:
            padding = torch.zeros(B, self.num_frames - T, C, H, W, device=video.device)
            video = torch.cat([video, padding], dim=1)

        video = video.transpose(1, 2)  # back to (B, C, T, H, W), as expected by r3d_18
        features = self.r3d(video)
        features = features.view(features.size(0), -1)
        return self.projector(features)  # (B, D)

# Add to AsymmDiTJoint
self.video_encoder = VideoEncoder(hidden_size_y)
  3. Video Priors Integration:

To incorporate video priors, we can add a learned embedding for common video attributes like motion, scene changes, etc. This can be implemented as a set of learnable parameters that are added to the video features:

import torch.nn.functional as F

class VideoPriors(nn.Module):
    def __init__(self, hidden_dim, num_priors=10):
        super().__init__()
        self.priors = nn.Parameter(torch.randn(num_priors, hidden_dim))
        self.projection = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, video_features):
        # Compute attention weights
        attention = torch.matmul(video_features, self.priors.T)
        attention = F.softmax(attention, dim=-1)

        # Weighted sum of priors
        prior_features = torch.matmul(attention, self.priors)

        # Combine with video features
        combined = video_features + self.projection(prior_features)
        return combined

# Add to AsymmDiTJoint
self.video_priors = VideoPriors(hidden_size_y)

Now, let's modify the prepare method to incorporate all these new elements:

def prepare(self, x, sigma, t5_feat, t5_mask, images, video):
    # ... existing code ...

    y_feat = self.t5_yproj(t5_feat)  # (B, L, D)

    # Process multiple images
    image_features = self.image_encoder(images)  # (B, max_images, D)

    # Process video
    video_features = self.video_encoder(video)  # (B, D)

    # Apply video priors
    video_features = self.video_priors(video_features)  # (B, D)

    # Concatenate text, image, and video features
    y_feat = torch.cat([y_feat, image_features.view(B, -1, self.hidden_size_y), video_features.unsqueeze(1)], dim=1)  # (B, L+max_images+1, D)

    return x, c, y_feat, rope_cos, rope_sin

Finally, update the forward method:

def forward(self, x, sigma, y_feat, y_mask, images, video, packed_indices=None, rope_cos=None, rope_sin=None):
    B, _, T, H, W = x.shape

    with sdpa_kernel(torch.nn.attention.SDPBackend.EFFICIENT_ATTENTION):
        x, c, y_feat, rope_cos, rope_sin = self.prepare(x, sigma, y_feat[0], y_mask[0], images, video)

    # ... rest of the method remains the same ...

To use this extended model:

def generate_video(prompt, image_paths, input_video_path, ...):
    images = [Image.open(path).convert('RGB') for path in image_paths]
    video = load_video(input_video_path)  # Implement this function to load video as tensor

    # ... existing code to process prompt ...

    output = model(x, sigma, y_feat, y_mask, images, video, ...)

    # ... rest of video generation process ...
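
A possible load_video helper for the sketch above, assuming torchvision is available and that VideoEncoder expects a float (B, C, T, H, W) tensor as in the code above; this is illustrative rather than an existing function in the repo:

from torchvision.io import read_video

def load_video(path):
    # read_video returns frames as (T, H, W, C) uint8 by default
    frames, _, _ = read_video(path, pts_unit="sec")
    frames = frames.permute(3, 0, 1, 2).float() / 255.0  # (C, T, H, W), scaled to [0, 1]
    return frames.unsqueeze(0)  # (1, C, T, H, W) to match VideoEncoder's input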

This approach allows the model to consider multiple input images, a reference video, and learned video priors when generating new videos. It provides a rich set of conditioning information that can guide the video generation process.

Remember that while this doesn't require retraining the entire model, you may want to fine-tune it to better utilize these new inputs. Also, you'll need to carefully manage the sequence length in the attention mechanisms, as we've significantly increased the number of tokens being processed.
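
One concrete piece of that bookkeeping is the conditioning mask: prepare() now appends max_images + 1 extra tokens to y_feat, so the corresponding mask has to grow as well. A minimal sketch, assuming y_mask is a per-token (B, L) boolean mask as in the existing forward signature:

import torch

def extend_conditioning_mask(y_mask, max_images):
    # Mark the appended image tokens and the single video token as valid
    # so masked attention does not drop them.
    extra = torch.ones(y_mask.shape[0], max_images + 1, dtype=y_mask.dtype, device=y_mask.device)
    return torch.cat([y_mask, extra], dim=1)  # (B, L + max_images + 1)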

abrichr commented 2 days ago

@ajayjain thank you for the excellent work! I wonder if you could comment on the suggested approaches 🙏 I would love to work on this when I get some free time, or at the very least help others to get started.