THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0
8.12k stars 765 forks

Caption Upsampler codes for image-to-video #421

Open czk32611 opened 4 days ago

czk32611 commented 4 days ago

Feature request

In the paper, the authors give an example of a caption upsampler for inference. It would be great if the authors could provide the exact code, especially for image-to-video.

image

Motivation

Prompting is very important, especially for image-to-video.

Your contribution

NA

yzy-thu commented 2 days ago
from openai import OpenAI

prefix ='''
**Objective**: **Give a highly descriptive video caption based on input image and user input. **. As an expert, delve deep into the image with a discerning eye, leveraging rich creativity, meticulous thought. When describing the details of an image, include appropriate dynamic information to ensure that the video caption contains reasonable actions and plots. If user input is not empty, then the caption should be expanded according to the user's input. 

**Note**: The input image is the first frame of the video, and the output video caption should describe the motion starting from the current image. User input is optional and can be empty. 

**Note**: Don't contain camera transitions!!! Don't contain screen switching!!! Don't contain perspective shifts !!!

**Answering Style**:
Answers should be comprehensive, conversational, and use complete sentences. The answer should be in English no matter what the user's input is. Provide context where necessary and maintain a certain tone.  Begin directly without introductory phrases like "The image/video showcases" "The photo captures" and more. For example, say "A woman is on a beach", instead of "A woman is depicted in the image".

**Output Format**: "[highly descriptive image caption here]"

user input: {xx}
'''
import base64
from mimetypes import guess_type


def local_image_to_data_url(image_path):
    """Encode a local image file as a base64 data URL."""
    # Guess the MIME type of the image based on the file extension
    mime_type, _ = guess_type(image_path)
    if mime_type is None:
        mime_type = 'application/octet-stream'  # Default MIME type if none is found

    # Read and encode the image file
    with open(image_path, "rb") as image_file:
        base64_encoded_data = base64.b64encode(image_file.read()).decode('utf-8')

    # Construct the data URL
    return f"data:{mime_type};base64,{base64_encoded_data}"

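def get_answer(txt, path):
    """Return an upsampled video caption for the image at `path`, guided by the user input `txt`."""
    client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable
    # Retry until a request succeeds. Note that this loops forever on a
    # persistent error (e.g. invalid credentials), so consider capping attempts.
    while True:
        try:
            response = client.chat.completions.create(
                model="glm-4o",  # model name as posted; use a vision-capable model your endpoint serves
                messages=[
                    {
                        "role": "user",
                        "content": [
                            # Instruction prompt with the user's text substituted in
                            {"type": "text", "text": prefix.replace("{xx}", txt)},
                            # First frame of the video, sent inline as a data URL
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": local_image_to_data_url(path),
                                },
                            },
                        ],
                    }
                ],
                max_tokens=1000,
            )
            break
        except Exception as e:
            print(e)
    return response.choices[0].message.content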
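
The `while True` loop in `get_answer` retries indefinitely, which can hang on errors that never clear up (a bad API key, an unknown model name). Below is a minimal sketch of a bounded retry with exponential backoff that could wrap the request instead. `call_with_retries` and its parameters (`max_retries`, `base_delay`, `sleep`) are hypothetical names chosen for illustration; they are not part of the CogVideo repo or the `openai` package.

```python
import time


def call_with_retries(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_retries attempts have failed.

    Hypothetical helper for illustration only. `sleep` is injectable so
    the backoff can be skipped or inspected in tests.
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as e:
            last_error = e
            print(f"attempt {attempt + 1}/{max_retries} failed: {e}")
            # Back off before the next attempt: base_delay, 2x, 4x, ...
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)
    # All attempts failed: surface the last error instead of looping forever
    raise RuntimeError(f"giving up after {max_retries} attempts") from last_error
```

To use it, the `client.chat.completions.create(...)` call from `get_answer` could be placed in a small function or lambda and passed as `fn`, so that a persistent failure raises after a few attempts rather than printing the same error forever.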