Caption Upsampler codes for image-to-video

from openai import OpenAI

# Prompt template; the "{xx}" placeholder is replaced with the (optional) user input.
prefix = '''
**Objective**: **Give a highly descriptive video caption based on the input image and user input.** As an expert, delve deep into the image with a discerning eye, leveraging rich creativity and meticulous thought. When describing the details of an image, include appropriate dynamic information to ensure that the video caption contains reasonable actions and plots. If user input is not empty, then the caption should be expanded according to the user's input.

**Note**: The input image is the first frame of the video, and the output video caption should describe the motion starting from the current image. User input is optional and can be empty. 

**Note**: Don't contain camera transitions!!! Don't contain screen switching!!! Don't contain perspective shifts !!!

**Answering Style**:
Answers should be comprehensive, conversational, and use complete sentences. The answer should be in English no matter what the user's input is. Provide context where necessary and maintain a certain tone.  Begin directly without introductory phrases like "The image/video showcases" "The photo captures" and more. For example, say "A woman is on a beach", instead of "A woman is depicted in the image".

**Output Format**: "[highly descriptive image caption here]"

user input: {xx}
'''

import base64
from mimetypes import guess_type


def local_image_to_data_url(image_path):
    """Return the image at `image_path` encoded as a base64 data URL."""
    # Guess the MIME type of the image based on the file extension
    mime_type, _ = guess_type(image_path)
    if mime_type is None:
        mime_type = 'application/octet-stream'  # Default MIME type if none is found

    # Read and encode the image file
    with open(image_path, "rb") as image_file:
        base64_encoded_data = base64.b64encode(image_file.read()).decode('utf-8')

    # Construct the data URL
    return f"data:{mime_type};base64,{base64_encoded_data}"

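import time


def get_answer(txt, path, max_retries=5):
    """Ask the model for an upsampled video caption for the image at `path`.

    `txt` is the optional user input and may be an empty string.
    """
    client = OpenAI()
    # Build the request once so the image is read and encoded a single time.
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prefix.replace("{xx}", txt)},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": local_image_to_data_url(path),
                    },
                },
            ],
        }
    ]
    # Retry a bounded number of times instead of looping forever on a
    # persistent failure (bad API key, unknown model name, exhausted quota).
    for attempt in range(1, max_retries + 1):
        try:
            response = client.chat.completions.create(
                # Model name kept from the original; it must be one that the
                # configured endpoint actually serves.
                model="glm-4o",
                messages=messages,
                max_tokens=1000,
            )
            return response.choices[0].message.content
        except Exception as e:
            print(f"attempt {attempt}/{max_retries} failed: {e}")
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff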
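
A minimal usage sketch, assuming the functions above are in scope and that credentials for the `OpenAI()` client are configured in the environment. The prompt's **Output Format** line asks for the caption wrapped in double quotes and brackets, so the reply may need light cleanup before it is handed to the video model. `clean_caption` is a hypothetical helper that is not part of the original snippet, and the image path and user input in the commented-out call are placeholders.

```python
def clean_caption(raw):
    """Strip whitespace and one pair of wrapping quotes/brackets from a reply.

    Hypothetical helper (not in the original snippet): the prompt requests
    the caption as "[...]", so a model may echo the quotes or brackets.
    """
    caption = raw.strip()
    # Peel off at most one layer of each wrapper, outermost first.
    for opener, closer in (('"', '"'), ('[', ']')):
        if len(caption) >= 2 and caption[0] == opener and caption[-1] == closer:
            caption = caption[1:-1].strip()
    return caption


if __name__ == "__main__":
    # Offline check of the cleanup step; no API call is made here.
    print(clean_caption('  "[A woman walks along the beach.]"  '))
    # With credentials and a real image (placeholder path), one might run:
    # caption = clean_caption(get_answer("she starts to run", "first_frame.jpg"))
```

Keeping the cleanup separate from `get_answer` leaves the raw model reply available for logging, which can help when a reply does not follow the requested format.
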
THUDM / CogVideo

Caption Upsampler codes for image-to-video #421
