We then use this as context and ask GPT-4 to generate temporal localization questions that require further reasoning to answer. We also ask GPT-4 to simultaneously generate the answer that includes the queried start and end timestamps, along with the explanation about the reasoning process.
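The step described above can be sketched as a prompt builder plus a parser for the returned timestamps. This is only a minimal illustration, assuming the context is a plain-text video description and that the model is asked to reply in a fixed `Question / Answer / Explanation` layout. The function names `build_reasoning_question_prompt` and `parse_answer_span`, the prompt wording, and the reply format are all hypothetical and are not taken from the LITA repository.

```python
import re


def build_reasoning_question_prompt(video_description, duration_s):
    """Hypothetical prompt builder; the wording is illustrative, not the authors' prompt."""
    return (
        "You are given a description of a video that is "
        f"{duration_s:.1f} seconds long.\n"
        f"Video description: {video_description}\n\n"
        "Write one temporal localization question that requires reasoning to answer. "
        "Then give the answer with start and end timestamps in seconds, "
        "followed by an explanation of the reasoning.\n"
        "Use exactly this format:\n"
        "Question: <question>\n"
        "Answer: <start> - <end>\n"
        "Explanation: <explanation>"
    )


# Assumed reply format: "Answer: <start> - <end>" with timestamps in seconds.
_ANSWER_RE = re.compile(r"Answer:\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)")


def parse_answer_span(response, duration_s):
    """Return (start, end) in seconds, or None if the span is missing or invalid."""
    match = _ANSWER_RE.search(response)
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    # Reject spans that are empty, reversed, or outside the video.
    if not (0.0 <= start < end <= duration_s):
        return None
    return start, end
```

Validating the parsed span against the video duration is one way to discard generated answers whose timestamps cannot be correct before they are used as training data.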

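def get_caption_summary_prompt(gt_caption, predicted_captions):
    """Build a prompt asking for one video description that merges the ground-truth
    caption (a string) with the frame-level captions (a list of strings)."""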
    prompt_prefix_1 = "Generate a detailed and accurate description of a video based on the given ground-truth video caption and multiple frame-level captions. " \
                      "Use the following details to create a clear and complete narrative:\n"
    prompt_prefix_2 = "\nGround-truth Video Caption: "
    prompt_prefix_3 = "\nFrame-level Captions: "
    prompt_suffix = """\n\nInstructions for writing the detailed description:
    1. Focus on describing key visual details such as appearance, motion, sequence of actions, objects involved, and interactions between elements in the video.
    2. Check for consistency between the ground-truth caption and frame-level captions, and prioritize details that match the ground-truth caption. Ignore any conflicting or irrelevant details from the frame-level captions.
    3. Leave out any descriptions about the atmosphere, mood, style, aesthetics, proficiency, or emotional tone of the video.
    4. Make sure the description is no more than 20 sentences.
    5. Combine and organize information from all captions into one clear and detailed description, removing any repeated or conflicting details.
    6. Emphasize important points like the order of events, appearance and actions of people or objects, and any significant changes or movements.
    7. Do not mention that the information comes from ground-truth captions or frame-level captions.
    8. Give a brief yet thorough description, highlighting the key visual and temporal details while keeping it clear and easy to understand.
    Use your intelligence to combine and refine the captions into a brief yet informative description of the entire video."""

    # Assemble the prompt: task instruction, ground-truth caption,
    # frame-level captions joined with "; ", then the writing instructions
    prompt = prompt_prefix_1
    prompt += f"{prompt_prefix_2}{gt_caption}{prompt_prefix_3}{'; '.join(predicted_captions)}"
    prompt += prompt_suffix

    return prompt
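Because the function above joins the frame-level captions with `'; '` and the instructions ask the model to remove repeated details, it may help to tidy the caption list before building the prompt. The helper below is a minimal sketch of that idea; `clean_frame_captions` is a hypothetical name and this pre-processing step is an assumption, not something the repository is known to do.

```python
def clean_frame_captions(captions):
    """Strip whitespace, drop empty captions, and remove case-insensitive
    duplicates while keeping the original order (hypothetical helper)."""
    seen = set()
    cleaned = []
    for caption in captions:
        # Trailing semicolons would collide with the "; " separator used when joining.
        text = caption.strip().rstrip(";").strip()
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned
```

The returned list could then be passed as `predicted_captions`, so that near-identical captions from adjacent frames do not pad the prompt.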

NVlabs / LITA
