cvlab-columbia / viper

Code for the paper "ViperGPT: Visual Inference via Python Execution for Reasoning"

How to input videos? #20

Closed shivanshpatel35 closed 6 months ago

shivanshpatel35 commented 1 year ago

Hi,

Is there a load_video function, similar to the load_image function, that can load a video to be sent to the model?

Thanks

surisdi commented 1 year ago

Hi, it is implemented here. It is defined on the dataset class used by main_batch.py, but you can adapt it for the main_simple.ipynb notebook.

shivanshpatel35 commented 1 year ago

Following the above code for get_video, I made a video tensor of shape (num_frames, num_channel, height, width) and data type torch.uint8. Then I passed it to execute_code as:

execute_code(code, frames, show_intermediate_steps=False)

It generates the following code:

def execute_command(image):
    image_patch = ImagePatch(image)
    ...
    ...

This leads to the following error

Error in glip model: pic should be 2/3 dimensional. Got 4 dimensions.

It looks like this is because the model was expecting image input, but it got video input. How can I resolve this?

surisdi commented 1 year ago

ImagePatch assumes the input is an image. For a video, the generated code should start with def execute_command(video): (this signature is the last line of the prompt). To get that, set input_type to video instead of image. The code LLM will then generate code that calls VideoSegment instead of ImagePatch.
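
To illustrate the mechanism: the prompt file quoted later in this thread ends with the placeholders INSERT_QUERY_HERE and INSERT_TYPE_HERE. Below is a minimal sketch of how such a template could be filled, assuming plain string substitution. The helper name fill_prompt is hypothetical and the repository's own substitution logic may differ.

```python
# Hypothetical helper; not the repository's actual function.
# PROMPT_TAIL mirrors the last three lines of the prompt file quoted in this thread.
PROMPT_TAIL = (
    "# Examples of how to use the API\n"
    "# INSERT_QUERY_HERE\n"
    "def execute_command(INSERT_TYPE_HERE):"
)


def fill_prompt(template: str, query: str, input_type: str = "image") -> str:
    """Replace both placeholders so the prompt ends with the desired signature."""
    if input_type not in ("image", "video"):
        raise ValueError(f"unsupported input_type: {input_type!r}")
    return (
        template
        .replace("INSERT_QUERY_HERE", query)
        .replace("INSERT_TYPE_HERE", input_type)
    )


prompt = fill_prompt(PROMPT_TAIL, "What is happening in the video?", "video")
print(prompt.splitlines()[-1])  # def execute_command(video):
```

Because the model continues the text from that final line, the argument name in the signature nudges it toward VideoSegment(video) rather than ImagePatch(image).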

shivanshpatel35 commented 1 year ago

Thanks for the response! On changing the input_type to video here, I get the following error:

def execute_command(video):
    video_segment = VideoSegment(video)
    # TODO: Implement logic to determine what is happening in the video
    return "TODO: Implement logic to determine what is happening in the video"
<IPython.core.display.HTML object>
Encountered error execute_command() takes 1 positional argument but 4 were given when trying to run with visualizations. Trying from scratch.

def execute_command(video):
    video_segment = VideoSegment(video)
    return 'TODO: Implement logic to determine what is happening in the video'

Did I change input_type in the correct place? Also, I deleted the ImagePatch part of api.prompt to account for gpt-3.5.
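
For reference, the "takes 1 positional argument but 4 were given" text in the log is Python's standard TypeError for a one-parameter function called with four positional arguments. The sketch below only reproduces that message in isolation. The stand-in values are placeholders, and which call in the code base passes the extra arguments is not shown in this thread.

```python
def execute_command(video):
    # Stand-in for the generated function: it accepts exactly one argument.
    return "placeholder answer"


video = object()         # stand-in for the video tensor
extra_args = (1, 2, 3)   # placeholders for three additional positional arguments

try:
    execute_command(video, *extra_args)
except TypeError as err:
    message = str(err)
    print(message)  # ... takes 1 positional argument but 4 were given

# Calling with the single expected argument works.
result = execute_command(video)
```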

surisdi commented 1 year ago

What prompt file are you using?

shivanshpatel35 commented 1 year ago

For the codex model, the prompt file is ./prompt/api.prompt. I changed the model to gpt-3.5-turbo. Here is the updated prompt file after deleting ImagePatch:

import math

def best_image_match(list_patches: List[ImagePatch], content: List[str], return_index=False) -> Union[ImagePatch, int]:
    """Returns the patch most likely to contain the content.
    Parameters
    ----------
    list_patches : List[ImagePatch]
    content : List[str]
        the object of interest
    return_index : bool
        if True, returns the index of the patch most likely to contain the object

    Returns
    -------
    int
        Patch most likely to contain the object
    """
    return best_image_match(list_patches, content, return_index)

def distance(patch_a: ImagePatch, patch_b: ImagePatch) -> float:
    """
    Returns the distance between the edges of two ImagePatches. If the patches overlap, it returns a negative distance
    corresponding to the negative intersection over union.

    Parameters
    ----------
    patch_a : ImagePatch
    patch_b : ImagePatch

    Examples
    --------
    # Return the qux that is closest to the foo
    >>> def execute_command(image):
    >>>     image_patch = ImagePatch(image)
    >>>     qux_patches = image_patch.find('qux')
    >>>     foo_patches = image_patch.find('foo')
    >>>     foo_patch = foo_patches[0]
    >>>     qux_patches.sort(key=lambda x: distance(x, foo_patch))
    >>>     return qux_patches[0]
    """
    return distance(patch_a, patch_b)

def bool_to_yesno(bool_answer: bool) -> str:
    return "yes" if bool_answer else "no"

def coerce_to_numeric(string):
    """
    This function takes a string as input and returns a float after removing any non-numeric characters.
    If the input string contains a range (e.g. "10-15"), it returns the first value in the range.
    """
    return coerce_to_numeric(string)

class VideoSegment:
    """A Python class containing a set of frames represented as ImagePatch objects, as well as relevant information.
    Attributes
    ----------
    video : torch.Tensor
        A tensor of the original video.
    start : int
        An int describing the starting frame in this video segment with respect to the original video.
    end : int
        An int describing the ending frame in this video segment with respect to the original video.
    num_frames->int
        An int containing the number of frames in the video segment.

    Methods
    -------
    frame_from_index(index)->ImagePatch
        Returns the frame at position 'index', as an ImagePatch object.
    trim(start, end)->VideoSegment
        Returns a new VideoSegment containing a trimmed version of the original video at the [start, end] segment.
    frame_iterator->Iterator[ImagePatch]
        Returns an iterator over the frames in the video segment.
    """

    def __init__(self, video: torch.Tensor, start: int = None, end: int = None, parent_start=0, queues=None):
        """Initializes a VideoSegment object by trimming the video at the given [start, end] times and stores the
        start and end times as attributes. If no times are provided, the video is left unmodified, and the times are
        set to the beginning and end of the video.

        Parameters
        -------
        video : torch.Tensor
            A tensor of the original video.
        start : int
            An int describing the starting frame in this video segment with respect to the original video.
        end : int
            An int describing the ending frame in this video segment with respect to the original video.
        """

        if start is None and end is None:
            self.trimmed_video = video
            self.start = 0
            self.end = video.shape[0]  # duration
        else:
            self.trimmed_video = video[start:end]
            if start is None:
                start = 0
            if end is None:
                end = video.shape[0]
            self.start = start + parent_start
            self.end = end + parent_start

        self.num_frames = self.trimmed_video.shape[0]

    def frame_from_index(self, index) -> ImagePatch:
        """Returns the frame at position 'index', as an ImagePatch object.

        Examples
        -------
        >>> # Is there a foo in the frame bar appears?
        >>> def execute_command(video)->bool:
        >>>     video_segment = VideoSegment(video)
        >>>     for i, frame in enumerate(video_segment.frame_iterator()):
        >>>         if frame.exists("bar"):
        >>>             frame_after = video_segment.frame_from_index(i+1)
        >>>             return frame_after.exists("foo")
        """
        return ImagePatch(self.trimmed_video[index])

    def trim(self, start: Union[int, None] = None, end: Union[int, None] = None) -> VideoSegment:
        """Returns a new VideoSegment containing a trimmed version of the original video at the [start, end]
        segment.

        Parameters
        ----------
        start : Union[int, None]
            An int describing the starting frame in this video segment with respect to the original video.
        end : Union[int, None]
            An int describing the ending frame in this video segment with respect to the original video.

        Examples
        --------
        >>> # Return the second half of the video
        >>> def execute_command(video):
        >>>     video_segment = VideoSegment(video)
        >>>     video_second_half = video_segment.trim(video_segment.num_frames // 2, video_segment.num_frames)
        >>>     return video_second_half
        """
        if start is not None:
            start = max(start, 0)
        if end is not None:
            end = min(end, self.num_frames)

        return VideoSegment(self.trimmed_video, start, end, self.start)

    def frame_iterator(self) -> Iterator[ImagePatch]:
        """Returns an iterator over the frames in the video segment.

        Examples
        -------
        >>> # Return the frame when the kid kisses the cat
        >>> def execute_command(video):
        >>>     video_segment = VideoSegment(video)
        >>>     for i, frame in enumerate(video_segment.frame_iterator()):
        >>>         if frame.exists("kid") and frame.exists("cat") and frame.simple_query("Is the kid kissing the cat?") == "yes":
        >>>             return frame
        """
        for i in range(self.num_frames):
            yield self.frame_from_index(i)

# Examples of how to use the API
# INSERT_QUERY_HERE
def execute_command(INSERT_TYPE_HERE):
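
To make the start/end bookkeeping in trim concrete, here is a small runnable sketch. It copies the __init__ and trim logic from the prompt above (omitting the unused queues argument) and swaps torch.Tensor for FrameList, a hypothetical stand-in that supports only .shape[0] and slicing, so it runs without PyTorch.

```python
from typing import Optional


class FrameList(list):
    """Hypothetical stand-in for torch.Tensor: supports .shape[0] and slicing."""

    @property
    def shape(self):
        return (len(self),)

    def __getitem__(self, key):
        item = super().__getitem__(key)
        # Slices must stay FrameList so that .shape keeps working.
        return FrameList(item) if isinstance(key, slice) else item


class VideoSegment:
    """Same __init__ and trim logic as in the prompt above."""

    def __init__(self, video, start: Optional[int] = None,
                 end: Optional[int] = None, parent_start: int = 0):
        if start is None and end is None:
            self.trimmed_video = video
            self.start = 0
            self.end = video.shape[0]
        else:
            self.trimmed_video = video[start:end]
            if start is None:
                start = 0
            if end is None:
                end = video.shape[0]
            # Offsets are stored relative to the original video.
            self.start = start + parent_start
            self.end = end + parent_start
        self.num_frames = self.trimmed_video.shape[0]

    def trim(self, start=None, end=None):
        if start is not None:
            start = max(start, 0)
        if end is not None:
            end = min(end, self.num_frames)
        return VideoSegment(self.trimmed_video, start, end, self.start)


video = FrameList(range(10))      # ten dummy "frames" numbered 0..9
segment = VideoSegment(video)
second_half = segment.trim(segment.num_frames // 2, segment.num_frames)
inner = second_half.trim(1, 3)    # indices are relative to second_half

print(second_half.start, second_half.end, second_half.num_frames)  # 5 10 5
print(inner.start, inner.end, inner.num_frames)                    # 6 8 2
print(list(inner.trimmed_video))                                   # [6, 7]
```

The arguments to trim are relative to the segment being trimmed, while the start and end attributes are accumulated through parent_start and so refer to positions in the original video.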

surisdi commented 1 year ago

Could you provide an example of a query where you get that result? I would not remove the whole ImagePatch, as it is also used for video operations (per-frame operations). Just remove the methods that will not be needed in your application.

Also, for code LLMs, you will probably get better results with the chatapi.prompt file (you will have to add VideoSegment to it if you need video).
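
One possible way to drop unneeded methods while keeping the class itself is to filter the class body with Python's ast module. This is only a sketch: PROMPT_SOURCE is a toy stand-in with placeholder bodies, not the real ImagePatch definition, and keep_methods is a hypothetical helper.

```python
import ast

# Toy stand-in for the ImagePatch section of a prompt file; the method bodies
# are placeholders, not the repository's implementation.
PROMPT_SOURCE = '''
class ImagePatch:
    """A crop of an image."""

    def find(self, object_name):
        """Returns patches matching object_name."""
        ...

    def exists(self, object_name):
        """Returns True if the object is found."""
        ...

    def simple_query(self, question):
        """Answers a basic question about the patch."""
        ...
'''


def keep_methods(source: str, class_name: str, keep: set) -> str:
    """Return source with every method of class_name removed unless it is in keep."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            node.body = [
                child for child in node.body
                if not isinstance(child, ast.FunctionDef) or child.name in keep
            ]
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+


trimmed = keep_methods(PROMPT_SOURCE, "ImagePatch", {"exists", "simple_query"})
print(trimmed)
```

Note that ast.unparse normalizes formatting and drops comments, and that a prompt file ending in a body-less def execute_command(...): line will not parse as a whole, so this would have to be applied to the class section only. Editing the prompt file by hand is just as valid.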

devaansh100 commented 1 year ago

Hello, adding something that worked for me here. Hopefully it will be helpful:

First, using chatapi.prompt, generate the code as if it were for an image. Then, in another prompt file, provide only the VideoSegment class along with the generated image code, and ask ChatGPT to "Modify the code written for image to videos using the VideoSegment class".

Would be glad to provide more specifics!

knightyxp commented 1 year ago

Hello, I'm curious about something in your code base. It seems that chatapi.prompt is only used once. You mentioned another prompt; could you point me to where in the code it is used? I am also currently trying to input a video.

devaansh100 commented 1 year ago

It is not in the code; you will have to add it manually. I posted it here because it worked for me and is a way to use ChatGPT with videos.

Here is the second prompt, to be used after generating the image code with chatapi.prompt:

import math

class VideoSegment:
    """A Python class containing a set of frames represented as ImagePatch objects, as well as relevant information.
    Attributes
    ----------
    video : torch.Tensor
        A tensor of the original video.
    start : int
        An int describing the starting frame in this video segment with respect to the original video.
    end : int
        An int describing the ending frame in this video segment with respect to the original video.
    num_frames->int
        An int containing the number of frames in the video segment.

    Methods
    -------
    frame_from_index(index)->ImagePatch
        Returns the frame at position 'index', as an ImagePatch object.
    trim(start, end)->VideoSegment
        Returns a new VideoSegment containing a trimmed version of the original video at the [start, end] segment.
    frame_iterator->Iterator[ImagePatch]
        Returns an iterator over the frames in the video segment.
    """

    def __init__(self, video: torch.Tensor, start: int = None, end: int = None, parent_start=0, queues=None):
        """Initializes a VideoSegment object by trimming the video at the given [start, end] times and stores the
        start and end times as attributes. If no times are provided, the video is left unmodified, and the times are
        set to the beginning and end of the video.

        Parameters
        -------
        video : torch.Tensor
            A tensor of the original video.
        start : int
            An int describing the starting frame in this video segment with respect to the original video.
        end : int
            An int describing the ending frame in this video segment with respect to the original video.
        """

        if start is None and end is None:
            self.trimmed_video = video
            self.start = 0
            self.end = video.shape[0]  # duration
        else:
            self.trimmed_video = video[start:end]
            if start is None:
                start = 0
            if end is None:
                end = video.shape[0]
            self.start = start + parent_start
            self.end = end + parent_start

        self.num_frames = self.trimmed_video.shape[0]

    def frame_from_index(self, index) -> ImagePatch:
        """Returns the frame at position 'index', as an ImagePatch object.

        Examples
        -------
        >>> # Is there a foo in the frame bar appears?
        >>> def execute_command(video)->bool:
        >>>     video_segment = VideoSegment(video)
        >>>     for i, frame in enumerate(video_segment.frame_iterator()):
        >>>         if frame.exists("bar"):
        >>>             frame_after = video_segment.frame_from_index(i+1)
        >>>             return frame_after.exists("foo")
        """
        return ImagePatch(self.trimmed_video[index])

    def trim(self, start: Union[int, None] = None, end: Union[int, None] = None) -> VideoSegment:
        """Returns a new VideoSegment containing a trimmed version of the original video at the [start, end]
        segment.

        Parameters
        ----------
        start : Union[int, None]
            An int describing the starting frame in this video segment with respect to the original video.
        end : Union[int, None]
            An int describing the ending frame in this video segment with respect to the original video.

        Examples
        --------
        >>> # Return the second half of the video
        >>> def execute_command(video):
        >>>     video_segment = VideoSegment(video)
        >>>     video_second_half = video_segment.trim(video_segment.num_frames // 2, video_segment.num_frames)
        >>>     return video_second_half
        """
        if start is not None:
            start = max(start, 0)
        if end is not None:
            end = min(end, self.num_frames)

        return VideoSegment(self.trimmed_video, start, end, self.start)

    def frame_iterator(self) -> Iterator[ImagePatch]:
        """Returns an iterator over the frames in the video segment.

        Examples
        -------
        >>> # Return the frame when the kid kisses the cat
        >>> def execute_command(video):
        >>>     video_segment = VideoSegment(video)
        >>>     for i, frame in enumerate(video_segment.frame_iterator()):
        >>>         if frame.exists("kid") and frame.exists("cat") and frame.simple_query("Is the kid kissing the cat?") == "yes":
        >>>             return frame
        """
        for i in range(self.num_frames):
            yield self.frame_from_index(i)

You are provided with code which uses an ImagePatch class which answers the query for an image. Modify the function using Python and the VideoSegment class (above) that could be executed to provide an answer to the query for a given video. Collect the result from all the frames and return the answer.

Consider the following guidelines:
- Use base Python (comparison, sorting) for basic logical operations, left/right/up/down, math, etc.
- Use the llm_query function to access external information and answer informational questions not concerning the image.

Code: INSERT_IMAGEPATCH_CODE_HERE
Query: INSERT_QUERY_HERE
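
The two-step flow described above could be wired together roughly as follows. This is a sketch under assumptions: the function name image_code_to_video_code, the generate callable (any function that sends a prompt to a chat model and returns its text), and the shortened templates are all hypothetical stand-ins. Only the placeholder names come from the prompt above.

```python
from typing import Callable, List

# Shortened stand-ins for the two prompt files; in practice the full
# chatapi.prompt text and the VideoSegment class definition would be used.
IMAGE_PROMPT = "# INSERT_QUERY_HERE\ndef execute_command(image):"
VIDEO_PROMPT = (
    "class VideoSegment: (definition omitted in this sketch)\n"
    "Modify the function using Python and the VideoSegment class (above).\n"
    "Code: INSERT_IMAGEPATCH_CODE_HERE\n"
    "Query: INSERT_QUERY_HERE\n"
)


def image_code_to_video_code(query: str, image_prompt: str, video_prompt: str,
                             generate: Callable[[str], str]) -> str:
    """Stage 1: generate image code. Stage 2: ask the model to adapt it to video."""
    image_code = generate(image_prompt.replace("INSERT_QUERY_HERE", query))
    second_prompt = (
        video_prompt
        .replace("INSERT_QUERY_HERE", query)
        .replace("INSERT_IMAGEPATCH_CODE_HERE", image_code)
    )
    return generate(second_prompt)


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a chat-model call: records the prompt, returns canned code."""
    calls.append(prompt)
    if "VideoSegment" in prompt:
        return "def execute_command(video):\n    ..."
    return "def execute_command(image):\n    ..."


video_code = image_code_to_video_code(
    "Is there a cat?", IMAGE_PROMPT, VIDEO_PROMPT, fake_llm
)
print(video_code.splitlines()[0])  # def execute_command(video):
```

Replacing fake_llm with a real model call is the only change needed to try this; the returned code can then be passed to the execution step with the video tensor as input.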