Closed shivanshpatel35 closed 6 months ago
Hi, it is implemented here. It is defined for the dataset class that is used in main_batch.py, but you can adapt it for the main_simple.ipynb script.
Following the above code for get_video, I made a video tensor of shape (num_frames, num_channel, height, width) and data type torch.uint8. Then I passed it to execute_code as:
execute_code(code, frames, show_intermediate_steps=False)
It generates the following code:
def execute_command(image):
image_patch = ImagePatch(image)
...
...
This leads to the following error:
Error in glip model: pic should be 2/3 dimensional. Got 4 dimensions.
It looks like this is because the model was expecting image input, but it got video input. How can I resolve this?
ImagePatch assumes the input is an image. The code should start with def execute_command(video): (this is the last line of the prompt). In order to do that, change input_type to video, not image. That will make the code LLM generate code that calls VideoSegment instead of ImagePatch.
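For context on why changing input_type fixes the generated signature: the prompt ends with an INSERT_TYPE_HERE placeholder that is filled with the configured input_type before the prompt is sent to the code LLM. A minimal sketch of that substitution (the helper below is hypothetical; only the placeholder line comes from the quoted prompt):

```python
# The quoted prompt's last line contains this placeholder; the repo fills it
# in with the configured input_type before querying the code LLM.
PROMPT_TAIL = "def execute_command(INSERT_TYPE_HERE):"

def build_signature(input_type: str) -> str:
    # hypothetical helper illustrating the substitution
    return PROMPT_TAIL.replace("INSERT_TYPE_HERE", input_type)

build_signature("video")  # "def execute_command(video):"
build_signature("image")  # "def execute_command(image):"
```

With input_type set to video, the generated function takes a video tensor, so the per-frame models never receive a 4-dimensional input directly.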
Thanks for the response! On changing the input_type to video here, I get the following error:
def execute_command(video):
    video_segment = VideoSegment(video)
    # TODO: Implement logic to determine what is happening in the video
    return "TODO: Implement logic to determine what is happening in the video"

Encountered error execute_command() takes 1 positional argument but 4 were given when trying to run with visualizations. Trying from scratch.

def execute_command(video):
    video_segment = VideoSegment(video)
    return 'TODO: Implement logic to determine what is happening in the video'
Did I change the input_type in the correct place? Also, I deleted the ImagePatch part of api.prompt to account for gpt-3.5.
What prompt file are you using?
For the codex model, the prompt file is ./prompts/api.prompt. And I changed the model to gpt-3.5-turbo. Here is the updated prompt file after deleting ImagePatch:
import math

def best_image_match(list_patches: List[ImagePatch], content: List[str], return_index=False) -> Union[ImagePatch, int]:
    """Returns the patch most likely to contain the content.
    Parameters
    ----------
    list_patches : List[ImagePatch]
    content : List[str]
        the object of interest
    return_index : bool
        if True, returns the index of the patch most likely to contain the object
    Returns
    -------
    int
        Patch most likely to contain the object
    """
    return best_image_match(list_patches, content, return_index)
def distance(patch_a: ImagePatch, patch_b: ImagePatch) -> float:
    """
    Returns the distance between the edges of two ImagePatches. If the patches overlap, it returns a negative distance
    corresponding to the negative intersection over union.
    Parameters
    ----------
    patch_a : ImagePatch
    patch_b : ImagePatch
    Examples
    --------
    # Return the qux that is closest to the foo
    >>> def execute_command(image):
    >>>     image_patch = ImagePatch(image)
    >>>     qux_patches = image_patch.find('qux')
    >>>     foo_patches = image_patch.find('foo')
    >>>     foo_patch = foo_patches[0]
    >>>     qux_patches.sort(key=lambda x: distance(x, foo_patch))
    >>>     return qux_patches[0]
    """
    return distance(patch_a, patch_b)
def bool_to_yesno(bool_answer: bool) -> str:
    return "yes" if bool_answer else "no"

def coerce_to_numeric(string):
    """
    This function takes a string as input and returns a float after removing any non-numeric characters.
    If the input string contains a range (e.g. "10-15"), it returns the first value in the range.
    """
    return coerce_to_numeric(string)
class VideoSegment:
    """A Python class containing a set of frames represented as ImagePatch objects, as well as relevant information.
    Attributes
    ----------
    video : torch.Tensor
        A tensor of the original video.
    start : int
        An int describing the starting frame in this video segment with respect to the original video.
    end : int
        An int describing the ending frame in this video segment with respect to the original video.
    num_frames->int
        An int containing the number of frames in the video segment.
    Methods
    -------
    frame_iterator->Iterator[ImagePatch]
    trim(start, end)->VideoSegment
        Returns a new VideoSegment containing a trimmed version of the original video at the [start, end] segment.
    frame_iterator->Iterator[ImagePatch]
        Returns an iterator over the frames in the video segment.
    """

    def __init__(self, video: torch.Tensor, start: int = None, end: int = None, parent_start=0, queues=None):
        """Initializes a VideoSegment object by trimming the video at the given [start, end] times and stores the
        start and end times as attributes. If no times are provided, the video is left unmodified, and the times are
        set to the beginning and end of the video.
        Parameters
        ----------
        video : torch.Tensor
            A tensor of the original video.
        start : int
            An int describing the starting frame in this video segment with respect to the original video.
        end : int
            An int describing the ending frame in this video segment with respect to the original video.
        """
        if start is None and end is None:
            self.trimmed_video = video
            self.start = 0
            self.end = video.shape[0]  # duration
        else:
            self.trimmed_video = video[start:end]
            if start is None:
                start = 0
            if end is None:
                end = video.shape[0]
            self.start = start + parent_start
            self.end = end + parent_start
        self.num_frames = self.trimmed_video.shape[0]

    def frame_from_index(self, index) -> ImagePatch:
        """Returns the frame at position 'index', as an ImagePatch object.
        Examples
        --------
        >>> # Is there a foo in the frame bar appears?
        >>> def execute_command(video) -> bool:
        >>>     video_segment = VideoSegment(video)
        >>>     for i, frame in enumerate(video_segment.frame_iterator()):
        >>>         if frame.exists("bar"):
        >>>             frame_after = video_segment.frame_from_index(i + 1)
        >>>             return frame_after.exists("foo")
        """
        return ImagePatch(self.trimmed_video[index])

    def trim(self, start: Union[int, None] = None, end: Union[int, None] = None) -> VideoSegment:
        """Returns a new VideoSegment containing a trimmed version of the original video at the [start, end]
        segment.
        Parameters
        ----------
        start : Union[int, None]
            An int describing the starting frame in this video segment with respect to the original video.
        end : Union[int, None]
            An int describing the ending frame in this video segment with respect to the original video.
        Examples
        --------
        >>> # Return the second half of the video
        >>> def execute_command(video):
        >>>     video_segment = VideoSegment(video)
        >>>     video_second_half = video_segment.trim(video_segment.num_frames // 2, video_segment.num_frames)
        >>>     return video_second_half
        """
        if start is not None:
            start = max(start, 0)
        if end is not None:
            end = min(end, self.num_frames)
        return VideoSegment(self.trimmed_video, start, end, self.start)

    def frame_iterator(self) -> Iterator[ImagePatch]:
        """Returns an iterator over the frames in the video segment.
        Examples
        --------
        >>> # Return the frame when the kid kisses the cat
        >>> def execute_command(video):
        >>>     video_segment = VideoSegment(video)
        >>>     for i, frame in enumerate(video_segment.frame_iterator()):
        >>>         if frame.exists("kid") and frame.exists("cat") and frame.simple_query("Is the kid kissing the cat?") == "yes":
        >>>             return frame
        """
        for i in range(self.num_frames):
            yield self.frame_from_index(i)

# Examples of how to use the API
# INSERT_QUERY_HERE
def execute_command(INSERT_TYPE_HERE):
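As an aside, the start/end bookkeeping in the quoted __init__ and trim can be seen in isolation with a torch-free analogue. MiniVideoSegment below is hypothetical (not part of the repo); it mirrors the same logic on a plain list of frames:

```python
# Hypothetical, torch-free stand-in that mirrors the start/end bookkeeping
# of the quoted VideoSegment class, using a plain list instead of a tensor.
class MiniVideoSegment:
    def __init__(self, frames, start=None, end=None, parent_start=0):
        if start is None and end is None:
            self.frames = frames  # untrimmed: keep the whole video
            self.start = 0
            self.end = len(frames)
        else:
            self.frames = frames[start:end]
            if start is None:
                start = 0
            if end is None:
                end = len(frames)
            # offsets stay relative to the ORIGINAL video, not the trimmed one
            self.start = start + parent_start
            self.end = end + parent_start
        self.num_frames = len(self.frames)

    def trim(self, start=None, end=None):
        if start is not None:
            start = max(start, 0)
        if end is not None:
            end = min(end, self.num_frames)
        # passing self.start as parent_start keeps absolute frame indices
        return MiniVideoSegment(self.frames, start, end, self.start)

seg = MiniVideoSegment(list(range(10)))
half = seg.trim(5, 10)  # second half
# half.num_frames == 5; half.start == 5 (absolute index in the original video)
sub = half.trim(0, 2)
# sub.start == 5, sub.end == 7: nested trims keep original-video offsets
```

This is why generated code can chain trims and still report frame positions in terms of the original video.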
Could you provide an example of a query where you get that result? I would not remove the whole ImagePatch, as it is also used for video operations (per-frame operations); just remove the methods that will not be needed in your application. Also, for code LLMs, you will probably get better results using the chatapi.prompt file (you will have to add the VideoSegment class if you need video).
Hello, adding something that worked for me here. Hopefully it will be helpful: first, using chatapi.prompt, generate the code as if it were for an image. Then, in another prompt file, provide only the VideoSegment class along with the generated image code, and ask ChatGPT to "Modify the code written for image to videos using the VideoSegment class". I'd be glad to provide more specifics!
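The two-stage approach above can be sketched as follows. The helper and its signature are my own (hypothetical); `chat` stands in for whatever ChatGPT client is used, and the placeholder names match the ones in the prompt files quoted in this thread:

```python
# Hypothetical helper sketching the two-stage prompting described above.
# `chat` is any callable with the contract "prompt string in, completion
# string out"; the INSERT_* placeholder names come from the quoted prompts.
def image_to_video_code(query, chat, image_prompt, video_prompt):
    # Stage 1: generate code as if the input were an image.
    image_code = chat(image_prompt.replace("INSERT_QUERY_HERE", query))
    # Stage 2: show only the VideoSegment class plus the generated image
    # code, and ask the model to port it to videos.
    stage2 = (video_prompt
              .replace("INSERT_IMAGEPATCH_CODE_HERE", image_code)
              .replace("INSERT_QUERY_HERE", query))
    return chat(stage2)
```

With a real client, `chat` would wrap a chat-completions call; keeping it as a plain callable makes the two-stage flow easy to test and swap out.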
Hello, I'm curious about something in your code base. It seems that chatapi.prompt is only used once. You mentioned another prompt - could you point me to where in the code it is utilized? I am also currently trying to input a video.
It is not mentioned in the code; you will have to add it manually. I put it here since it worked for me and was a way to use ChatGPT with videos.
Here is the prompt after using chatapi.prompt:
import math
class VideoSegment:
    """A Python class containing a set of frames represented as ImagePatch objects, as well as relevant information.
    Attributes
    ----------
    video : torch.Tensor
        A tensor of the original video.
    start : int
        An int describing the starting frame in this video segment with respect to the original video.
    end : int
        An int describing the ending frame in this video segment with respect to the original video.
    num_frames->int
        An int containing the number of frames in the video segment.
    Methods
    -------
    frame_iterator->Iterator[ImagePatch]
    trim(start, end)->VideoSegment
        Returns a new VideoSegment containing a trimmed version of the original video at the [start, end] segment.
    frame_iterator->Iterator[ImagePatch]
        Returns an iterator over the frames in the video segment.
    """

    def __init__(self, video: torch.Tensor, start: int = None, end: int = None, parent_start=0, queues=None):
        """Initializes a VideoSegment object by trimming the video at the given [start, end] times and stores the
        start and end times as attributes. If no times are provided, the video is left unmodified, and the times are
        set to the beginning and end of the video.
        Parameters
        ----------
        video : torch.Tensor
            A tensor of the original video.
        start : int
            An int describing the starting frame in this video segment with respect to the original video.
        end : int
            An int describing the ending frame in this video segment with respect to the original video.
        """
        if start is None and end is None:
            self.trimmed_video = video
            self.start = 0
            self.end = video.shape[0]  # duration
        else:
            self.trimmed_video = video[start:end]
            if start is None:
                start = 0
            if end is None:
                end = video.shape[0]
            self.start = start + parent_start
            self.end = end + parent_start
        self.num_frames = self.trimmed_video.shape[0]

    def frame_from_index(self, index) -> ImagePatch:
        """Returns the frame at position 'index', as an ImagePatch object.
        Examples
        --------
        >>> # Is there a foo in the frame bar appears?
        >>> def execute_command(video) -> bool:
        >>>     video_segment = VideoSegment(video)
        >>>     for i, frame in enumerate(video_segment.frame_iterator()):
        >>>         if frame.exists("bar"):
        >>>             frame_after = video_segment.frame_from_index(i + 1)
        >>>             return frame_after.exists("foo")
        """
        return ImagePatch(self.trimmed_video[index])

    def trim(self, start: Union[int, None] = None, end: Union[int, None] = None) -> VideoSegment:
        """Returns a new VideoSegment containing a trimmed version of the original video at the [start, end]
        segment.
        Parameters
        ----------
        start : Union[int, None]
            An int describing the starting frame in this video segment with respect to the original video.
        end : Union[int, None]
            An int describing the ending frame in this video segment with respect to the original video.
        Examples
        --------
        >>> # Return the second half of the video
        >>> def execute_command(video):
        >>>     video_segment = VideoSegment(video)
        >>>     video_second_half = video_segment.trim(video_segment.num_frames // 2, video_segment.num_frames)
        >>>     return video_second_half
        """
        if start is not None:
            start = max(start, 0)
        if end is not None:
            end = min(end, self.num_frames)
        return VideoSegment(self.trimmed_video, start, end, self.start)

    def frame_iterator(self) -> Iterator[ImagePatch]:
        """Returns an iterator over the frames in the video segment.
        Examples
        --------
        >>> # Return the frame when the kid kisses the cat
        >>> def execute_command(video):
        >>>     video_segment = VideoSegment(video)
        >>>     for i, frame in enumerate(video_segment.frame_iterator()):
        >>>         if frame.exists("kid") and frame.exists("cat") and frame.simple_query("Is the kid kissing the cat?") == "yes":
        >>>             return frame
        """
        for i in range(self.num_frames):
            yield self.frame_from_index(i)
You are provided with code that uses an ImagePatch class to answer the query for an image. Modify the function, using Python and the VideoSegment class (above), so that it can be executed to answer the query for a given video. Collect the result from all the frames and return the answer.
Consider the following guidelines:
- Use base Python (comparison, sorting) for basic logical operations, left/right/up/down, math, etc.
- Use the llm_query function to access external information and answer informational questions not concerning the image.
Code: INSERT_IMAGEPATCH_CODE_HERE
Query: INSERT_QUERY_HERE
Hi, is there a load_video function, similar to the load_image function, that can load a video to be sent to the model? Thanks!
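I could not find a load_video in the repo, but a sketch along these lines produces the (num_frames, channels, height, width) uint8 layout described earlier in this thread. The function name and signature are my own, and the file-path branch assumes OpenCV (cv2) is installed:

```python
import numpy as np

def load_video(path_or_frames, max_frames=None):
    """Hypothetical load_video (not part of the repo): returns a uint8 array
    of shape (num_frames, channels, height, width)."""
    if isinstance(path_or_frames, str):
        import cv2  # assumption: OpenCV is available; only needed for paths
        cap = cv2.VideoCapture(path_or_frames)
        frames = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # OpenCV decodes to BGR; convert to the RGB order models expect
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if max_frames is not None and len(frames) >= max_frames:
                break
        cap.release()
    else:
        # already an iterable of (height, width, channels) frames
        frames = list(path_or_frames)
    video = np.stack(frames).astype(np.uint8)  # (T, H, W, C)
    return np.transpose(video, (0, 3, 1, 2))   # (T, C, H, W)
```

torch.from_numpy(load_video(path)) would then give a torch.uint8 tensor of shape (num_frames, num_channel, height, width), matching the one built via get_video at the top of this thread.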