Closed: WorkTimer closed this issue 1 week ago.
Hi @WorkTimer, could you please explain this in a bit more detail? If there are any code references, please share them as well.
Thank you for your response!
In the current CVAT serverless API code, processing is done on individual image frames, as shown in `serverless/pytorch/facebookresearch/sam/nuclio/main.py` (https://github.com/cvat-ai/cvat/blob/develop/serverless/pytorch/facebookresearch/sam/nuclio/main.py):

```python
def handler(context, event):
    context.logger.info("call handler")
    data = event.body
    buf = io.BytesIO(base64.b64decode(data["image"]))
    image = Image.open(buf)
    image = image.convert("RGB")  # the image is converted to RGB
    features = context.user_data.model.handle(image)
    return context.Response(body=json.dumps({
            'blob': base64.b64encode((features.cpu().numpy() if features.is_cuda else features.numpy())).decode(),
        }),
        headers={},
        content_type='application/json',
        status_code=200
    )
```
In this code, the API can only receive and process a single image frame.
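For illustration, here is a minimal sketch of what the request/response contract of this single-frame handler looks like from the client side. The helper names are my own, not part of CVAT; the payload shape simply follows the `data["image"]` access in the handler above:

```python
import base64
import json


def build_single_frame_payload(image_bytes: bytes) -> str:
    """Package one image as the JSON body the handler above expects."""
    encoded = base64.b64encode(image_bytes).decode()
    return json.dumps({"image": encoded})


def decode_payload_image(body: dict) -> bytes:
    """Mirror the handler's first steps: recover the raw image bytes."""
    return base64.b64decode(body["image"])
```

Every call carries exactly one image, so a video must currently be split into frames and sent one request at a time.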
However, the new SAM2 model supports predicting on an entire video at once, as demonstrated in this example notebook, which shows how to perform segmentation on a whole video: https://github.com/facebookresearch/segment-anything-2/blob/main/notebooks/video_predictor_example.ipynb
My suggestion is to enhance the CVAT serverless API to accept and process complete video files rather than just individual frames. This would allow models like SAM2, which are designed for video processing, to be directly integrated into CVAT for automatic segmentation and annotation of entire videos.
This update to the CVAT serverless API enables processing of entire video files using the SAM2 model, allowing for automatic segmentation of video content. The implementation follows the code structure and conventions found in CVAT's `serverless/pytorch/facebookresearch/sam/nuclio/main.py`.
```python
import base64
import json
import os
import tempfile

import cv2
import numpy as np


def init_context(context):
    from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

    model_type = "vit_b"
    checkpoint_path = "/opt/nuclio/sam_vit_b.pth"
    sam = sam_model_registry[model_type](checkpoint=checkpoint_path)
    mask_generator = SamAutomaticMaskGenerator(sam)
    context.user_data.model = mask_generator


def handler(context, event):
    context.logger.info("Handling request for video segmentation")
    try:
        # Decode the video from base64. cv2.VideoCapture cannot read from
        # an in-memory buffer, so write the bytes to a temporary file first.
        data = event.body
        video_data = base64.b64decode(data["video"])
        with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp_in:
            tmp_in.write(video_data)
            input_path = tmp_in.name

        # Read video frames using OpenCV
        video = cv2.VideoCapture(input_path)
        fps = video.get(cv2.CAP_PROP_FPS) or 24
        frames = []
        while video.isOpened():
            ret, frame = video.read()
            if not ret:
                break
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        video.release()
        os.unlink(input_path)

        # Process frames with the SAM model
        context.logger.info(f"Processing {len(frames)} frames")
        segmented_frames = []
        for frame in frames:
            masks = context.user_data.model.generate(frame)
            mask_image = np.zeros_like(frame)
            # SamAutomaticMaskGenerator returns class-agnostic masks,
            # so paint each mask with a distinct (arbitrary) label value.
            for label, mask in enumerate(masks, start=1):
                mask_image[mask['segmentation']] = label
            segmented_frames.append(mask_image)

        # Encode the segmented frames into a video. cv2.VideoWriter also
        # requires a file path, so use a second temporary file.
        height, width, _ = segmented_frames[0].shape
        output_fd, output_path = tempfile.mkstemp(suffix=".mp4")
        os.close(output_fd)
        video_writer = cv2.VideoWriter(
            output_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, (width, height)
        )
        for frame in segmented_frames:
            video_writer.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))
        video_writer.release()

        # Base64-encode the output video
        with open(output_path, "rb") as f:
            encoded_video = base64.b64encode(f.read()).decode()
        os.unlink(output_path)

        return context.Response(
            body=json.dumps({'video': encoded_video}),
            headers={},
            content_type='application/json',
            status_code=200
        )
    except Exception as e:
        context.logger.error(f"Error processing video: {str(e)}")
        return context.Response(
            body=json.dumps({'error': str(e)}),
            headers={},
            content_type='application/json',
            status_code=500
        )
```
- `init_context(context)`: loads the model by creating a `SamAutomaticMaskGenerator`; the model type (`"vit_b"`) and the checkpoint path (`"/opt/nuclio/sam_vit_b.pth"`) are specified. The generator is stored in `context.user_data` to make it accessible during the handler's execution.
- `handler(context, event)`:
  - `cv2.VideoCapture`: used to read the frames of the decoded video.
  - `cv2.VideoWriter`: used to create the output video from the segmented frames.

I have made some changes. Can you review them? I'm open to any suggestions.
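For completeness, the client side of the proposed endpoint would mirror this encoding. A minimal sketch, where the helper names are my own and the payload/response shapes follow the `data["video"]` and `{'video': ...}` fields used in the handler above; in practice the payload would be POSTed to the deployed nuclio function's HTTP trigger:

```python
import base64
import json


def build_video_payload(video_bytes: bytes) -> str:
    """Package raw video bytes as the JSON body the proposed handler reads."""
    return json.dumps({"video": base64.b64encode(video_bytes).decode()})


def extract_video_from_response(response_body: str) -> bytes:
    """Decode the segmented video returned in the handler's JSON response."""
    return base64.b64decode(json.loads(response_body)["video"])
```

One design consideration with this scheme: base64-encoding whole videos inflates the request by roughly a third, so for long videos a multipart upload or a shared-volume path might be worth considering instead.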
Thank you for the update and the detailed explanation! I have reviewed the changes you've made to the code and understand the newly implemented features. This solution aligns perfectly with what I was looking for, and I really appreciate the time and effort you've put into making these improvements. I'm looking forward to seeing it in action within CVAT.
Actions before raising this issue
Is your feature request related to a problem? Please describe.
The current serverless API in CVAT only supports processing individual frames, which limits its ability to handle tasks that require video input. My deep learning model needs to process entire videos, and the current frame-by-frame processing is not sufficient for this purpose.
Describe the solution you'd like
I would like the serverless API to be enhanced to support video input, allowing entire videos to be passed to the API rather than just individual frames. This would enable models that require video input for processing to function correctly within the CVAT serverless environment.
Describe alternatives you've considered
No response
Additional context
No response