cvat-ai / cvat

Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
https://cvat.ai
MIT License

Add Support for Video Input in Serverless API #8310

Closed WorkTimer closed 1 week ago

WorkTimer commented 4 weeks ago


Is your feature request related to a problem? Please describe.

The current serverless API in CVAT only supports processing individual frames, which limits its ability to handle tasks that require video input. My deep learning model needs to process entire videos, and the current frame-by-frame processing is not sufficient for this purpose.

Describe the solution you'd like

I would like the serverless API to be enhanced to support video input, allowing entire videos to be passed to the API rather than just individual frames. This would enable models that require video input for processing to function correctly within the CVAT serverless environment.

Describe alternatives you've considered

No response

Additional context

No response

Virajjai commented 4 weeks ago

Hi @WorkTimer, could you please explain this in a bit more detail, and share any code references if you have them?

WorkTimer commented 3 weeks ago

Thank you for your response!

In the current CVAT serverless API code, processing is done on individual image frames, as shown in the following handler from https://github.com/cvat-ai/cvat/blob/develop/serverless/pytorch/facebookresearch/sam/nuclio/main.py:

def handler(context, event):
    context.logger.info("call handler")
    data = event.body
    buf = io.BytesIO(base64.b64decode(data["image"]))
    image = Image.open(buf)
    image = image.convert("RGB")  # ensure the image is in RGB mode
    features = context.user_data.model.handle(image)
    return context.Response(body=json.dumps({
            'blob': base64.b64encode((features.cpu().numpy() if features.is_cuda else features.numpy())).decode(),
        }),
        headers={},
        content_type='application/json',
        status_code=200
    )

In this code, the API can only receive and process a single image frame.

However, the new SAM2 model supports predicting over an entire video at once, as demonstrated in https://github.com/facebookresearch/segment-anything-2/blob/main/notebooks/video_predictor_example.ipynb, which shows how to segment a complete video.

My suggestion is to enhance the CVAT serverless API to accept and process complete video files rather than just individual frames. This would allow models like SAM2, which are designed for video processing, to be directly integrated into CVAT for automatic segmentation and annotation of entire videos.
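
For reference, the video predictor in that notebook is driven roughly like this (a minimal sketch based on the example; the config name, checkpoint path, frame directory, and exact function names come from that notebook and may differ between SAM2 releases):

import numpy as np

from sam2.build_sam import build_sam2_video_predictor

# Placeholder config / checkpoint paths from the SAM2 example
predictor = build_sam2_video_predictor(
    "sam2_hiera_l.yaml",
    "./checkpoints/sam2_hiera_large.pt",
)

# init_state in the example takes a directory of JPEG frames extracted from the video
inference_state = predictor.init_state(video_path="./video_frames")

# Prompt one object with a single positive click on the first frame
_, out_obj_ids, out_mask_logits = predictor.add_new_points(
    inference_state=inference_state,
    frame_idx=0,
    obj_id=1,
    points=np.array([[210, 350]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),
)

# Propagate the prompt through the whole video in a single pass
video_segments = {}
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(inference_state):
    video_segments[frame_idx] = {
        obj_id: (mask_logits[i] > 0.0).cpu().numpy()
        for i, obj_id in enumerate(obj_ids)
    }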

Virajjai commented 3 weeks ago

This update to the CVAT serverless API enables processing of entire video files, segmenting each frame with SAM's automatic mask generator to produce an annotated video. The implementation follows the code structure and conventions found in CVAT's serverless/pytorch/facebookresearch/sam/nuclio/main.py.

import base64
import json
import os
import tempfile

import cv2
import numpy as np

def init_context(context):
    from segment_anything import SamAutomaticMaskGenerator
    from segment_anything import sam_model_registry

    model_type = "vit_b"
    checkpoint_path = "/opt/nuclio/sam_vit_b.pth"
    sam = sam_model_registry[model_type](checkpoint=checkpoint_path)
    mask_generator = SamAutomaticMaskGenerator(sam)
    context.user_data.model = mask_generator

def handler(context, event):
    context.logger.info("Handling request for video segmentation")

    try:
        # Decode the base64 video and write it to a temporary file, since
        # cv2.VideoCapture cannot read directly from an in-memory buffer
        data = event.body
        video_data = base64.b64decode(data["video"])
        with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as input_file:
            input_file.write(video_data)
            input_path = input_file.name

        # Read video frames using OpenCV
        video = cv2.VideoCapture(input_path)
        frames = []
        while video.isOpened():
            ret, frame = video.read()
            if not ret:
                break
            rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(rgb_frame)
        video.release()
        os.remove(input_path)

        # Process frames with SAM model
        context.logger.info(f"Processing {len(frames)} frames")
        segmented_frames = []
        for frame in frames:
            masks = context.user_data.model.generate(frame)
            mask_image = np.zeros_like(frame)
            # SamAutomaticMaskGenerator masks carry no class ids, so label
            # each mask with a distinct integer instead
            for idx, mask in enumerate(masks, start=1):
                mask_image[mask['segmentation']] = idx
            segmented_frames.append(mask_image)

        # Encode the segmented frames into a video; cv2.VideoWriter also
        # needs a file path, so write to a temporary file and read it back
        height, width, _ = segmented_frames[0].shape
        with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as output_file:
            output_path = output_file.name
        video_writer = cv2.VideoWriter(
            output_path, cv2.VideoWriter_fourcc(*'mp4v'), 24, (width, height)
        )
        for frame in segmented_frames:
            video_writer.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))
        video_writer.release()

        # Base64 encode the output video for the JSON response
        with open(output_path, "rb") as f:
            encoded_video = base64.b64encode(f.read()).decode()
        os.remove(output_path)

        return context.Response(
            body=json.dumps({'video': encoded_video}),
            headers={},
            content_type='application/json',
            status_code=200
        )

    except Exception as e:
        context.logger.error(f"Error processing video: {str(e)}")
        return context.Response(
            body=json.dumps({'error': str(e)}),
            headers={},
            content_type='application/json',
            status_code=500
        )

1. init_context(context): loads the SAM checkpoint once at startup and stores the SamAutomaticMaskGenerator in context.user_data.model.

2. handler(context, event): decodes the incoming base64 video, segments it frame by frame, and returns the segmented result.

3. OpenCV Integration: cv2.VideoCapture splits the input video into frames and cv2.VideoWriter assembles the segmented frames back into a video.

4. Base64 Encoding: the video is received and returned as base64 strings in the JSON body, matching how the existing functions pass single images.

5. Model Output Handling: each mask produced by the generator is written into an empty image with a distinct integer label.
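
For reference, a client could invoke the deployed function roughly like this (a hedged sketch; the endpoint URL, port, and file names are placeholders that depend on how the nuclio function is deployed):

import base64

import requests

# Hypothetical nuclio endpoint; the actual host and port depend on the deployment
FUNCTION_URL = "http://localhost:32768"

with open("input.mp4", "rb") as f:
    payload = {"video": base64.b64encode(f.read()).decode()}

response = requests.post(FUNCTION_URL, json=payload, timeout=600)
response.raise_for_status()

# The handler returns the segmented video as a base64 string under "video"
with open("segmented.mp4", "wb") as f:
    f.write(base64.b64decode(response.json()["video"]))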

I have made some changes; could you review them? I'm open to any suggestions.

WorkTimer commented 1 week ago

Thank you for the update and the detailed explanation! I have reviewed the changes you've made to the code and understand the newly implemented features. This solution aligns perfectly with what I was looking for, and I really appreciate the time and effort you've put into making these improvements. I'm looking forward to seeing it in action within CVAT.