cvat-ai / cvat

Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
https://cvat.ai
MIT License

Add Support for Video Input in Serverless API #8310

Closed WorkTimer closed 1 week ago

WorkTimer commented 4 weeks ago


Is your feature request related to a problem? Please describe.

The current serverless API in CVAT only supports processing individual frames, which limits its ability to handle tasks that require video input. My deep learning model needs to process entire videos, and the current frame-by-frame processing is not sufficient for this purpose.

Describe the solution you'd like

I would like the serverless API to be enhanced to support video input, allowing entire videos to be passed to the API rather than just individual frames. This would enable models that require video input for processing to function correctly within the CVAT serverless environment.

Describe alternatives you've considered

No response

Additional context

No response

Virajjai commented 4 weeks ago

Hi @WorkTimer, could you please explain this in a bit more detail, and share any code references if you have them?

WorkTimer commented 3 weeks ago

Thank you for your response!

In the current CVAT serverless API code, processing is done on individual image frames, as shown in the following handler from https://github.com/cvat-ai/cvat/blob/develop/serverless/pytorch/facebookresearch/sam/nuclio/main.py:

def handler(context, event):
    context.logger.info("call handler")
    data = event.body
    buf = io.BytesIO(base64.b64decode(data["image"]))
    image = Image.open(buf)
    image = image.convert("RGB")  # ensure the image is in RGB mode
    features = context.user_data.model.handle(image)
    return context.Response(body=json.dumps({
            'blob': base64.b64encode((features.cpu().numpy() if features.is_cuda else features.numpy())).decode(),
        }),
        headers={},
        content_type='application/json',
        status_code=200
    )

In this code, the API can only receive and process a single image frame.

However, the new SAM2 model supports predicting over an entire video at once, as demonstrated in https://github.com/facebookresearch/segment-anything-2/blob/main/notebooks/video_predictor_example.ipynb, which shows how to segment a complete video.

My suggestion is to enhance the CVAT serverless API to accept and process complete video files rather than just individual frames. This would allow models like SAM2, which are designed for video processing, to be directly integrated into CVAT for automatic segmentation and annotation of entire videos.
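
For reference, the video predictor in that notebook is driven roughly like this (a minimal sketch based on the example; the config name, checkpoint path, frame directory, and exact function names come from that notebook and may differ between SAM2 releases):

import numpy as np

from sam2.build_sam import build_sam2_video_predictor

# Placeholder config / checkpoint paths from the SAM2 example
predictor = build_sam2_video_predictor(
    "sam2_hiera_l.yaml",
    "./checkpoints/sam2_hiera_large.pt",
)

# init_state in the example takes a directory of JPEG frames extracted from the video
inference_state = predictor.init_state(video_path="./video_frames")

# Prompt one object with a single positive click on the first frame
_, out_obj_ids, out_mask_logits = predictor.add_new_points(
    inference_state=inference_state,
    frame_idx=0,
    obj_id=1,
    points=np.array([[210, 350]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),
)

# Propagate the prompt through the whole video in a single pass
video_segments = {}
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(inference_state):
    video_segments[frame_idx] = {
        obj_id: (mask_logits[i] > 0.0).cpu().numpy()
        for i, obj_id in enumerate(obj_ids)
    }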

Virajjai commented 3 weeks ago

This update to the CVAT serverless API enables processing of entire video files, segmenting each frame with SAM's automatic mask generator to produce an annotated video. The implementation follows the code structure and conventions found in CVAT's serverless/pytorch/facebookresearch/sam/nuclio/main.py.

import base64
import json
import os
import tempfile

import cv2
import numpy as np

def init_context(context):
    from segment_anything import SamAutomaticMaskGenerator
    from segment_anything import sam_model_registry

    model_type = "vit_b"
    checkpoint_path = "/opt/nuclio/sam_vit_b.pth"
    sam = sam_model_registry[model_type](checkpoint=checkpoint_path)
    mask_generator = SamAutomaticMaskGenerator(sam)
    context.user_data.model = mask_generator

def handler(context, event):
    context.logger.info("Handling request for video segmentation")

    try:
        # Decode the base64 video and write it to a temporary file, since
        # cv2.VideoCapture cannot read directly from an in-memory buffer
        data = event.body
        video_data = base64.b64decode(data["video"])
        with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as input_file:
            input_file.write(video_data)
            input_path = input_file.name

        # Read video frames using OpenCV
        video = cv2.VideoCapture(input_path)
        frames = []
        while video.isOpened():
            ret, frame = video.read()
            if not ret:
                break
            rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(rgb_frame)
        video.release()
        os.remove(input_path)

        # Process frames with SAM model
        context.logger.info(f"Processing {len(frames)} frames")
        segmented_frames = []
        for frame in frames:
            masks = context.user_data.model.generate(frame)
            mask_image = np.zeros_like(frame)
            # SamAutomaticMaskGenerator masks carry no class ids, so label
            # each mask with a distinct integer instead
            for idx, mask in enumerate(masks, start=1):
                mask_image[mask['segmentation']] = idx
            segmented_frames.append(mask_image)

        # Encode the segmented frames into a video; cv2.VideoWriter also
        # needs a file path, so write to a temporary file and read it back
        height, width, _ = segmented_frames[0].shape
        with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as output_file:
            output_path = output_file.name
        video_writer = cv2.VideoWriter(
            output_path, cv2.VideoWriter_fourcc(*'mp4v'), 24, (width, height)
        )
        for frame in segmented_frames:
            video_writer.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))
        video_writer.release()

        # Base64 encode the output video for the JSON response
        with open(output_path, "rb") as f:
            encoded_video = base64.b64encode(f.read()).decode()
        os.remove(output_path)

        return context.Response(
            body=json.dumps({'video': encoded_video}),
            headers={},
            content_type='application/json',
            status_code=200
        )

    except Exception as e:
        context.logger.error(f"Error processing video: {str(e)}")
        return context.Response(
            body=json.dumps({'error': str(e)}),
            headers={},
            content_type='application/json',
            status_code=500
        )

1. init_context(context): loads the SAM checkpoint once at startup and stores the SamAutomaticMaskGenerator in context.user_data.model.

2. handler(context, event): decodes the incoming base64 video, segments it frame by frame, and returns the segmented result.

3. OpenCV Integration: cv2.VideoCapture splits the input video into frames and cv2.VideoWriter assembles the segmented frames back into a video.

4. Base64 Encoding: the video is received and returned as base64 strings in the JSON body, matching how the existing functions pass single images.

5. Model Output Handling: each mask produced by the generator is written into an empty image with a distinct integer label.
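
For reference, a client could invoke the deployed function roughly like this (a hedged sketch; the endpoint URL, port, and file names are placeholders that depend on how the nuclio function is deployed):

import base64

import requests

# Hypothetical nuclio endpoint; the actual host and port depend on the deployment
FUNCTION_URL = "http://localhost:32768"

with open("input.mp4", "rb") as f:
    payload = {"video": base64.b64encode(f.read()).decode()}

response = requests.post(FUNCTION_URL, json=payload, timeout=600)
response.raise_for_status()

# The handler returns the segmented video as a base64 string under "video"
with open("segmented.mp4", "wb") as f:
    f.write(base64.b64decode(response.json()["video"]))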

I have made some changes; could you review them? I'm open to any suggestions.

WorkTimer commented 1 week ago

Thank you for the update and the detailed explanation! I have reviewed the changes you've made to the code and understand the newly implemented features. This solution aligns perfectly with what I was looking for, and I really appreciate the time and effort you've put into making these improvements. I'm looking forward to seeing it in action within CVAT.