HumanSignal / label-studio

Label Studio is a multi-type data labeling and annotation tool with standardized output format
https://labelstud.io
Apache License 2.0

Video Object Tracking Annotations can't export to YOLOv5 #3405

Open djwhatle opened 1 year ago

djwhatle commented 1 year ago

Is your feature request related to a problem? Please describe. I noticed that I can't export labels from a video object tracking task to YOLOv5 format. I think this is because YOLOv5 exports are only possible with Rectangle Annotations.

Describe the solution you'd like It seems like it would be feasible to slice the video frames and the interpolated rectangle dimensions into the format expected by YOLOv5, which would make the workflow pretty nice for getting a lot of labeled frames. Any plans to integrate this in the future? Or perhaps this was an oversight, or there's a technical limitation I'm not seeing?

Describe alternatives you've considered If there's a way to use the Label Studio converter to convert object tracking JSON to YOLOv5 format, I suppose that would solve the problem as well, while being a little less convenient.

I suppose there might also be a model other than YOLOv5 that I'm not aware of, one that can readily accept the JSON data format provided by Label Studio for video object tracking and run with it.
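
For reference, YOLOv5 expects one .txt label file per image, with one line per box in the form <class_id> <x_center> <y_center> <width> <height>, where all coordinates are normalized to 0-1. An illustrative label file (values made up):

0 0.512 0.430 0.120 0.250
2 0.250 0.600 0.080 0.100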

makseq commented 1 year ago

YOLOv5 export is not supported for video. It's unclear why you need YOLOv5 export: rectangles will be mixed across frames, because the YOLO format doesn't store rectangle IDs, does it?

jpkoponen commented 1 year ago

I also came across this issue. I'm not certain of the context behind @makseq's comment, but I'll try to provide more clarity. Label Studio currently only allows for the extraction of key frames from video object tracking. On the other hand, the YOLO format doesn't consider frame IDs, but instead requires the rectangle location and dimensions for each frame. YOLO is a popular method for video object tracking. It would be beneficial if Label Studio had the capability to export or convert to the YOLO format, eliminating the need for us to individually implement linear interpolation and reformatting.
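
For context, filling in the frames between two keyframes is plain linear interpolation. A minimal sketch (assuming the field names from Label Studio's video export, where frame is 1-based and x/y/width/height are percentages of the frame size):

def interpolate_box(p0: dict, p1: dict, frame: int) -> dict:
    # Linear weight of `frame` between the two keyframes (p0['frame'] <= frame <= p1['frame'])
    t = (frame - p0['frame']) / (p1['frame'] - p0['frame'])
    return {k: p0[k] + t * (p1[k] - p0[k]) for k in ('x', 'y', 'width', 'height')}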

Busterfake commented 1 year ago

Hello, I just discovered Label Studio and was using labelImg previously. When I saw that Label Studio can label videos directly, I was very hyped. Unfortunately, it seems that you cannot export a YOLOv5 format from a video. I would be very interested if that could be possible in the future. It just needs to export an image for each frame ID with the associated label. Any information on whether it is planned to be added soon?

makseq commented 1 year ago

@jpkoponen @Busterfake Thank you for your feedback, we will discuss it with our team. However, is it okay to lose bbox IDs during the export to YOLO? You won't have relations between bboxes across frames.

jpkoponen commented 1 year ago

@makseq Yes, it is okay to lose the bbox ids. Sorry for misusing the term "object tracking" in my previous reply. YOLO is primarily used for object detection and recognition in videos, rather than object tracking. This means the focus is on recognizing and locating objects in each frame individually, rather than keeping track of specific objects across frames. Label Studio's object tracking feature is useful for annotating those videos, but the bbox ids are not necessary for YOLO.

Busterfake commented 1 year ago

@makseq @jpkoponen Indeed, we don't need the bbox ids. What we ideally need from the YOLO export is a .txt for each frame of the video (or every several frames) and a screenshot of each frame with the associated .txt. Correct me if I'm wrong, but as far as I know we cannot train a YOLO model on videos directly, so it would be very helpful if we could get the .txt files and the images when we export with YOLO. For my dataset, I have videos from which I extract an image every frame, and then I use labelImg to label these images. It's very time consuming and not very fun to do. I'm pretty sure that if your video object tracking annotations can be made to work for YOLO export, it will help a lot of people and save a lot of time :)
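
For reference, the layout YOLOv5 trains on pairs each extracted frame with a label file of the same base name, roughly like this (the frame_* names are just illustrative):

dataset/
  images/
    frame_0001.jpg
    frame_0002.jpg
  labels/
    frame_0001.txt
    frame_0002.txt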

jpkoponen commented 1 year ago

While waiting for the feature, I made a short and unoptimized script for converting the labels to YOLO format. Maybe it would be useful for someone like @Busterfake or @djwhatle.

Edit Jan 26th 2023: I hadn't taken into account that YOLO uses the center of the object for its coordinates while Label Studio uses the top-left corner. I updated the code to take this into account.

labelstudio_to_yolo.py

import json
import os
from pathlib import Path
import argparse

def labelstudio_labels_to_yolo(labelstudio_labels_path: str, label_names_path: str, output_dir_path: str) -> None:
    with open(label_names_path, 'r') as f:
        label_names = f.read().split('\n')
    print('Label names:', label_names)
    with open(labelstudio_labels_path, 'r') as f:
        labelstudio_labels_json = f.read()
    labels = json.loads(labelstudio_labels_json)[0]
    # every box stores the frame count of the whole video so we get it from the first box
    frames_count = labels['annotations'][0]['result'][0]['value']['framesCount']

    yolo_labels = [[] for _ in range(frames_count)]
    # iterate through boxes
    for box in labels['annotations'][0]['result']:
        label_numbers = [label_names.index(label) for label in box['value']['labels']]
        # iterate through keypoints (we omit the last keypoint because no interpolation after that)
        for i, keypoint in enumerate(box['value']['sequence'][:-1]):
            start_point = keypoint
            end_point = box['value']['sequence'][i + 1]
            start_frame = start_point['frame']
            end_frame = end_point['frame']

            n_frames_between = end_frame - start_frame
            delta_x = (end_point['x'] - start_point['x']) / n_frames_between
            delta_y = (end_point['y'] - start_point['y']) / n_frames_between
            delta_width = (end_point['width'] - start_point['width']) / n_frames_between
            delta_height = (end_point['height'] - start_point['height']) / n_frames_between

            # In YOLO, x and y are in the center of the box. In Label Studio, x and y are in the corner of the box.
            x = start_point['x'] + start_point['width'] / 2
            y = start_point['y'] + start_point['height'] / 2
            width = start_point['width']
            height = start_point['height']
            # iterate through frames between two keypoints
            for frame in range(start_frame, end_frame):
                # Support for multilabel
                yolo_labels = _append_to_yolo_labels(yolo_labels, frame, label_numbers, x, y, width, height)
                x += delta_x + delta_width / 2
                y += delta_y + delta_height / 2
                width += delta_width
                height += delta_height
            # Sanity check: the running values should now match the end keypoint (up to float error)
            epsilon = 1e-5
            assert abs(x - (end_point['x'] + end_point['width'] / 2)) <= epsilon, f'x does not match: {x} vs {end_point["x"] + end_point["width"] / 2}'
            assert abs(y - (end_point['y'] + end_point['height'] / 2)) <= epsilon, f'y does not match: {y} vs {end_point["y"] + end_point["height"] / 2}'
            assert abs(width - end_point['width']) <= epsilon, f'width does not match: {width} vs {end_point["width"]}'
            assert abs(height - end_point['height']) <= epsilon, f'height does not match: {height} vs {end_point["height"]}'

        # Handle the last keypoint: the interpolation loop above stops one frame short of it
        last_point = box['value']['sequence'][-1]
        x = last_point['x'] + last_point['width'] / 2
        y = last_point['y'] + last_point['height'] / 2
        yolo_labels = _append_to_yolo_labels(
            yolo_labels, last_point['frame'], label_numbers, x, y, last_point['width'], last_point['height'])

    if not os.path.exists(output_dir_path):
        os.makedirs(output_dir_path)
        print(f'Directory did not exist. Created {output_dir_path}')
    for frame, frame_labels in enumerate(yolo_labels):
        if frame % 1000 == 0:
            print(f'Writing labels for frame {frame}')
        padded_frame_number = str(frame).zfill(len(str(len(yolo_labels))))
        file_path = Path(output_dir_path) / f'frame_{padded_frame_number}.txt'
        text = ''
        for label in frame_labels:
            text += ' '.join(map(str, label)) + '\n'
        with open(file_path, 'w') as f:
            f.write(text)
    print(f'Done. Wrote labels for {frame + 1} frames.')

def _append_to_yolo_labels(yolo_labels: list, frame: int, label_numbers: list, x, y, width, height):
    for label_number in label_numbers:
        # current_frame-1 because Label Studio index starts from 1
        yolo_labels[frame-1].append(
            [label_number, x / 100, y / 100, width / 100, height / 100])
    return yolo_labels

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', '-i', help='Path of the .json file containing the Label Studio export.', required=True)
    parser.add_argument('--names', '-n', help='Path of the .txt file containing (case sensitive) label names, one per line.',
                        required=True)
    parser.add_argument('--output', '-o', help='Path of the output directory of .txt files containing YOLO labels.',
                        required=True)

    args = parser.parse_args()

    labelstudio_labels_to_yolo(args.input, args.names, args.output)
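
A usage example, assuming the Label Studio JSON export was saved as export.json and the class names (one per line) as classes.txt (both file names are placeholders):

python labelstudio_to_yolo.py -i export.json -n classes.txt -o yolo_labels/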

I also made a script to convert the video to images at your desired frame rate, since I accidentally labeled a 30 fps video at 25 fps in Label Studio. :)

video_to_images.py

import cv2
from tqdm import tqdm
from pathlib import Path
import argparse
import os

def video_to_images(video_path: str, images_dir_path: str, target_frame_rate: float):
    def _save_frame(frame_number, img):
        padded_frame_number = str(frame_number).zfill(len(str(target_length)))
        save_path = Path(images_dir_path) / f'frame_{padded_frame_number}.jpg'
        cv2.imwrite(str(save_path), img)

    if not os.path.exists(images_dir_path):
        os.makedirs(images_dir_path)
        print(f'Directory did not exist. Created {images_dir_path}')
    cap = cv2.VideoCapture(video_path)

    orig_frame_rate = cap.get(cv2.CAP_PROP_FPS)
    orig_length = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    target_length = round(orig_length / orig_frame_rate * target_frame_rate)
    print(f'Orig frames: {orig_length}')
    print(f'Target frames: {target_length}')
    pbar = tqdm(total=target_length)

    # Save the first frame, then sample the rest at the target frame rate
    frame_id = 0
    success, frame = cap.read()
    if success:
        _save_frame(frame_id, frame)
    last_position_s = 0
    while True:
        success, frame = cap.read()
        if not success:
            break
        position_s = float(cap.get(cv2.CAP_PROP_POS_MSEC)) / 1000
        delta_s = position_s - last_position_s
        if delta_s >= (1 / target_frame_rate):
            frame_id += 1
            _save_frame(frame_id, frame)
            pbar.update(1)
            last_position_s += 1 / target_frame_rate

    cap.release()
    print(f'Done. Wrote video from {video_path} to {images_dir_path}. Last frame was {frame_id}.')

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', '-i', help='Path of the video file.',
                        required=True)
    parser.add_argument('--output', '-o', help='Path of the output directory of .jpg files.',
                        required=True)
    parser.add_argument('--frame-rate', '-fr', help='Target framerate.', required=True)
    args = parser.parse_args()

    video_to_images(args.input, args.output, float(args.frame_rate))
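
A usage example, with my_video.mp4 as a placeholder input, extracting frames at 25 fps:

python video_to_images.py -i my_video.mp4 -o frames/ -fr 25
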
tsukasagenesis commented 1 year ago

To add to this discussion, it would be great if you could add support for video in "object detection mode", and not only in "video tracking mode": if the input is a video, convert only the selected keyframes into images.

It would be great for managing all images and videos in one big dataset.

makseq commented 1 year ago

@tsukasagenesis

support for video in "object detection mode", and not only in "video tracking mode": if the input is a video, convert only the selected keyframes into images.

Could you please clarify this more?

deepinvalue commented 1 year ago

@djwhatle @Busterfake @jpkoponen Just in case, I had also developed a script to enable key-frame annotations directly on videos and export the labels and frames in a YOLO-compatible format, including interpolation of bounding box coordinates. The script is now accessible in this repository: deepinvalue/video_annotations_to_yolo

shure-dev commented 1 year ago

Hi, I hope this issue is still alive. In conclusion, which export format and deep learning model should be used for video object tracking tasks in the easiest way? I don't care if it's not YOLO. I annotated my video with Label Studio; what is the easiest way to train a model and predict? I don't want to implement additional code. What is the efficient way? Maybe we shouldn't use Label Studio for video object tracking tasks? If so, what are the other possible options for an annotation tool?

Do we have to use the code @jpkoponen provided?

NWalker4483 commented 2 months ago

A bit old, but the supplied version didn't fully work for me, so here's one that'll convert multiple videos to COCO format: https://github.com/NWalker4483/LabelStudioVideo/