argoverse / argoverse-api

Official GitHub repository for Argoverse dataset
https://www.argoverse.org

2D Bounding Boxes Labels #144

Closed mhusseinsh closed 4 years ago

mhusseinsh commented 4 years ago

Hello, I downloaded the dataset, and I want to run an object detector on the images. Is there a way to extract the ground-truth 2D bounding boxes of vehicles, pedestrians, and so on from the labels?

johnwlambert commented 4 years ago

Hi @mhusseinsh, thanks for your interest in our work. We discuss this briefly in https://github.com/argoai/argoverse-api/issues/19.

If you use the following code snippet on all data splits, you can expect to get >1 million 2d bounding boxes (we have ~1M cuboid instances). Our 9 cameras have some overlap in field of view (especially the stereo cameras and ring_front_center), so multiple 2d bounding boxes are possible for a single 3d cuboid.

2d bounding boxes come in several different flavors. For example, would you want "amodal" or "modal" boxes? COCO 2d bboxes are not amodal, and were likely derived from segmentation masks. We annotated amodal 3d cuboids, so if a vehicle is half-occluded behind another vehicle, our naive 2d bounding box will hallucinate the invisible part of that vehicle, even though the sensor data cannot observe it fully (white truck on right): ring_front_center_315966776606736312. However, you could use the LiDAR to form a depth map and reason about occlusion via relative ordering in the depth map.
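
As a rough illustration of that idea (this helper is not part of the API; it simply assumes you already have the cuboid vertices in the camera frame and the LiDAR returns projected into the same image, e.g. via project_lidar_to_img_motion_compensated):

import numpy as np

def box_appears_occluded(cuboid_cam: np.ndarray, lidar_uv: np.ndarray, lidar_depth: np.ndarray, bbox_2d) -> bool:
    """Hypothetical helper: True if LiDAR returns inside the 2d box lie well in front of the cuboid.

    cuboid_cam: (8, 3) cuboid vertices in the camera frame (z is depth).
    lidar_uv: (M, 2) LiDAR returns projected into the image.
    lidar_depth: (M,) camera-frame depth of each projected return.
    bbox_2d: (x1, y1, x2, y2) amodal box derived from the cuboid.
    """
    x1, y1, x2, y2 = bbox_2d
    in_box = (
        (lidar_uv[:, 0] >= x1) & (lidar_uv[:, 0] <= x2)
        & (lidar_uv[:, 1] >= y1) & (lidar_uv[:, 1] <= y2)
    )
    if in_box.sum() == 0:
        return False
    nearest_cuboid_depth = cuboid_cam[:, 2].min()
    # If the median depth of the returns inside the box is well in front of the cuboid,
    # another object is likely occluding it (the 1 m margin is arbitrary).
    return float(np.median(lidar_depth[in_box])) < nearest_cuboid_depth - 1.0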

Most of our 3d cuboids are very tight-fitting, but a few will be slightly taller than expected (white sedan on left): ring_front_center_315966775507836712

johnwlambert commented 4 years ago

import argparse
import copy
import glob
import logging
import math
import multiprocessing
import os
import sys
from pathlib import Path
from typing import Any, Iterable, List, Mapping, Sequence, Tuple, Union

import cv2
import imageio
import matplotlib.pyplot as plt
import numpy as np

import argoverse
from argoverse.utils import calibration
from argoverse.data_loading.object_label_record import json_label_dict_to_obj_record
from argoverse.data_loading.simple_track_dataloader import SimpleArgoverseTrackingDataLoader
from argoverse.map_representation.map_api import ArgoverseMap
from argoverse.utils.calibration import (
    CameraConfig,
    get_calibration_config,
    point_cloud_to_homogeneous,
    project_lidar_to_img_motion_compensated,
    project_lidar_to_undistorted_img,
)
from argoverse.utils.camera_stats import (
    RING_CAMERA_LIST,
    STEREO_CAMERA_LIST,
    RING_IMG_HEIGHT,
    RING_IMG_WIDTH
)
from argoverse.utils.city_visibility_utils import clip_point_cloud_to_visible_region
from argoverse.utils.cv2_plotting_utils import draw_clipped_line_segment
from argoverse.utils.ffmpeg_utils import write_nonsequential_idx_video
from argoverse.utils.frustum_clipping import (
    generate_frustum_planes,
    cuboid_to_2d_frustum_bbox
)
from argoverse.utils.ply_loader import load_ply
from argoverse.utils.se3 import SE3

# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logger = logging.getLogger(__name__)

#: Any numeric type
Number = Union[int, float]

def plot_img_2d_bboxes(
    labels,
    planes,
    img_bgr: np.ndarray,
    log_calib_data,
    camera_name: str,
    cam_timestamp: int,
    lidar_timestamp: int,
    data_dir: str,
    log_id: str,
    save_img_fpath: str
):
    """ """
    for label_idx, label in enumerate(labels):
        obj_rec = json_label_dict_to_obj_record(label)
        if obj_rec.occlusion == 100:
            continue

        cuboid_vertices = obj_rec.as_3d_bbox()
        points_h = point_cloud_to_homogeneous(cuboid_vertices).T

        uv, uv_cam, valid_pts_bool, camera_config = project_lidar_to_img_motion_compensated(
            points_h,  # these are recorded at lidar_time
            copy.deepcopy(log_calib_data),
            camera_name,
            cam_timestamp,
            lidar_timestamp,
            data_dir,
            log_id,
            return_K=True,
        )
        K = camera_config.intrinsic

        # if valid_pts_bool.sum() == 0:
        #     continue
        bbox_2d = cuboid_to_2d_frustum_bbox(uv_cam.T[:,:3], planes, K[:3,:3])
        if bbox_2d is None:
            continue
        else:
            x1,y1,x2,y2 = bbox_2d

            x1 = min(x1,RING_IMG_WIDTH-1)
            x2 = min(x2,RING_IMG_WIDTH-1)
            y1 = min(y1,RING_IMG_HEIGHT-1)
            y2 = min(y2,RING_IMG_HEIGHT-1)

            x1 = max(x1, 0)
            x2 = max(x2, 0)
            y1 = max(y1, 0)
            y2 = max(y2, 0)

            plt.plot([x1,x1],[y1,y2], 'r')
            plt.plot([x1,x2],[y1,y1], 'r')
            plt.plot([x1,x2],[y2,y2], 'r')
            plt.plot([x2,x2],[y1,y2], 'r')

    plt.imshow(img_bgr[:,:,::-1])
    plt.savefig(save_img_fpath)
    plt.close('all')

    #cv2.imwrite(save_img_fpath, img_bgr)

def dump_log_2d_bboxes_to_imgs(
    log_ids: Sequence[str],
    max_num_images_to_render: int,
    data_dir: str,
    experiment_prefix: str,
    motion_compensate: bool = True,
) -> List[str]:
    """
    We bring the 3D points into each camera coordinate system, and do the clipping there in 3D.

    Args:
        log_ids: A list of log IDs
        max_num_images_to_render: maximum numbers of images to render.
        data_dir: path to dataset with the latest data
        experiment_prefix: Output directory
        motion_compensate: Whether to motion compensate when projecting

    Returns:
        saved_img_fpaths
    """
    saved_img_fpaths = []
    dl = SimpleArgoverseTrackingDataLoader(data_dir=data_dir, labels_dir=data_dir)
    avm = ArgoverseMap()

    for log_id in log_ids:
        save_dir = f"{experiment_prefix}_{log_id}"
        if not Path(save_dir).exists():
            os.makedirs(save_dir)

        city_name = dl.get_city_name(log_id)
        log_calib_data = dl.get_log_calibration_data(log_id)

        flag_done = False
        for cam_idx, camera_name in enumerate(RING_CAMERA_LIST + STEREO_CAMERA_LIST):

            if camera_name != 'ring_front_center':
                continue

            cam_im_fpaths = dl.get_ordered_log_cam_fpaths(log_id, camera_name)
            for i, im_fpath in enumerate(cam_im_fpaths):
                if i % 50 == 0:
                    logging.info("\tOn file %s of camera %s of %s", i, camera_name, log_id)

                cam_timestamp = Path(im_fpath).stem.split("_")[-1]
                cam_timestamp = int(cam_timestamp)

                # load PLY file path, e.g. 'PC_315978406032859416.ply'
                ply_fpath = dl.get_closest_lidar_fpath(log_id, cam_timestamp)
                if ply_fpath is None:
                    continue
                lidar_pts = load_ply(ply_fpath)
                save_img_fpath = f"{save_dir}/{camera_name}_{cam_timestamp}.jpg"
                if Path(save_img_fpath).exists():
                    continue

                city_to_egovehicle_se3 = dl.get_city_to_egovehicle_se3(log_id, cam_timestamp)
                if city_to_egovehicle_se3 is None:
                    continue

                lidar_timestamp = Path(ply_fpath).stem.split("_")[-1]
                lidar_timestamp = int(lidar_timestamp)
                labels = dl.get_labels_at_lidar_timestamp(log_id, lidar_timestamp)
                if labels is None:
                    logging.info("\tLabels missing at t=%s", lidar_timestamp)
                    continue

                # Swap channel order as OpenCV expects it -- BGR not RGB
                # must make a copy to make memory contiguous
                img_bgr = imageio.imread(im_fpath)[:, :, ::-1].copy()
                camera_config = get_calibration_config(log_calib_data, camera_name)
                planes = generate_frustum_planes(camera_config.intrinsic.copy(), camera_name)

                plot_img_2d_bboxes(labels, planes, img_bgr, log_calib_data, camera_name, cam_timestamp, lidar_timestamp, data_dir, log_id, save_img_fpath)
                saved_img_fpaths.append(save_img_fpath)

                if i > 100:
                    break

        category_subdir = "2d_amodal_labels_100fr"

        if not Path(f"{experiment_prefix}_{category_subdir}").exists():
            os.makedirs(f"{experiment_prefix}_{category_subdir}")

        for cam_idx, camera_name in enumerate(RING_CAMERA_LIST + STEREO_CAMERA_LIST):
            # Write the cuboid video -- could also write w/ fps=20,30,40
            if "stereo" in camera_name:
                fps = 5
            else:
                fps = 30
            img_wildcard = f"{save_dir}/{camera_name}_%*.jpg"
            output_fpath = f"{experiment_prefix}_{category_subdir}/{log_id}_{camera_name}_{fps}fps.mp4"
            write_nonsequential_idx_video(img_wildcard, output_fpath, fps)

    return saved_img_fpaths

def main(args: Any):
    """Run the example."""
    log_ids = [log_id.strip() for log_id in args.log_ids.split(",")]
    dump_log_2d_bboxes_to_imgs(
        log_ids, args.max_num_images_to_render * 9, args.dataset_dir, args.experiment_prefix
    )

if __name__ == "__main__":
    # Parse command line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--max-num-images-to-render", default=5, type=int, help="number of images within which to render 3d cuboids"
    )
    parser.add_argument("--dataset-dir", type=str, required=True, help="path to the dataset folder")
    parser.add_argument(
        "--log-ids",
        type=str,
        required=True,
        help="comma separated list of log ids, each log_id represents a log directory, e.g. found at "
        " {args.dataset-dir}/argoverse-tracking/train/{log_id} or "
        " {args.dataset-dir}/argoverse-tracking/sample/{log_id} or ",
    )
    parser.add_argument(
        "--experiment-prefix",
        default="output",
        type=str,
        help="results will be saved in a folder with this prefix for its name",
    )
    args = parser.parse_args()
    logger.info(args)

    if args.log_ids is None:
        logger.error(f"Please provide a comma seperated list of log ids")
        raise ValueError(f"Please provide a comma seperated list of log ids")

    main(args)
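
To run this over every log in a split (assuming the usual {dataset-dir}/argoverse-tracking/{split}/{log_id} layout), you could build the --log-ids argument with a small helper like the one below (purely illustrative):

# illustrative only: collect all log ids in one split to pass to --log-ids
from pathlib import Path

split_dir = Path("argoverse-tracking/train")
log_ids = ",".join(p.name for p in split_dir.iterdir() if p.is_dir())
print(log_ids)
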
mhusseinsh commented 4 years ago

Hello @johnwlambert Thank you so much for your help.

But this code is only for visualizations. What I am seeking is having an appropriate GT txt files for each image, containing the box coordinates and the class category. What I see here in this code, that we don't have any information anywhere on the class category of the drawn boxes

johnwlambert commented 4 years ago

Hi @mhusseinsh, you're welcome. To save the ground truth, I recommend JSON over txt for its structured format. You can save (x1, y1, x2, y2) directly to disk after they are computed for each object in the script above, and the object category is available on the ObjectLabelRecord instance as obj_rec.label_class.
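
For example, a minimal sketch (the file layout here is just an illustration, not part of the API) that collects one record per object inside plot_img_2d_bboxes and writes a JSON file alongside each rendered image:

import json

# inside plot_img_2d_bboxes(), before the loop over labels:
frame_annotations = []

# ... inside the loop, after (x1, y1, x2, y2) have been clipped for each object:
frame_annotations.append(
    {
        "category": obj_rec.label_class,  # e.g. "VEHICLE", "PEDESTRIAN"
        "track_id": obj_rec.track_id,
        "bbox_xyxy": [float(x1), float(y1), float(x2), float(y2)],
    }
)

# ... once per image, next to save_img_fpath:
gt_json_fpath = save_img_fpath.replace(".jpg", ".json")
with open(gt_json_fpath, "w") as f:
    json.dump(frame_annotations, f, indent=2)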

mhusseinsh commented 4 years ago

Hi @johnwlambert, I ran the following command: python3 extractGroundTruth.py --dataset-dir argoverse-tracking/val/ --log-ids 033669d3-3d6b-3d3d-bd93-7985d86653ea

A folder named output_033669d3-3d6b-3d3d-bd93-7985d86653ea is created, which contains images; a sample is shown below.

[sample rendered image]

But afterwards it gave an error related to video creation, and it only created 102 of the 900 images found in the sequence.

johnwlambert commented 4 years ago

Hi @mhusseinsh, in the code above, I break out of the full loop early just to generate short videos as an illustration:

                if i > 100:
                    break

If you comment out those lines, the script will render all images. You can comment out/remove the following image rendering lines if you only want the 2d ground truth:

            plt.plot([x1,x1],[y1,y2], 'r')
            plt.plot([x1,x2],[y1,y1], 'r')
            plt.plot([x1,x2],[y2,y2], 'r')
            plt.plot([x2,x2],[y1,y2], 'r')

    plt.imshow(img_bgr[:,:,::-1])
    plt.savefig(save_img_fpath)
    plt.close('all')
mhusseinsh commented 4 years ago

@johnwlambert Thanks a lot for your help. It now works perfectly fine, except that some objects have 2 overlapping boxes. I understand you mentioned this is due to the overlapping camera fields of view, but is there no way to prevent this while creating my ground truth? It will definitely affect my KPI calculations.

johnwlambert commented 4 years ago

Hi @mhusseinsh, you're welcome. Could you describe the problem you're encountering in more detail? For example, are you referring to (A) overlapping 2d bounding boxes in a single image, or (B) multiple 2d bboxes corresponding to the same object when it is captured by multiple cameras simultaneously?

Regarding (A) -- the extents of objects fundamentally overlap in 2d, e.g. this car and bus:

[screenshot: overlapping 2d boxes of a car and a bus]

Regarding (B) -- this shouldn't be an issue for training/testing a model that operates on a single camera's output, since the 2d bboxes are correct for any given image. If you are stitching together images and 2d bboxes into a single combined image, we could discuss other options.

mhusseinsh commented 4 years ago

Hello @johnwlambert, thanks for your detailed reply. My problem concerns (B), but I believe it would still affect the evaluation. As an example, I get an image with one vehicle, but that vehicle has 2 bounding boxes as GT. I run an object detector and get one detection for this vehicle. Using IoU and the Hungarian algorithm, I assign the detected box to one of the 2 GT boxes. One GT box then remains unmatched in the labels, which leads to a false negative in my confusion matrix and accordingly lowers my recall, a KPI in my evaluation.
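
For concreteness, a rough sketch of the matching step I am describing (using scipy.optimize.linear_sum_assignment; the helper names and the 0.5 IoU threshold are only illustrative):

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_xyxy(a, b) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_detections(gt_boxes, det_boxes, iou_thresh=0.5):
    """Hungarian matching of detections to GT boxes; unmatched GT boxes become false negatives."""
    cost = np.zeros((len(gt_boxes), len(det_boxes)))
    for i, g in enumerate(gt_boxes):
        for j, d in enumerate(det_boxes):
            cost[i, j] = 1.0 - iou_xyxy(g, d)
    gt_idx, det_idx = linear_sum_assignment(cost)
    matches = [(i, j) for i, j in zip(gt_idx, det_idx) if 1.0 - cost[i, j] >= iou_thresh]
    false_negatives = len(gt_boxes) - len(matches)
    return matches, false_negatives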

johnwlambert commented 4 years ago

Hi @mhusseinsh, could you point me to a specific example/log-id/frame where you observe this behavior?

As long as you evaluate your detector on the output of a single image using the ground truth for that specific image, your evaluation will be correct. An image with one vehicle visible will be paired with only 1 GT bounding box, if you use the script featured above.

johnwlambert commented 4 years ago

Hi @mhusseinsh, I'm closing this issue for now since I haven't heard back from you. Feel free to re-open it if you have additional questions. Thanks!