Hi @mhusseinsh, thanks for your interest in our work. We discuss this briefly in https://github.com/argoai/argoverse-api/issues/19.
If you use the following code snippet on all data splits, you could expect to get more than 1 million 2d bounding boxes (we have ~1M cuboid instances). Our 9 cameras do have some overlap in field of view (especially the stereo pair and ring_front_center), so multiple 2d bounding boxes are possible for a single 3d cuboid.
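For example, to run it over every split you could simply loop over the split directories. A rough, untested sketch, assuming the dataset has been extracted under a local argoverse-tracking/ folder (the path and loop below are illustrative, not part of argoverse-api):

    from pathlib import Path

    dataset_root = Path("argoverse-tracking")  # hypothetical path to your extracted dataset
    for split_dir in sorted(p for p in dataset_root.iterdir() if p.is_dir()):
        # each log is a subdirectory named by its log_id
        log_ids = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
        # e.g. dump_log_2d_bboxes_to_imgs(log_ids, 100, str(split_dir), f"output_{split_dir.name}")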
2d bounding boxes come in several different flavors. For example, would you want "amodal" or "modal" boxes? COCO 2d bboxes are not "amodal", and were likely derived from segmentation masks. We annotated 3d amodal cuboids, so if a vehicle is half-occluded behind another vehicle, our naive 2d bounding box will hallucinate the invisible part of the vehicle, even though the sensor data cannot fully observe it (white truck on right). However, you could use the LiDAR to form a depth map and reason about occlusion via the relative ordering in the depth map (a rough sketch of this idea follows below).
Most of our 3d cuboids are very tight-fitting, but a few will be slightly taller than expected (white sedan on left).
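To make the depth-map idea concrete, here is a rough, untested NumPy sketch. It assumes you have already projected a LiDAR sweep into the image (e.g. via project_lidar_to_img_motion_compensated) so that you have per-return pixel coordinates and camera-frame depths; all names below (lidar_uv, lidar_depth, occluded_fraction, ...) are placeholders for illustration, not argoverse-api names:

    import numpy as np

    def occluded_fraction(
        lidar_uv: np.ndarray,     # (N, 2) pixel coordinates of projected LiDAR returns
        lidar_depth: np.ndarray,  # (N,) camera-frame depth of each return, in meters
        bbox,                     # (x1, y1, x2, y2) amodal 2d box of interest
        obj_depth: float,         # camera-frame depth of the cuboid centroid
        img_h: int,
        img_w: int,
        margin: float = 1.0,      # meters of tolerance before calling a return "in front"
    ) -> float:
        """Fraction of LiDAR returns inside the box that are closer to the camera than the object."""
        # Build a sparse depth map; pixels with no LiDAR return stay 0.
        depth_map = np.zeros((img_h, img_w), dtype=np.float32)
        u = np.clip(lidar_uv[:, 0].astype(int), 0, img_w - 1)
        v = np.clip(lidar_uv[:, 1].astype(int), 0, img_h - 1)
        depth_map[v, u] = lidar_depth

        # Clamp the box to the image and gather the returns that fall inside it.
        x1, y1, x2, y2 = [int(round(c)) for c in bbox]
        x1, x2 = max(x1, 0), min(x2, img_w - 1)
        y1, y2 = max(y1, 0), min(y2, img_h - 1)
        roi = depth_map[y1:y2 + 1, x1:x2 + 1]
        hits = roi[roi > 0]
        if hits.size == 0:
            return 0.0  # no evidence either way
        # Returns significantly closer than the object imply an occluder in front of it.
        return float((hits < obj_depth - margin).mean())

A high fraction suggests the amodal box mostly covers an occluder, so you could drop or down-weight such boxes if you want more "modal"-looking ground truth. Here is the full snippet: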
import argparse
import copy
import glob
import logging
import math
import multiprocessing
import os
import sys
from pathlib import Path
from typing import Any, Iterable, List, Mapping, Sequence, Tuple, Union

import cv2
import imageio
import matplotlib.pyplot as plt
import numpy as np

import argoverse
from argoverse.utils import calibration
from argoverse.data_loading.object_label_record import json_label_dict_to_obj_record
from argoverse.data_loading.simple_track_dataloader import SimpleArgoverseTrackingDataLoader
from argoverse.map_representation.map_api import ArgoverseMap
from argoverse.utils.calibration import (
    CameraConfig,
    get_calibration_config,
    point_cloud_to_homogeneous,
    project_lidar_to_img_motion_compensated,
    project_lidar_to_undistorted_img,
)
from argoverse.utils.camera_stats import (
    RING_CAMERA_LIST,
    STEREO_CAMERA_LIST,
    RING_IMG_HEIGHT,
    RING_IMG_WIDTH,
)
from argoverse.utils.city_visibility_utils import clip_point_cloud_to_visible_region
from argoverse.utils.cv2_plotting_utils import draw_clipped_line_segment
from argoverse.utils.ffmpeg_utils import write_nonsequential_idx_video
from argoverse.utils.frustum_clipping import (
    generate_frustum_planes,
    cuboid_to_2d_frustum_bbox,
)
from argoverse.utils.ply_loader import load_ply
from argoverse.utils.se3 import SE3

# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logger = logging.getLogger(__name__)

#: Any numeric type
Number = Union[int, float]
def plot_img_2d_bboxes(
    labels,
    planes,
    img_bgr: np.ndarray,
    log_calib_data,
    camera_name: str,
    cam_timestamp: int,
    lidar_timestamp: int,
    data_dir: str,
    log_id: str,
    save_img_fpath: str,
) -> None:
    """Project each labeled cuboid into the image, clip it to the camera frustum, and draw its 2d box."""
    for label_idx, label in enumerate(labels):
        obj_rec = json_label_dict_to_obj_record(label)
        if obj_rec.occlusion == 100:
            continue
        cuboid_vertices = obj_rec.as_3d_bbox()
        points_h = point_cloud_to_homogeneous(cuboid_vertices).T
        # Cuboid vertices are recorded at lidar_timestamp; motion-compensate them into the camera frame.
        uv, uv_cam, valid_pts_bool, camera_config = project_lidar_to_img_motion_compensated(
            points_h,
            copy.deepcopy(log_calib_data),
            camera_name,
            cam_timestamp,
            lidar_timestamp,
            data_dir,
            log_id,
            return_K=True,
        )
        K = camera_config.intrinsic
        # if valid_pts_bool.sum() == 0:
        #     continue
        # Clip the cuboid against the frustum and take the axis-aligned 2d box of what remains.
        bbox_2d = cuboid_to_2d_frustum_bbox(uv_cam.T[:, :3], planes, K[:3, :3])
        if bbox_2d is None:
            continue
        x1, y1, x2, y2 = bbox_2d
        # Clamp the box corners to the image extents.
        x1 = min(x1, RING_IMG_WIDTH - 1)
        x2 = min(x2, RING_IMG_WIDTH - 1)
        y1 = min(y1, RING_IMG_HEIGHT - 1)
        y2 = min(y2, RING_IMG_HEIGHT - 1)
        x1 = max(x1, 0)
        x2 = max(x2, 0)
        y1 = max(y1, 0)
        y2 = max(y2, 0)
        # Draw the four box edges.
        plt.plot([x1, x1], [y1, y2], "r")
        plt.plot([x1, x2], [y1, y1], "r")
        plt.plot([x1, x2], [y2, y2], "r")
        plt.plot([x2, x2], [y1, y2], "r")
    # Render the image (back in RGB order for matplotlib) with all boxes overlaid, then save it.
    plt.imshow(img_bgr[:, :, ::-1])
    plt.savefig(save_img_fpath)
    plt.close("all")
    # cv2.imwrite(save_img_fpath, img_bgr)
def dump_log_2d_bboxes_to_imgs(
    log_ids: Sequence[str],
    max_num_images_to_render: int,
    data_dir: str,
    experiment_prefix: str,
    motion_compensate: bool = True,
) -> List[str]:
    """
    We bring the 3d points into each camera coordinate system, and do the clipping there in 3d.

    Args:
        log_ids: a list of log IDs
        max_num_images_to_render: maximum number of images to render
        data_dir: path to the dataset
        experiment_prefix: prefix for the output directory
        motion_compensate: whether to motion compensate when projecting

    Returns:
        saved_img_fpaths: paths to the rendered images
    """
    saved_img_fpaths = []
    dl = SimpleArgoverseTrackingDataLoader(data_dir=data_dir, labels_dir=data_dir)
    avm = ArgoverseMap()
    for log_id in log_ids:
        save_dir = f"{experiment_prefix}_{log_id}"
        if not Path(save_dir).exists():
            os.makedirs(save_dir)
        city_name = dl.get_city_name(log_id)
        log_calib_data = dl.get_log_calibration_data(log_id)
        flag_done = False
        for cam_idx, camera_name in enumerate(RING_CAMERA_LIST + STEREO_CAMERA_LIST):
            if camera_name != "ring_front_center":
                continue
            cam_im_fpaths = dl.get_ordered_log_cam_fpaths(log_id, camera_name)
            for i, im_fpath in enumerate(cam_im_fpaths):
                if i % 50 == 0:
                    logging.info("\tOn file %s of camera %s of %s", i, camera_name, log_id)
                cam_timestamp = Path(im_fpath).stem.split("_")[-1]
                cam_timestamp = int(cam_timestamp)
                # Find the closest LiDAR sweep, e.g. 'PC_315978406032859416.ply'
                ply_fpath = dl.get_closest_lidar_fpath(log_id, cam_timestamp)
                if ply_fpath is None:
                    continue
                lidar_pts = load_ply(ply_fpath)
                save_img_fpath = f"{save_dir}/{camera_name}_{cam_timestamp}.jpg"
                if Path(save_img_fpath).exists():
                    continue
                city_to_egovehicle_se3 = dl.get_city_to_egovehicle_se3(log_id, cam_timestamp)
                if city_to_egovehicle_se3 is None:
                    continue
                lidar_timestamp = Path(ply_fpath).stem.split("_")[-1]
                lidar_timestamp = int(lidar_timestamp)
                labels = dl.get_labels_at_lidar_timestamp(log_id, lidar_timestamp)
                if labels is None:
                    logging.info("\tLabels missing at t=%s", lidar_timestamp)
                    continue
                # Swap channel order as OpenCV expects it -- BGR not RGB
                # (must make a copy to make the memory contiguous)
                img_bgr = imageio.imread(im_fpath)[:, :, ::-1].copy()
                camera_config = get_calibration_config(log_calib_data, camera_name)
                planes = generate_frustum_planes(camera_config.intrinsic.copy(), camera_name)
                plot_img_2d_bboxes(
                    labels,
                    planes,
                    img_bgr,
                    log_calib_data,
                    camera_name,
                    cam_timestamp,
                    lidar_timestamp,
                    data_dir,
                    log_id,
                    save_img_fpath,
                )
                saved_img_fpaths.append(save_img_fpath)
                # Break out early so that only a short demo video is rendered.
                if i > 100:
                    break
        category_subdir = "2d_amodal_labels_100fr"
        if not Path(f"{experiment_prefix}_{category_subdir}").exists():
            os.makedirs(f"{experiment_prefix}_{category_subdir}")
        for cam_idx, camera_name in enumerate(RING_CAMERA_LIST + STEREO_CAMERA_LIST):
            # Write the cuboid video -- could also write w/ fps=20,30,40
            if "stereo" in camera_name:
                fps = 5
            else:
                fps = 30
            img_wildcard = f"{save_dir}/{camera_name}_%*.jpg"
            output_fpath = f"{experiment_prefix}_{category_subdir}/{log_id}_{camera_name}_{fps}fps.mp4"
            write_nonsequential_idx_video(img_wildcard, output_fpath, fps)
    return saved_img_fpaths
def main(args: Any) -> None:
    """Run the example."""
    log_ids = [log_id.strip() for log_id in args.log_ids.split(",")]
    dump_log_2d_bboxes_to_imgs(
        log_ids, args.max_num_images_to_render * 9, args.dataset_dir, args.experiment_prefix
    )


if __name__ == "__main__":
    # Parse command line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--max-num-images-to-render",
        default=5,
        type=int,
        help="number of images within which to render 3d cuboids",
    )
    parser.add_argument("--dataset-dir", type=str, required=True, help="path to the dataset folder")
    parser.add_argument(
        "--log-ids",
        type=str,
        required=True,
        help="comma-separated list of log ids; each log_id represents a log directory, e.g. found at"
        " {args.dataset_dir}/argoverse-tracking/train/{log_id} or"
        " {args.dataset_dir}/argoverse-tracking/sample/{log_id}",
    )
    parser.add_argument(
        "--experiment-prefix",
        default="output",
        type=str,
        help="results will be saved in a folder with this prefix for its name",
    )
    args = parser.parse_args()
    logger.info(args)

    if args.log_ids is None:
        logger.error("Please provide a comma-separated list of log ids")
        raise ValueError("Please provide a comma-separated list of log ids")

    main(args)
Hello @johnwlambert Thank you so much for your help.
But this code is only for visualization. What I am looking for is an appropriate GT txt file for each image, containing the box coordinates and the class category. From what I can see in this code, we don't have any information anywhere about the class category of the drawn boxes.
Hi @mhusseinsh, you're welcome. To save the ground truth, I recommend JSON instead of txt for its structured format. You can directly save (x1, y1, x2, y2) to disk after they are computed for each object in the script above, and the object category is available inside the ObjectLabelRecord class as obj_rec.label_class in the code above.
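As a rough sketch (untested; the output schema and the helper name are just an example, not part of argoverse-api), you could collect one record per object inside plot_img_2d_bboxes and write a JSON file next to each image:

    import json
    from pathlib import Path

    def save_image_ground_truth(records, save_json_fpath: str) -> None:
        """Write one JSON file per image, e.g. [{"label_class": ..., "bbox": [x1, y1, x2, y2]}, ...]."""
        Path(save_json_fpath).parent.mkdir(parents=True, exist_ok=True)
        with open(save_json_fpath, "w") as f:
            json.dump(records, f, indent=2)

    # Inside the loop over labels, right after clamping (x1, y1, x2, y2):
    #     records.append({
    #         "label_class": obj_rec.label_class,  # category string from the 3d annotation
    #         "bbox": [float(x1), float(y1), float(x2), float(y2)],  # cast so the values are JSON-serializable
    #     })
    # and once per image, after the loop:
    #     save_image_ground_truth(records, save_img_fpath.replace(".jpg", ".json"))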
Hi @johnwlambert
I ran the following command
python3 extractGroundTruth.py --dataset-dir argoverse-tracking/val/ --log-ids 033669d3-3d6b-3d3d-bd93-7985d86653ea
A folder is created with name output_033669d3-3d6b-3d3d-bd93-7985d86653ea which contains images inside, a sample is shown below
But afterwards it gave an error related to video creation, and it only created 102 images out of the 900 found in the sequence
Hi @mhusseinsh, in the code above, I break out of the full loop early just to generate short videos as an illustration:
if i > 100:
    break
If you comment out those lines, the script will render all images. You can comment out/remove the following image rendering lines if you only want the 2d ground truth:
plt.plot([x1, x1], [y1, y2], "r")
plt.plot([x1, x2], [y1, y1], "r")
plt.plot([x1, x2], [y2, y2], "r")
plt.plot([x2, x2], [y1, y2], "r")
plt.imshow(img_bgr[:, :, ::-1])
plt.savefig(save_img_fpath)
plt.close("all")
@johnwlambert Thanks a lot for your help. It works perfectly fine now, apart from the fact that some objects have 2 overlapping boxes. I understand that you mentioned this is due to the overlapping cameras, but is there no way to prevent this while creating my ground truth? It will definitely affect my KPI calculations.
Hi @mhusseinsh, you're welcome. Could you discuss the problem you're encountering in more detail? For example, are you referring to (A) overlapping 2d bounding boxes in a single image, or (B) multiple 2d bboxes corresponding to the same object, when captured by multiple cameras simultaneously?
Regarding (A) -- the extents of objects fundamentally overlap in 2d, e.g. this car and bus:
Regarding (B) -- this shouldn't be an issue for training/testing a model to operate on a single camera's output, since the 2d bboxes are correct for any given image. If you are stitching together images and 2d bboxes into a single combined image, we could discuss other options.
Hello @johnwlambert Thanks for your detailed reply. My problem concerns (B), but I believe it does have an effect. As an example, I get an image with one vehicle that has 2 bounding boxes as GT. Now I run an object detector and get one detection for this vehicle. Using IoU and the Hungarian algorithm, I assign the detected box to one of the 2 GT boxes. The other GT box remains unmatched, which leads to a false negative in my confusion matrix and accordingly lowers my recall, which is a KPI for my evaluation.
Hi @mhusseinsh, could you point me to a specific example/log-id/frame where you observe this behavior?
As long as you evaluate your detector on the output of a single image using the ground truth for that specific image, your evaluation will be correct. An image with one vehicle visible will be paired with only 1 GT bounding box, if you use the script featured above.
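For reference, here is a minimal per-image matching sketch (box_iou and match_detections are hypothetical helpers written for illustration, not part of argoverse-api). As long as the GT list and the detection list both come from the same camera image, a single visible vehicle contributes exactly one GT box to the assignment:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def box_iou(a, b) -> float:
        """IoU of two boxes, each given as [x1, y1, x2, y2]."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union if union > 0 else 0.0

    def match_detections(gt_boxes, det_boxes, iou_thresh: float = 0.5):
        """Hungarian assignment of detections to the GT of one image; returns (matches, tp, fp, fn)."""
        if len(gt_boxes) == 0 or len(det_boxes) == 0:
            return [], 0, len(det_boxes), len(gt_boxes)
        iou = np.array([[box_iou(g, d) for d in det_boxes] for g in gt_boxes])
        gt_idx, det_idx = linear_sum_assignment(-iou)  # negate to maximize total IoU
        matches = [(g, d) for g, d in zip(gt_idx, det_idx) if iou[g, d] >= iou_thresh]
        tp = len(matches)
        fp = len(det_boxes) - tp  # unmatched detections
        fn = len(gt_boxes) - tp   # unmatched ground truth boxes
        return matches, tp, fp, fn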
Hi @mhusseinsh, I'm closing this issue for now since I haven't heard back from you. Feel free to re-open it if you have additional questions. Thanks!
Hello, I downloaded the dataset, and I want to run an object detector on the images. Is there a way to extract the ground-truth 2D bounding boxes of the vehicles, pedestrians, and so on from the labels?