autonomousvision / kitti360Scripts

This repository contains utility scripts for the KITTI-360 dataset.

Occlusion and Truncation Labels for KITTI-360 Objects #78

Closed abhi1kumar closed 1 year ago

abhi1kumar commented 1 year ago

Hi @yiyiliao, thank you for releasing a great dataset.

Each object in the original KITTI dataset has an occlusion and a truncation label associated with it, as explained in the KITTI devkit readme. I quote the explanation here:

   1    truncated    Float from 0 (non-truncated) to 1 (truncated), where
                     truncated refers to the object leaving image boundaries
   1    occluded     Integer (0,1,2,3) indicating occlusion state:
                     0 = fully visible, 1 = partly occluded
                     2 = largely occluded, 3 = unknown

The occlusion, truncation, and 2D box height classify each KITTI object into the Easy, Moderate, and Hard categories, as shown here.

I wanted to classify each KITTI-360 object into Easy, Moderate, and Hard to evaluate with the KITTI 3D detection metric. Since I already know the 2D height of each KITTI-360 object, I wanted to get the truncation and occlusion labels for each object.

Would you mind letting me know how I should calculate the occlusion and truncation of each KITTI-360 object?
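
For context, the official KITTI benchmark uses the following cut-offs (minimum 2D box height, maximum occlusion level, maximum truncation) for the three categories. A minimal sketch of the assignment; kitti_difficulty is just an illustrative name, not part of any devkit:

def kitti_difficulty(box2d_height, occlusion, truncation):
    # Official KITTI cut-offs per difficulty:
    #   Easy:     height >= 40 px, occlusion <= 0, truncation <= 0.15
    #   Moderate: height >= 25 px, occlusion <= 1, truncation <= 0.30
    #   Hard:     height >= 25 px, occlusion <= 2, truncation <= 0.50
    if box2d_height >= 40 and occlusion <= 0 and truncation <= 0.15:
        return 'Easy'
    if box2d_height >= 25 and occlusion <= 1 and truncation <= 0.30:
        return 'Moderate'
    if box2d_height >= 25 and occlusion <= 2 and truncation <= 0.50:
        return 'Hard'
    return 'Ignored'  # too small, too occluded, or too truncated for any bucket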

yiyiliao commented 1 year ago

Hi, thank you for your interest in our dataset.

It would be possible to evaluate the occlusion and truncation of each object based on the projections of the 3D bounding boxes. For example, one object is not truncated when all 8 vertices of the 3D bounding box lie within the image when projected to 2D (not necessarily true vice-versa though). Similar methods can be used to determine whether an object is occluded or not. This is not 100% accurate but can provide a rough estimation of occlusion and truncation.

You may find this script helpful for projecting the 3D bounding boxes to the 2D image space: https://github.com/autonomousvision/kitti360Scripts/blob/master/kitti360scripts/helpers/project.py#L245-L266
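
To make the truncation check concrete, here is a minimal sketch (this is not the code from project.py; the 8 box vertices in the camera frame and a 3x4 projection matrix P are assumed inputs, and the function name is only for illustration):

import numpy as np

def is_not_truncated(vertices_cam, P, img_w, img_h):
    # vertices_cam: 8 x 3 array of 3D box vertices in the camera coordinate frame
    # P: 3 x 4 camera projection matrix
    pts_h = np.hstack([vertices_cam, np.ones((8, 1))])  # 8 x 4 homogeneous points
    uvz = (P @ pts_h.T).T                               # 8 x 3 projected points
    uv = uvz[:, :2] / uvz[:, 2:3]                       # perspective division
    inside = (uvz[:, 2] > 0) & \
             (uv[:, 0] >= 0) & (uv[:, 0] < img_w) & \
             (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    # All 8 projected vertices inside the image -> very likely not truncated;
    # the converse does not necessarily hold, as noted above.
    return bool(inside.all())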

abhi1kumar commented 1 year ago

@yiyiliao, thank you for your quick reply.

Similar methods can be used to determine whether an object is occluded or not. This is not 100% accurate but can provide a rough estimation of occlusion and truncation.

I found the occlusion calculation to be tricky. For each 3D object, I first calculate its (float) visibility visible_frac, i.e., the number of pixels belonging to the object in the 2D panoptic map divided by its 2D bounding box area box2d_area. Next, I quantize visible_frac to get the occlusion, since the occlusion labels in the original KITTI are integers. The quantization part is tricky since I do not know the correct thresholds.

import numpy as np

# 2D information
# global_seg is the panoptic map of global instance IDs for the current frame
# Get the visible pixel bounds of the object in the image
# (get_bounds_of_binary_array is my helper returning the min/max pixel coordinates of a binary mask)
u_min, v_min, u_max, v_max = get_bounds_of_binary_array(global_seg == globalId)

# ==============================================================================
# Occlusion
# ==============================================================================
# Find out the instance ID of the visible bounding boxes based on our 2D instance segmentation maps,
# and then retrieve the corresponding 3D bounding boxes.
# Reference:
# https://github.com/autonomousvision/kitti360Scripts/issues/58#issuecomment-1124445995
box2d_area     = (int(u_max) - int(u_min))*(int(v_max) - int(v_min))
globalId_cnt   = np.sum(global_seg == globalId)
visible_frac   = globalId_cnt/box2d_area if box2d_area > 0 else 0.0

VISIBLE_FRAC_THRESHOLD_1 = 0.6
VISIBLE_FRAC_THRESHOLD_2 = 0.2

if visible_frac > VISIBLE_FRAC_THRESHOLD_1:
    occlusion = 0
elif visible_frac > VISIBLE_FRAC_THRESHOLD_2:
    occlusion = 1
else:
    occlusion = 2

Please let me know if the occlusion calculation and the thresholds VISIBLE_FRAC_THRESHOLD_1 and VISIBLE_FRAC_THRESHOLD_2 are correct. If they are not, would you mind sharing the correct snippet for the occlusion calculation?

It would be possible to evaluate the occlusion and truncation of each object based on the projections of the 3D bounding boxes. For example, one object is not truncated when all 8 vertices of the 3D bounding box lie within the image when projected to 2D (not necessarily true vice-versa though).

Truncation was straightforward to implement.

# ==============================================================================
# Truncation
# ============================================================================== 
# First project the 8 box vertices into pixel space
uv_vertices = project_3d_points(camera_calib, points_4d=vertices.transpose()).transpose()  # 8 x 4
u_min_temp, v_min_temp, _, _ = np.min(uv_vertices, axis=0)
u_max_temp, v_max_temp, _, _ = np.max(uv_vertices, axis=0)

# Take the union of the projected-box bounds and the visible bounds from the panoptic map,
# i.e., expand the projected box to also cover any visible pixels lying outside it
u_min_temp = min(u_min, u_min_temp)
v_min_temp = min(v_min, v_min_temp)
u_max_temp = max(u_max, u_max_temp)
v_max_temp = max(v_max, v_max_temp)

# Definition of truncation (original KITTI devkit readme):
# https://github.com/abhi1kumar/groomed_nms/blob/main/data/kitti_split1/devkit/readme.txt#L55-L72
# Truncation = fraction of the full (union) box area not covered by the visible bounds
truncation = 1.0
if u_min < u_max and v_min < v_max and u_min_temp < u_max_temp and v_min_temp < v_max_temp:
    truncation = 1.0 - ((u_max - u_min)*(v_max - v_min))/((u_max_temp - u_min_temp)*(v_max_temp - v_min_temp))

yiyiliao commented 1 year ago

Regarding the occlusion, I think it makes more sense to check the occlusion of the projected 3D bounding box, i.e., whether a bounding box is occluded by another bounding box or not. You need to project all bounding boxes visible at the current frame (all road classes can be excluded) to see whether they occlude the target bounding box or not.
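
A rough sketch of this box-vs-box check (project_box_to_2d and depth_of are hypothetical helpers here: the first returns the 2D rectangle enclosing a projected 3D box, clipped to the image, and the second the box's distance to the camera):

import numpy as np

def occluded_fraction(target_box, other_boxes, camera, img_w, img_h):
    # Paint the projected rectangles of all nearer boxes into an occluder mask
    # and measure how much of the target rectangle they cover.
    u0, v0, u1, v1 = project_box_to_2d(target_box, camera, img_w, img_h)
    target_mask = np.zeros((img_h, img_w), dtype=bool)
    target_mask[int(v0):int(v1), int(u0):int(u1)] = True

    occluder_mask = np.zeros((img_h, img_w), dtype=bool)
    for box in other_boxes:
        if depth_of(box, camera) < depth_of(target_box, camera):  # potential occluder
            ou0, ov0, ou1, ov1 = project_box_to_2d(box, camera, img_w, img_h)
            occluder_mask[int(ov0):int(ov1), int(ou0):int(ou1)] = True

    target_area = target_mask.sum()
    if target_area == 0:
        return 0.0
    return float((target_mask & occluder_mask).sum()) / target_area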

An alternative could be calculating the number of pixels in the 2D panoptic map divided by the projected 3D bounding box area. If I understood correctly, your 2D bounding box area counts only non-occluded regions.
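
A short sketch of this alternative, assuming the projected-box bounds (u_min_p, v_min_p, u_max_p, v_max_p) have already been computed, e.g., from the 8 projected vertices:

# Area of the 2D rectangle enclosing the projected 3D bounding box
proj_box_area = max(u_max_p - u_min_p, 0) * max(v_max_p - v_min_p, 0)
# Visible pixels of this object in the panoptic map
visible_pixels = np.sum(global_seg == globalId)
visible_frac = visible_pixels / proj_box_area if proj_box_area > 0 else 0.0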

Unfortunately, I don't know how the thresholds are determined for KITTI. I am afraid that it is hard to align the occlusion metric of KITTI-360 to that of KITTI perfectly as we do not provide a 2D bounding box for the full car (regardless of occlusion). Both methods I mentioned above can only provide a rough estimation of the occlusion and may not be accurate. I can check with my colleague if you think a better alignment with KITTI is needed.

abhi1kumar commented 1 year ago

Thank you once again for your quick reply.

Regarding the occlusion, I think it makes more sense to check the occlusion of the projected 3D bounding box, i.e., whether a bounding box is occluded by another bounding box or not.

If I understood correctly, your 2D bounding box area counts only non-occluded regions.

You are absolutely spot on. I will use the projected 3D bounding box instead of the bounds from the 2D panoptic map.

Unfortunately, I don't know how the thresholds are determined for KITTI. Both methods I mentioned above can only provide a rough estimation of the occlusion and may not be accurate. I can check with my colleague if you think a better alignment with KITTI is needed.

I understand that the occlusion estimates are approximate. However, we are looking for the closest possible alignment with the KITTI dataset to get the 3D detection performance on KITTI-360 Val split with the KITTI metric. Moreover, the fantastic KITTI dataset is also from your autonomousvision group, with Prof. Geiger being the first author.

Hence, it would be awesome if you could check the thresholds VISIBLE_FRAC_THRESHOLD_1 and VISIBLE_FRAC_THRESHOLD_2 and other details with your colleague or Prof. Geiger, so that we can replicate the occlusion labels as closely as possible.

yiyiliao commented 1 year ago

I asked Andreas. For KITTI this is done by the annotators. They are asked to use thresholds of roughly 80% for VISIBLE_FRAC_THRESHOLD_1 and 20% for VISIBLE_FRAC_THRESHOLD_2, where 80% and 20% are the ratios of visible pixels per object.

If you want to calculate the visibility ratio as the number of pixels divided by the area of the projected 3D bounding box, you may want to first calculate the average ratio over all non-occluded objects (as it is smaller than 100% even when the object is non-occluded). This average ratio can then be used as a reference to determine the VISIBLE_FRAC_THRESHOLD for occluded objects.
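
A small sketch of this calibration, assuming visible_fracs holds the ratio (panoptic pixels / projected 3D box area) per object and is_non_occluded comes from a check such as the box-overlap sketch above:

import numpy as np

non_occluded = [f for f, free in zip(visible_fracs, is_non_occluded) if free]
avg_non_occluded_ratio = np.mean(non_occluded) if non_occluded else 1.0

occlusions = []
for frac in visible_fracs:
    # Normalize so that a typical non-occluded object maps to roughly 1.0
    normalized = frac / avg_non_occluded_ratio if avg_non_occluded_ratio > 0 else 0.0
    if normalized > 0.8:        # annotator threshold for "fully visible"
        occlusions.append(0)
    elif normalized > 0.2:      # annotator threshold for "partly occluded"
        occlusions.append(1)
    else:
        occlusions.append(2)    # largely occluded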

abhi1kumar commented 1 year ago

They are asked to use thresholds of roughly 80% for VISIBLE_FRAC_THRESHOLD_1 and 20% for VISIBLE_FRAC_THRESHOLD_2, where 80% and 20% are the ratios of visible pixels per object.

Thank you, @yiyiliao, for asking Prof. Geiger and sharing the exact values of the thresholds VISIBLE_FRAC_THRESHOLD_1 and VISIBLE_FRAC_THRESHOLD_2.

If you want to calculate the visibility ratio as the number of pixels divided by the area of the projected 3D bounding box, you may want to first calculate the average ratio over all non-occluded objects (as it is smaller than 100% even when the object is non-occluded). This average ratio can then be used as a reference to determine the VISIBLE_FRAC_THRESHOLD for occluded objects.

Wow! That is a great suggestion as well.

Thank you once again for helping us out.