Closed: JPenshornTAP closed this issue 7 months ago
It may be caused by the ray-object intersection computation; we will optimize that in the future. As a quick fix, you can delete the unused variables in the dataparsers; there is some redundant code inherited from NSG that is not necessary.
Met the same issue here, did you find any workaround? @JPenshornTAP Thank you.
Met the same issue here. Is there a way to reduce the RAM cost, e.g. by loading data lazily the way torch.utils.data.DataLoader does?
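One common mitigation, independent of MARS, is to keep only file paths in memory and decode each frame when it is accessed, which is essentially the DataLoader pattern. A minimal standalone sketch (class and names here are illustrative, not the MARS API):

```python
# Minimal sketch: hold only filename strings in RAM and decode images on
# access, instead of materializing every frame up front.
class LazyFrameSource:
    def __init__(self, filenames, loader):
        self.filenames = list(filenames)  # cheap: just strings
        self.loader = loader              # e.g. imageio.imread in practice

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        # decode a single frame only when it is requested
        return self.loader(self.filenames[idx])

# toy loader standing in for an image decoder
frames = LazyFrameSource(["000000.png", "000001.png"], loader=lambda p: {"path": p})
first = frames[0]
```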
Here's a refactored dataparser for the KITTI dataset. Some minor changes are still needed to plug it into the current MARS release version. Hope this helps with the memory issue. This script will ship in a future release, stay tuned~
from __future__ import annotations
import os
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional, Type
import imageio
import numpy as np
import pandas as pd
import torch
from rich.console import Console
from mars.configs.storage_configs.base_storage_config import StorageConfig
from mars.instances.instances import Instances
from mars.utils.neural_scene_graph_helper import box_pts
from nerfstudio.cameras import camera_utils
from nerfstudio.cameras.cameras import Cameras, CameraType
from nerfstudio.data.dataparsers.base_dataparser import (
DataParser,
DataParserConfig,
DataparserOutputs,
Semantics,
)
from nerfstudio.data.scene_box import SceneBox
from nerfstudio.plugins.registry_dataparser import DataParserSpecification
from nerfstudio.utils.colors import get_color
CONSOLE = Console(width=120)
_sem2label = {"Misc": -1, "Car": 0, "Van": 0, "Truck": 2, "Tram": 3, "Pedestrian": 4}
camera_ls = [2, 3]
def kitti_string_to_float(s):
    """Parse a KITTI calibration value in scientific notation (e.g. "9.842439e+02")."""
    return float(s.split("e")[0]) * 10 ** int(s.split("e")[1])
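As a quick standalone sanity check, the manual mantissa/exponent split used here agrees with Python's built-in parsing of the same notation:

```python
# Standalone check: the manual mantissa/exponent split agrees with Python's
# built-in parsing of the same scientific notation.
s = "9.842439e+02"
manual = float(s.split("e")[0]) * 10 ** int(s.split("e")[1])
builtin = float(s)
diff = abs(manual - builtin)
```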
def get_rotation(roll, pitch, heading):
s_heading = np.sin(heading)
c_heading = np.cos(heading)
rot_z = np.array([[c_heading, -s_heading, 0], [s_heading, c_heading, 0], [0, 0, 1]])
s_pitch = np.sin(pitch)
c_pitch = np.cos(pitch)
rot_y = np.array([[c_pitch, 0, s_pitch], [0, 1, 0], [-s_pitch, 0, c_pitch]])
s_roll = np.sin(roll)
c_roll = np.cos(roll)
rot_x = np.array([[1, 0, 0], [0, c_roll, -s_roll], [0, s_roll, c_roll]])
rot = np.matmul(rot_z, np.matmul(rot_y, rot_x))
return rot
def invert_transformation(rot, t):
t = np.matmul(-rot.T, t)
inv_translation = np.concatenate([rot.T, t[:, None]], axis=1)
return np.concatenate([inv_translation, np.array([[0.0, 0.0, 0.0, 1.0]])])
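For reference, `invert_transformation` implements the closed-form inverse of a rigid transform: the inverse of [R | t] is [R^T | -R^T t]. A standalone numpy check of that identity:

```python
import numpy as np

# Standalone check of the identity behind invert_transformation:
# the inverse of a rigid transform [R | t] is [R^T | -R^T t].
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([1.0, -2.0, 0.5])
T = np.eye(4)
T[:3, :3], T[:3, 3] = R, t
T_inv = np.eye(4)
T_inv[:3, :3] = R.T
T_inv[:3, 3] = -R.T @ t
residual = np.abs(T_inv @ T - np.eye(4)).max()
```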
def calib_from_txt(calibration_path):
"""
Read the calibration files and extract the required transformation matrices and focal length.
Args:
calibration_path (str): The path to the directory containing the calibration files.
Returns:
tuple: A tuple containing the following elements:
imu2v (np.array): 4x4 transformation matrix from IMU to Velodyne coordinates.
v2c (np.array): 4x4 transformation matrix from Velodyne to left camera coordinates.
c2leftRGB (np.array): 4x4 transformation matrix from left camera to rectified left camera coordinates.
c2rightRGB (np.array): 4x4 transformation matrix from right camera to rectified right camera coordinates.
focal (float): Focal length of the left camera.
"""
c2c = []
# Read and parse the camera-to-camera calibration file
    with open(os.path.join(calibration_path, "calib_cam_to_cam.txt"), "r") as f:
        cam_to_cam_str = f.read()
[left_cam, right_cam] = cam_to_cam_str.split("S_02: ")[1].split("S_03: ")
cam_to_cam_ls = [left_cam, right_cam]
# Extract the transformation matrices for left and right cameras
for i, cam_str in enumerate(cam_to_cam_ls):
r_str, t_str = cam_str.split("R_0" + str(i + 2) + ": ")[1].split("\nT_0" + str(i + 2) + ": ")
t_str = t_str.split("\n")[0]
R = np.array([kitti_string_to_float(r) for r in r_str.split(" ")])
R = np.reshape(R, [3, 3])
t = np.array([kitti_string_to_float(t) for t in t_str.split(" ")])
Tr = np.concatenate([np.concatenate([R, t[:, None]], axis=1), np.array([0.0, 0.0, 0.0, 1.0])[None, :]])
t_str_rect, s_rect_part = cam_str.split("\nT_0" + str(i + 2) + ": ")[1].split("\nS_rect_0" + str(i + 2) + ": ")
s_rect_str, r_rect_part = s_rect_part.split("\nR_rect_0" + str(i + 2) + ": ")
r_rect_str = r_rect_part.split("\nP_rect_0" + str(i + 2) + ": ")[0]
R_rect = np.array([kitti_string_to_float(r) for r in r_rect_str.split(" ")])
R_rect = np.reshape(R_rect, [3, 3])
t_rect = np.array([kitti_string_to_float(t) for t in t_str_rect.split(" ")])
Tr_rect = np.concatenate(
[np.concatenate([R_rect, t_rect[:, None]], axis=1), np.array([0.0, 0.0, 0.0, 1.0])[None, :]]
)
c2c.append(Tr_rect)
c2leftRGB = c2c[0]
c2rightRGB = c2c[1]
# Read and parse the Velodyne-to-camera calibration file
    with open(os.path.join(calibration_path, "calib_velo_to_cam.txt"), "r") as f:
        velo_to_cam_str = f.read()
r_str, t_str = velo_to_cam_str.split("R: ")[1].split("\nT: ")
t_str = t_str.split("\n")[0]
R = np.array([kitti_string_to_float(r) for r in r_str.split(" ")])
R = np.reshape(R, [3, 3])
t = np.array([kitti_string_to_float(r) for r in t_str.split(" ")])
v2c = np.concatenate([np.concatenate([R, t[:, None]], axis=1), np.array([0.0, 0.0, 0.0, 1.0])[None, :]])
# Read and parse the IMU-to-Velodyne calibration file
    with open(os.path.join(calibration_path, "calib_imu_to_velo.txt"), "r") as f:
        imu_to_velo_str = f.read()
r_str, t_str = imu_to_velo_str.split("R: ")[1].split("\nT: ")
R = np.array([kitti_string_to_float(r) for r in r_str.split(" ")])
R = np.reshape(R, [3, 3])
t = np.array([kitti_string_to_float(r) for r in t_str.split(" ")])
imu2v = np.concatenate([np.concatenate([R, t[:, None]], axis=1), np.array([0.0, 0.0, 0.0, 1.0])[None, :]])
# Extract the focal length of the left camera
focal = kitti_string_to_float(left_cam.split("P_rect_02: ")[1].split()[0])
return imu2v, v2c, c2leftRGB, c2rightRGB, focal
def tracking_calib_from_txt(calibration_path):
"""
Extract tracking calibration information from a KITTI tracking calibration file.
This function reads a KITTI tracking calibration file and extracts the relevant
calibration information, including projection matrices and transformation matrices
for camera, LiDAR, and IMU coordinate systems.
Args:
calibration_path (str): Path to the KITTI tracking calibration file.
Returns:
dict: A dictionary containing the following calibration information:
P0, P1, P2, P3 (np.array): 3x4 projection matrices for the cameras.
Tr_cam2camrect (np.array): 4x4 transformation matrix from camera to rectified camera coordinates.
Tr_velo2cam (np.array): 4x4 transformation matrix from LiDAR to camera coordinates.
Tr_imu2velo (np.array): 4x4 transformation matrix from IMU to LiDAR coordinates.
"""
# Read the calibration file
    with open(calibration_path) as f:
        calib_str = f.read().splitlines()
# Process the calibration data
calibs = []
for calibration in calib_str:
calibs.append(np.array([kitti_string_to_float(val) for val in calibration.split()[1:]]))
# Extract the projection matrices
P0 = np.reshape(calibs[0], [3, 4])
P1 = np.reshape(calibs[1], [3, 4])
P2 = np.reshape(calibs[2], [3, 4])
P3 = np.reshape(calibs[3], [3, 4])
# Extract the transformation matrix for camera to rectified camera coordinates
Tr_cam2camrect = np.eye(4)
R_rect = np.reshape(calibs[4], [3, 3])
Tr_cam2camrect[:3, :3] = R_rect
# Extract the transformation matrices for LiDAR to camera and IMU to LiDAR coordinates
Tr_velo2cam = np.concatenate([np.reshape(calibs[5], [3, 4]), np.array([[0.0, 0.0, 0.0, 1.0]])], axis=0)
Tr_imu2velo = np.concatenate([np.reshape(calibs[6], [3, 4]), np.array([[0.0, 0.0, 0.0, 1.0]])], axis=0)
return {
"P0": P0,
"P1": P1,
"P2": P2,
"P3": P3,
"Tr_cam2camrect": Tr_cam2camrect,
"Tr_velo2cam": Tr_velo2cam,
"Tr_imu2velo": Tr_imu2velo,
}
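For reference, each line of a KITTI tracking calib file is just `KEY: v0 v1 ...`; a standalone sketch of parsing one projection line the way `tracking_calib_from_txt` does (the numbers below are illustrative KITTI-like values, not real calibration):

```python
import numpy as np

# Standalone sketch: parse one "KEY: v0 v1 ..." calibration line into a
# 3x4 projection matrix, as tracking_calib_from_txt does per line.
line = "P2: 7.215377e+02 0.0 6.095593e+02 4.485728e+01 0.0 7.215377e+02 1.728540e+02 2.163791e-01 0.0 0.0 1.0 2.745884e-03"
vals = np.array([float(v) for v in line.split()[1:]])  # drop the "P2:" key
P2 = vals.reshape(3, 4)
```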
def get_poses_calibration(basedir, oxts_path_tracking=None, selected_frames=None):
"""
Extract poses and calibration information from the KITTI dataset.
This function processes the OXTS data (GPS/IMU) and extracts the
pose information (translation and rotation) for each frame. It also
retrieves the calibration information (transformation matrices and focal length)
required for further processing.
Args:
basedir (str): The base directory containing the KITTI dataset.
oxts_path_tracking (str, optional): Path to the OXTS data file for tracking sequences.
If not provided, the function will look for OXTS data in the basedir.
selected_frames (list, optional): A list of frame indices to process.
If not provided, all frames in the dataset will be processed.
Returns:
tuple: A tuple containing the following elements:
poses (np.array): An array of 4x4 pose matrices representing the vehicle's
position and orientation for each frame (IMU pose).
calibrations (dict): A dictionary containing the transformation matrices
and focal length obtained from the calibration files.
focal (float): The focal length of the left camera.
"""
def oxts_to_pose(oxts):
"""
OXTS (Oxford Technical Solutions) data typically refers to the data generated by an Inertial and GPS Navigation System (INS/GPS) that is used to provide accurate position, orientation, and velocity information for a moving platform, such as a vehicle. In the context of the KITTI dataset, OXTS data is used to provide the ground truth for the vehicle's trajectory and 6 degrees of freedom (6-DoF) motion, which is essential for evaluating and benchmarking various computer vision and robotics algorithms, such as visual odometry, SLAM, and object detection.
The OXTS data contains several important measurements:
1. Latitude, longitude, and altitude: These are the global coordinates of the moving platform.
2. Roll, pitch, and yaw (heading): These are the orientation angles of the platform, usually given in Euler angles.
3. Velocity (north, east, and down): These are the linear velocities of the platform in the local navigation frame.
4. Accelerations (ax, ay, az): These are the linear accelerations in the platform's body frame.
5. Angular rates (wx, wy, wz): These are the angular rates (also known as angular velocities) of the platform in its body frame.
In the KITTI dataset, the OXTS data is stored as plain text files with each line corresponding to a timestamp. Each line in the file contains the aforementioned measurements, which are used to compute the ground truth trajectory and 6-DoF motion of the vehicle. This information can be further used for calibration, data synchronization, and performance evaluation of various algorithms.
"""
poses = []
def latlon_to_mercator(lat, lon, s):
"""
Converts latitude and longitude coordinates to Mercator coordinates (x, y) using the given scale factor.
The Mercator projection is a widely used cylindrical map projection that represents the Earth's surface
as a flat, rectangular grid, distorting the size of geographical features in higher latitudes.
This function uses the scale factor 's' to control the amount of distortion in the projection.
Args:
lat (float): Latitude in degrees, range: -90 to 90.
lon (float): Longitude in degrees, range: -180 to 180.
s (float): Scale factor, typically the cosine of the reference latitude.
Returns:
list: A list containing the Mercator coordinates [x, y] in meters.
"""
r = 6378137.0 # the Earth's equatorial radius in meters
x = s * r * ((np.pi * lon) / 180)
y = s * r * np.log(np.tan((np.pi * (90 + lat)) / 360))
return [x, y]
# Compute the initial scale and pose based on the selected frames
if selected_frames is None:
lat0 = oxts[0][0]
scale = np.cos(lat0 * np.pi / 180)
pose_0_inv = None
else:
oxts0 = oxts[selected_frames[0][0]]
lat0 = oxts0[0]
scale = np.cos(lat0 * np.pi / 180)
pose_i = np.eye(4)
[x, y] = latlon_to_mercator(oxts0[0], oxts0[1], scale)
z = oxts0[2]
translation = np.array([x, y, z])
rotation = get_rotation(oxts0[3], oxts0[4], oxts0[5])
pose_i[:3, :] = np.concatenate([rotation, translation[:, None]], axis=1)
pose_0_inv = invert_transformation(pose_i[:3, :3], pose_i[:3, 3])
# Iterate through the OXTS data and compute the corresponding pose matrices
for oxts_val in oxts:
pose_i = np.zeros([4, 4])
pose_i[3, 3] = 1
[x, y] = latlon_to_mercator(oxts_val[0], oxts_val[1], scale)
z = oxts_val[2]
translation = np.array([x, y, z])
roll = oxts_val[3]
pitch = oxts_val[4]
heading = oxts_val[5]
rotation = get_rotation(roll, pitch, heading) # (3,3)
pose_i[:3, :] = np.concatenate([rotation, translation[:, None]], axis=1) # (4, 4)
if pose_0_inv is None:
pose_0_inv = invert_transformation(pose_i[:3, :3], pose_i[:3, 3])
pose_i = np.matmul(pose_0_inv, pose_i)
poses.append(pose_i)
return np.array(poses)
# If there is no tracking path specified, use the default path
if oxts_path_tracking is None:
oxts_path = os.path.join(basedir, "oxts/data")
oxts = np.array([np.loadtxt(os.path.join(oxts_path, file)) for file in sorted(os.listdir(oxts_path))])
calibration_path = os.path.dirname(basedir)
calibrations = calib_from_txt(calibration_path)
focal = calibrations[4]
poses = oxts_to_pose(oxts)
# If a tracking path is specified, use it to load OXTS data and compute the poses
else:
oxts_tracking = np.loadtxt(oxts_path_tracking)
poses = oxts_to_pose(oxts_tracking) # (n_frames, 4, 4)
calibrations = None
focal = None
# Set velodyne close to z = 0
# poses[:, 2, 3] -= 0.8
# Return the poses, calibrations, and focal length
return poses, calibrations, focal
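The Mercator conversion inside `oxts_to_pose` can be checked in isolation: x depends linearly on longitude, and y vanishes at the equator. A standalone sketch with illustrative coordinates:

```python
import numpy as np

# Standalone check of the latlon_to_mercator formula used above:
# x = s * r * lon_rad and y = s * r * log(tan(pi * (90 + lat) / 360)),
# with s the cosine of the reference latitude. Values are illustrative.
r = 6378137.0                      # Earth's equatorial radius in metres
lat, lon = 48.0, 8.0               # degrees
s = np.cos(np.deg2rad(lat))        # scale from the reference latitude
x = s * r * np.deg2rad(lon)
y = s * r * np.log(np.tan(np.pi * (90.0 + lat) / 360.0))
y_equator = s * r * np.log(np.tan(np.pi * (90.0 + 0.0) / 360.0))  # 0 at the equator
```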
def get_camera_poses_tracking(poses_velo_w_tracking, tracking_calibration, selected_frames, scene_no=None):
exp = False
camera_poses = []
opengl2kitti = np.array([[1, 0, 0, 0], [0, -1, 0, 0], [0, 0, -1, 0], [0, 0, 0, 1]])
start_frame = selected_frames[0]
end_frame = selected_frames[1]
#####################
# Debug Camera offset
if scene_no == 2:
yaw = np.deg2rad(0.7) ## Affects camera rig roll: High --> counterclockwise
pitch = np.deg2rad(-0.5) ## Affects camera rig yaw: High --> Turn Right
# pitch = np.deg2rad(-0.97)
roll = np.deg2rad(0.9) ## Affects camera rig pitch: High --> up
# roll = np.deg2rad(1.2)
elif scene_no == 1:
if exp:
yaw = np.deg2rad(0.3) ## Affects camera rig roll: High --> counterclockwise
pitch = np.deg2rad(-0.6) ## Affects camera rig yaw: High --> Turn Right
# pitch = np.deg2rad(-0.97)
roll = np.deg2rad(0.75) ## Affects camera rig pitch: High --> up
# roll = np.deg2rad(1.2)
else:
yaw = np.deg2rad(0.5) ## Affects camera rig roll: High --> counterclockwise
pitch = np.deg2rad(-0.5) ## Affects camera rig yaw: High --> Turn Right
roll = np.deg2rad(0.75) ## Affects camera rig pitch: High --> up
else:
yaw = np.deg2rad(0.05)
pitch = np.deg2rad(-0.75)
# pitch = np.deg2rad(-0.97)
roll = np.deg2rad(1.05)
# roll = np.deg2rad(1.2)
cam_debug = np.eye(4)
cam_debug[:3, :3] = get_rotation(roll, pitch, yaw)
Tr_cam2camrect = tracking_calibration["Tr_cam2camrect"]
Tr_cam2camrect = np.matmul(Tr_cam2camrect, cam_debug)
Tr_camrect2cam = invert_transformation(Tr_cam2camrect[:3, :3], Tr_cam2camrect[:3, 3])
Tr_velo2cam = tracking_calibration["Tr_velo2cam"]
Tr_cam2velo = invert_transformation(Tr_velo2cam[:3, :3], Tr_velo2cam[:3, 3])
camera_poses_imu = []
for cam in camera_ls:
Tr_camrect2cam_i = tracking_calibration["Tr_camrect2cam0" + str(cam)]
Tr_cam_i2camrect = invert_transformation(Tr_camrect2cam_i[:3, :3], Tr_camrect2cam_i[:3, 3])
# transform camera axis from kitti to opengl for nerf:
cam_i_camrect = np.matmul(Tr_cam_i2camrect, opengl2kitti)
cam_i_cam0 = np.matmul(Tr_camrect2cam, cam_i_camrect)
cam_i_velo = np.matmul(Tr_cam2velo, cam_i_cam0)
cam_i_w = np.matmul(poses_velo_w_tracking, cam_i_velo)
camera_poses_imu.append(cam_i_w)
for i, cam in enumerate(camera_ls):
for frame_no in range(start_frame, end_frame + 1):
camera_poses.append(camera_poses_imu[i][frame_no])
return np.array(camera_poses)
def get_obj_pose_tracking(tracklet_path, poses_imu_tracking, calibrations, selected_frames, transform_matrix):
"""
Extracts object pose information from the KITTI motion tracking dataset for the specified frames.
Parameters
----------
tracklet_path : str
Path to the text file containing tracklet information. A tracklet is a small sequence of object positions and orientations over time, often used in the context of object tracking and motion estimation in computer vision. In a dataset, a tracklet usually represents a single object's pose information across multiple consecutive frames. This information includes the object's position, orientation (usually as rotation around the vertical axis, i.e., yaw angle), and other attributes like object type, dimensions, etc. In the KITTI dataset, tracklets are used to store and provide ground truth information about dynamic objects in the scene, such as cars, pedestrians, and cyclists.
poses_imu_tracking : list of numpy arrays
A list of 4x4 transformation matrices representing the poses (positions and orientations) of the ego vehicle (the main vehicle equipped with sensors) in Inertial Measurement Unit (IMU) coordinates at different time instances (frames). Each matrix in the list corresponds to a single frame in the dataset.
calibrations : dict
Dictionary containing calibration information:
- "Tr_velo2cam": 4x4 transformation matrix from Velodyne coordinates to camera coordinates.
- "Tr_imu2velo": 4x4 transformation matrix from IMU coordinates to Velodyne coordinates.
selected_frames : list of int
List of two integers specifying the start and end frames to process.
Returns
-------
Instances
    An `Instances` object holding per-track metadata (dimensions, category, track id)
    and per-frame centers, yaws, timestamps, and a validity mask.
"""
# Helper function to generate a rotation matrix around the y-axis
def roty_matrix(roty):
c = np.cos(roty)
s = np.sin(roty)
return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
# Extract calibration data
velo2cam = calibrations["Tr_velo2cam"]
imu2velo = calibrations["Tr_imu2velo"]
cam2velo = invert_transformation(velo2cam[:3, :3], velo2cam[:3, 3])
velo2imu = invert_transformation(imu2velo[:3, :3], imu2velo[:3, 3])
start_frame = selected_frames[0]
end_frame = selected_frames[1]
# Read tracklets from file
    with open(tracklet_path) as f:
        tracklets_str = f.read().splitlines()
track_ids, category_ids, lengths, heights, widths = [], [], [], [], []
track2index = {}
centers_list = []
yaws_list = []
timestamps_list = []
# Extract metadata for all objects in the scene
for tracklet in tracklets_str:
tracklet = tracklet.split()
if float(tracklet[1]) < 0:
continue
frame_id = int(tracklet[0])
if frame_id < start_frame or frame_id > end_frame:
continue
track_id = int(tracklet[1])
        if tracklet[2] not in _sem2label:
            continue
        obj_type = _sem2label[tracklet[2]]  # renamed from `type` to avoid shadowing the builtin
        if track_id not in track2index:  # extract metadata
            height = float(tracklet[10])
            width = float(tracklet[11])
            length = float(tracklet[12])
            track2index[track_id] = len(track_ids)
            track_ids.append(track_id)
            category_ids.append(obj_type)
heights.append(height)
widths.append(width)
lengths.append(length)
# create empty tensors for centers and yaws
centers_list.append(torch.zeros((end_frame - start_frame + 1, 3), dtype=torch.float64))
yaws_list.append(torch.zeros((end_frame - start_frame + 1, 1), dtype=torch.float64))
timestamps_list.append(-1.0 * torch.ones((end_frame - start_frame + 1, 1), dtype=torch.float64))
# extract object track
index = track2index[track_id]
timestamps_list[index][frame_id - start_frame] = float(frame_id - start_frame)
# Initialize a 4x4 identity matrix for object pose in camera coordinates
pose = np.array(
[float(tracklet[13]), float(tracklet[14]), float(tracklet[15]), float(tracklet[16])]
) # x,y,z,yaw
obj_pose_c = np.eye(4)
obj_pose_c[:3, 3] = pose[:3]
roty = pose[3]
obj_pose_c[:3, :3] = roty_matrix(roty)
# Transform object pose from camera coordinates to IMU coordinates
obj_pose_imu = np.matmul(velo2imu, np.matmul(cam2velo, obj_pose_c))
# Get the IMU pose for the corresponding frame
pose_imu_w_frame_i = poses_imu_tracking[int(frame_id)]
# Calculate the world pose of the object
pose_obj_w_i = np.matmul(pose_imu_w_frame_i, obj_pose_imu)
pose_obj_w_i = np.matmul(transform_matrix, pose_obj_w_i)
# pose_obj_w_i[:, 3] *= scale_factor
# Calculate the approximate yaw angle of the object in the world frame
yaw_aprox = -np.arctan2(pose_obj_w_i[1, 0], pose_obj_w_i[0, 0])
# store center and yaw
centers_list[index][frame_id - start_frame] = torch.from_numpy(pose_obj_w_i[:3, 3])
yaws_list[index][frame_id - start_frame] = yaw_aprox
instances_obj = Instances(
width=torch.tensor(widths),
height=torch.tensor(heights),
length=torch.tensor(lengths),
centers=torch.stack(centers_list, dim=0),
yaws=torch.stack(yaws_list, dim=0),
track_ids=torch.tensor(track_ids),
category_ids=torch.tensor(category_ids),
timestamps=torch.stack(timestamps_list, dim=0),
        valid_mask=torch.stack(timestamps_list, dim=0) >= 0,  # the first valid frame has timestamp 0.0
)
return instances_obj
def get_scene_images_tracking(
tracking_path,
sequence,
selected_frames,
use_depth=False,
use_semantic=False,
semantic_path=None,
pred_normals=False,
):
[start_frame, end_frame] = selected_frames
# imgs = []
img_name = []
depth_name = []
semantic_name = []
normal_name = []
left_img_path = os.path.join(os.path.join(tracking_path, "image_02"), sequence)
right_img_path = os.path.join(os.path.join(tracking_path, "image_03"), sequence)
left_normal_path = os.path.join(os.path.join(tracking_path, "normal_02"), sequence)
right_normal_path = os.path.join(os.path.join(tracking_path, "normal_03"), sequence)
if pred_normals:
for frame_dir in [left_normal_path, right_normal_path]:
            for frame_no in range(len(os.listdir(frame_dir))):  # count frames per directory
if start_frame <= frame_no <= end_frame:
frame = sorted(os.listdir(frame_dir))[frame_no]
fname = os.path.join(frame_dir, frame)
normal_name.append(fname)
if use_depth:
left_depth_path = os.path.join(os.path.join(tracking_path, "completion_02"), sequence)
right_depth_path = os.path.join(os.path.join(tracking_path, "completion_03"), sequence)
for frame_dir in [left_img_path, right_img_path]:
        for frame_no in range(len(os.listdir(frame_dir))):  # count frames per directory
if start_frame <= frame_no <= end_frame:
frame = sorted(os.listdir(frame_dir))[frame_no]
fname = os.path.join(frame_dir, frame)
# imgs.append(imageio.imread(fname))
img_name.append(fname)
if use_depth:
for frame_dir in [left_depth_path, right_depth_path]:
            for frame_no in range(len(os.listdir(frame_dir))):  # count frames per directory
if start_frame <= frame_no <= end_frame:
frame = sorted(os.listdir(frame_dir))[frame_no]
fname = os.path.join(frame_dir, frame)
depth_name.append(fname)
if use_semantic:
frame_dir = os.path.join(semantic_path, "train", sequence)
for _ in range(2):
for frame_no in range(len(os.listdir(frame_dir))):
if start_frame <= frame_no <= end_frame:
frame = sorted(os.listdir(frame_dir))[frame_no]
fname = os.path.join(frame_dir, frame)
semantic_name.append(fname)
# imgs = (np.maximum(np.minimum(np.array(imgs), 255), 0) / 255.0).astype(np.float32)
return img_name, depth_name, semantic_name, normal_name
def get_rays_np(H, W, focal, c2w):
"""Get ray origins, directions from a pinhole camera."""
# Numpy Version
i, j = np.meshgrid(np.arange(W, dtype=np.float32), np.arange(H, dtype=np.float32), indexing="xy")
dirs = np.stack([(i - W * 0.5) / focal, -(j - H * 0.5) / focal, -np.ones_like(i)], -1)
rays_d = np.sum(dirs[..., np.newaxis, :] * c2w[:3, :3], -1)
rays_o = np.broadcast_to(c2w[:3, -1], np.shape(rays_d))
return rays_o, rays_d
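A quick standalone sanity check of the pinhole construction in `get_rays_np`: with an identity camera-to-world pose, the centre pixel's ray should point straight down -z and all rays share the camera origin.

```python
import numpy as np

# Standalone check of the pinhole ray construction used in get_rays_np,
# with an identity camera-to-world pose.
H, W, focal = 4, 4, 100.0
c2w = np.eye(4)
i, j = np.meshgrid(np.arange(W, dtype=np.float32), np.arange(H, dtype=np.float32), indexing="xy")
dirs = np.stack([(i - W * 0.5) / focal, -(j - H * 0.5) / focal, -np.ones_like(i)], -1)
rays_d = np.sum(dirs[..., np.newaxis, :] * c2w[:3, :3], -1)   # rotate into world frame
rays_o = np.broadcast_to(c2w[:3, -1], rays_d.shape)           # all rays share one origin
```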
def rotate_yaw(p, yaw):
"""Rotates p with yaw in the given coord frame with y being the relevant axis and pointing downwards
Args:
p: 3D points in a given frame [N_pts, N_frames, 3]/[N_pts, N_frames, N_samples, 3]
yaw: Rotation angle
Returns:
p: Rotated points [N_pts, N_frames, N_samples, 3]
"""
# p of size [batch_rays, n_obj, samples, xyz]
if len(p.shape) < 4:
# p = p[..., tf.newaxis, :]
p = p.unsqueeze(-2)
c_y = torch.cos(yaw)
s_y = torch.sin(yaw)
if len(c_y.shape) < 3:
c_y = c_y.unsqueeze(-1)
s_y = s_y.unsqueeze(-1)
# c_y = tf.cos(yaw)[..., tf.newaxis]
# s_y = tf.sin(yaw)[..., tf.newaxis]
p_x = c_y * p[..., 0] - s_y * p[..., 2]
p_y = p[..., 1]
p_z = s_y * p[..., 0] + c_y * p[..., 2]
# return tf.concat([p_x[..., tf.newaxis], p_y[..., tf.newaxis], p_z[..., tf.newaxis]], axis=-1)
return torch.cat([p_x.unsqueeze(-1), p_y.unsqueeze(-1), p_z.unsqueeze(-1)], dim=-1)
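The yaw convention in `rotate_yaw` (rotation about the y axis: x' = c*x - s*z, y' = y, z' = s*x + c*z) can be checked on a single point; a 90-degree yaw maps +x to +z under this convention:

```python
import numpy as np

# Standalone check of the yaw convention used in rotate_yaw (rotation
# about y): x' = c*x - s*z, y' = y, z' = s*x + c*z.
yaw = np.pi / 2
c, s = np.cos(yaw), np.sin(yaw)
p = np.array([1.0, 0.0, 0.0])
p_rot = np.array([c * p[0] - s * p[2], p[1], s * p[0] + c * p[2]])
```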
def scale_frames(p, sc_factor, inverse=False):
"""Scales points given in N_frames in each dimension [xyz] for each frame or rescales for inverse==True
Args:
p: Points given in N_frames frames [N_points, N_frames, N_samples, 3]
sc_factor: Scaling factor for new frame [N_points, N_frames, 3]
inverse: Inverse scaling if true, bool
Returns:
p_scaled: Points given in N_frames rescaled frames [N_points, N_frames, N_samples, 3]
"""
# Take 150% of bbox to include shadows etc.
dim = torch.tensor([1.0, 1.0, 1.0]) * sc_factor
# dim = tf.constant([0.1, 0.1, 0.1]) * sc_factor
half_dim = dim / 2
scaling_factor = (1 / (half_dim + 1e-9)).unsqueeze(-2)
# scaling_factor = (1 / (half_dim + 1e-9))[:, :, tf.newaxis, :]
if not inverse:
p_scaled = scaling_factor * p
else:
p_scaled = (1.0 / scaling_factor) * p
return p_scaled
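The normalisation in `scale_frames` divides by the (shadow-padded) half-dimensions of each object box, so a point on the box boundary maps to +/-1 in every axis. A standalone check with illustrative dimensions:

```python
import numpy as np

# Standalone check of the box normalisation in scale_frames: dividing by
# the half-dimensions maps a box-corner point to +/-1 in every axis.
obj_dim = np.array([4.0, 1.6, 2.0])   # illustrative length/height/width
half_dim = obj_dim / 2
corner = obj_dim / 2                  # a corner of the axis-aligned box
scaled = corner / (half_dim + 1e-9)
```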
def get_all_ray_3dbox_intersection(rays_rgb, obj_meta_tensor, chunk, local=False, obj_to_remove=-100):
    """get all rays hitting an object given 3D multi-object-tracking results of a sequence
Args:
rays_rgb: All rays
obj_meta_tensor: Metadata of all objects
chunk: No. of rays processed at the same time
local: Limit used memory if processed on a local machine with limited CPU/GPU resources
obj_to_remove: If object should be removed from the set of rays
Returns:
rays_on_obj: Set of all rays hitting at least one object
rays_to_remove: Set of all rays hitting an object, that should not be trained
"""
print("Removing object ", obj_to_remove)
rays_on_obj = np.array([])
rays_to_remove = np.array([])
_batch_sz_inter = chunk if not local else 5000 # args.chunk
_only_intersect_rays_rgb = rays_rgb[0][None]
_n_rays = rays_rgb.shape[0]
_n_obj = (rays_rgb.shape[1] - 3) // 2
_n_bt = np.ceil(_n_rays / _batch_sz_inter).astype(np.int32)
for i in range(_n_bt):
_tf_rays_rgb = torch.from_numpy(rays_rgb[i * _batch_sz_inter : (i + 1) * _batch_sz_inter]).to(torch.float32)
# _tf_rays_rgb = tf.cast(rays_rgb[i * _batch_sz_inter:(i + 1) * _batch_sz_inter], tf.float32)
_n_bt_i = _tf_rays_rgb.shape[0]
_rays_bt = [_tf_rays_rgb[:, 0, :], _tf_rays_rgb[:, 1, :]]
_objs = torch.reshape(_tf_rays_rgb[:, 3:, :], (_n_bt_i, _n_obj, 6))
# _objs = tf.reshape(_tf_rays_rgb[:, 3:, :], [_n_bt_i, _n_obj, 6])
_obj_pose = _objs[..., :3]
_obj_theta = _objs[..., 3]
_obj_id = _objs[..., 4].to(torch.int64)
# _obj_id = tf.cast(_objs[..., 4], tf.int32)
_obj_meta = torch.index_select(obj_meta_tensor, 0, _obj_id.reshape(-1)).reshape(
-1, _obj_id.shape[1], obj_meta_tensor.shape[1]
)
# _obj_meta = tf.gather(obj_meta_tensor, _obj_id, axis=0)
_obj_track_id = _obj_meta[..., 0].unsqueeze(-1)
_obj_dim = _obj_meta[..., 1:4]
box_points_insters = box_pts(_rays_bt, _obj_pose, _obj_theta, _obj_dim, one_intersec_per_ray=False)
_mask = box_points_insters[8]
if _mask is not None:
if rays_on_obj.any():
rays_on_obj = np.concatenate([rays_on_obj, np.array(i * _batch_sz_inter + (_mask[:, 0]).cpu().numpy())])
else:
rays_on_obj = np.array(i * _batch_sz_inter + _mask[:, 0].cpu().numpy())
if obj_to_remove is not None:
_hit_id = _obj_track_id[_mask]
# _hit_id = tf.gather_nd(_obj_track_id, _mask)
# bool_remove = tf.equal(_hit_id, obj_to_remove)
bool_remove = np.equal(_hit_id, obj_to_remove)
if any(bool_remove):
# _remove_mask = tf.gather_nd(_mask, tf.where(bool_remove))
_remove_mask = np.array(_mask[:, 0])[np.where(np.equal(_hit_id, obj_to_remove))[0]]
if rays_to_remove.any():
rays_to_remove = np.concatenate([rays_to_remove, np.array(i * _batch_sz_inter + _remove_mask)])
else:
rays_to_remove = np.array(i * _batch_sz_inter + _remove_mask)
return rays_on_obj, rays_to_remove, box_points_insters
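The per-object test that `box_pts` performs boils down to a ray/axis-aligned-box intersection after transforming rays into the box frame. A standalone sketch of that check using the classic slab method (`ray_hits_unit_box` is an illustrative helper, not MARS code):

```python
import numpy as np

# Standalone sketch of a ray / axis-aligned-box intersection (slab method)
# against the unit box [-1, 1]^3, the kind of test box_pts performs per
# object once rays are expressed in the box frame.
def ray_hits_unit_box(origin, direction, eps=1e-9):
    inv_d = 1.0 / (direction + eps)       # eps avoids division by zero
    t0 = (-1.0 - origin) * inv_d          # per-axis slab entry/exit
    t1 = (1.0 - origin) * inv_d
    t_near = np.minimum(t0, t1).max()
    t_far = np.maximum(t0, t1).min()
    return bool(t_near <= t_far and t_far >= 0.0)

hit = ray_hits_unit_box(np.array([0.0, 0.0, 5.0]), np.array([0.0, 0.0, -1.0]))
miss = ray_hits_unit_box(np.array([0.0, 0.0, 5.0]), np.array([0.0, 1.0, 0.0]))
```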
@dataclass
class MarsKittiDataParserConfig(DataParserConfig):
    """neural scene graph dataset parser config"""
_target: Type = field(default_factory=lambda: MarsKittiDataparser)
"""target class to instantiate"""
data: Path = Path("data/kitti/training/image_02/0005")
"""Directory specifying location of data."""
sequence_id: str = "0006"
"""Sequence ID of the KITTI data."""
scale_factor: float = 1
"""How much to scale the camera origins by."""
scene_scale: float = 1.0
"""How much to scale the region of interest by."""
alpha_color: str = "white"
"""alpha color of background"""
first_frame: int = 65
    """specifies the first frame of the sequence when not using the complete scene as input"""
last_frame: int = 120
"""specifies the end of a sequence"""
box_scale: float = 1.5
"""Maximum scale for bboxes to include shadows"""
near_plane: float = 0.5
"""specifies the distance from the last pose to the near plane"""
far_plane: float = 150.0
"""specifies the distance from the last pose to the far plane"""
netchunk: int = 1024 * 64
"""number of pts sent through network in parallel, decrease if running out of memory"""
chunk: int = 1024 * 32
"""number of rays processed in parallel, decrease if running out of memory"""
use_car_latents: bool = True
car_object_latents_path: Optional[Path] = Path(f"pretrain/car_nerf/kitti/latent_codes{sequence_id[-2:]}.pt")
"""path of car object latent codes"""
car_nerf_state_dict_path: Optional[Path] = Path("pretrain/car_nerf/kitti/epoch_670.ckpt")
"""path of car nerf state dicts"""
use_depth: bool = True
"""whether the training loop contains depth"""
split_setting: str = "reconstruction"
use_semantic: bool = False
"""whether to use semantic information"""
semantic_path: Optional[Path] = Path("")
"""path of semantic inputs"""
semantic_mask_classes: List[str] = field(default_factory=lambda: [])
"""semantic classes that do not generate gradient to the background model"""
    storage_config: StorageConfig = field(default_factory=StorageConfig)
"""machine specific configs."""
pred_normals: bool = False
"""whether to use normals information"""
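Note on defaults in dataclass configs like the one above: mutable or instance defaults should go through `field(default_factory=...)`, since a plain default object is created once at class-definition time and shared across instances (and rejected outright for unhashable types on Python 3.11+). A minimal standalone illustration with a list default:

```python
from dataclasses import dataclass, field

# default_factory gives every instance its own fresh object instead of a
# single shared default created at class-definition time.
@dataclass
class Cfg:
    tags: list = field(default_factory=list)

a, b = Cfg(), Cfg()
a.tags.append("x")   # must not leak into b
```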
@dataclass
class MarsKittiDataparser(DataParser):
    """neural scene graph KITTI dataset"""
config: MarsKittiDataParserConfig
def __init__(self, config: MarsKittiDataParserConfig):
super().__init__(config=config)
self.data: Path = config.data
self.scale_factor: float = config.scale_factor
self.alpha_color = config.alpha_color
self.selected_frames = [config.first_frame, config.last_frame]
self.use_time = False
self.remove = -1
self.near = config.near_plane
self.far = config.far_plane
self.netchunk = config.netchunk
self.chunk = config.chunk
self.use_semantic = config.use_semantic
        self.pred_normals = config.pred_normals
        # Resolve machine-specific paths only when the storage config provides
        # them; indexing datapath_dict unconditionally raises KeyError elsewhere.
        self.semantic_path: Optional[Path] = None
        if "KITTI-MOT-home" in self.config.storage_config.datapath_dict:
            self.semantic_path = self.config.storage_config.datapath_dict["KITTI-MOT-home"] / "panoptic_maps"
            self.data = (
                self.config.storage_config.datapath_dict["KITTI-MOT-home"]
                / "training"
                / "image_02"
                / self.config.sequence_id
            )
            self.config.data = self.data
            assert self.config.data.exists()
if "CarNeRF-latents" in self.config.storage_config.datapath_dict:
self.config.car_object_latents_path = (
self.config.storage_config.datapath_dict["CarNeRF-latents"]
/ "kitti_mot"
/ "latents"
/ f"latent_codes{self.config.sequence_id[-2:]}.pt"
)
assert self.config.car_object_latents_path.exists()
if "CarNeRF-pretrained-model" in self.config.storage_config.datapath_dict:
self.config.car_nerf_state_dict_path = (
self.config.storage_config.datapath_dict["CarNeRF-pretrained-model"] / "epoch_670.ckpt"
)
assert self.config.car_nerf_state_dict_path.exists()
def _generate_dataparser_outputs(self, split="train"):
visible_objects_ls = []
objects_meta_ls = []
semantic_meta = []
kitti2vkitti = np.array(
[[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
)
if self.alpha_color is not None:
alpha_color_tensor = get_color(self.alpha_color)
else:
alpha_color_tensor = None
basedir = str(self.data)
scene_id = basedir[-4:] # check
kitti_scene_no = int(scene_id)
tracking_path = basedir[:-13] # check
calibration_path = os.path.join(os.path.join(tracking_path, "calib"), scene_id + ".txt")
oxts_path_tracking = os.path.join(os.path.join(tracking_path, "oxts"), scene_id + ".txt")
tracklet_path = os.path.join(os.path.join(tracking_path, "label_02"), scene_id + ".txt")
tracking_calibration = tracking_calib_from_txt(calibration_path)
focal_X = tracking_calibration["P2"][0, 0]
focal_Y = tracking_calibration["P2"][1, 1]
poses_imu_w_tracking, _, _ = get_poses_calibration(basedir, oxts_path_tracking) # (n_frames, 4, 4) imu pose
tr_imu2velo = tracking_calibration["Tr_imu2velo"]
tr_velo2imu = invert_transformation(tr_imu2velo[:3, :3], tr_imu2velo[:3, 3])
poses_velo_w_tracking = np.matmul(poses_imu_w_tracking, tr_velo2imu) # (n_frames, 4, 4) velodyne pose
if self.use_semantic:
semantics = pd.read_csv(
os.path.join(self.semantic_path, "colors", scene_id + ".txt"),
sep=" ",
index_col=False,
)
if self.use_semantic:
semantics = semantics.loc[~semantics["Category"].isin(self.config.semantic_mask_classes)]
semantic_meta = Semantics(
filenames=[],
classes=semantics["Category"].tolist(),
colors=torch.tensor(semantics.iloc[:, 1:].values),
mask_classes=self.config.semantic_mask_classes,
)
# Get camera Poses camare id: 02, 03
for cam_i in range(2):
transformation = np.eye(4)
projection = tracking_calibration["P" + str(cam_i + 2)] # rectified camera coordinate system -> image
K_inv = np.linalg.inv(projection[:3, :3])
R_t = projection[:3, 3]
t_crect2c = np.matmul(K_inv, R_t)
# t_crect2c = 1./projection[[0, 1, 2],[0, 1, 2]] * projection[:, 3]
transformation[:3, 3] = t_crect2c
tracking_calibration["Tr_camrect2cam0" + str(cam_i + 2)] = transformation
sequ_frames = self.selected_frames
cam_poses_tracking = get_camera_poses_tracking(
poses_velo_w_tracking, tracking_calibration, sequ_frames, kitti_scene_no
)
# cam_poses_tracking[..., :3, 3] *= self.scale_factor
# Orients and centers the poses
oriented = torch.from_numpy(np.array(cam_poses_tracking).astype(np.float32)) # (n_frames, 3, 4)
oriented, transform_matrix = camera_utils.auto_orient_and_center_poses(
oriented
) # oriented (n_frames, 3, 4), transform_matrix (3, 4)
row = torch.tensor([0, 0, 0, 1], dtype=torch.float32)
zeros = torch.zeros(oriented.shape[0], 1, 4)
oriented = torch.cat([oriented, zeros], dim=1)
oriented[:, -1] = row # (n_frames, 4, 4)
transform_matrix = torch.cat([transform_matrix, row[None, :]], dim=0) # (4, 4)
cam_poses_tracking = oriented.numpy()
transform_matrix = transform_matrix.numpy()
image_filenames, depth_name, semantic_name, normal_name = get_scene_images_tracking(
tracking_path,
scene_id,
sequ_frames,
self.config.use_depth,
self.use_semantic,
self.semantic_path,
self.pred_normals,
)
# Get Object poses
instances_obj = get_obj_pose_tracking(
tracklet_path, poses_imu_w_tracking, tracking_calibration, sequ_frames, transform_matrix
)
# deleted visible_objects_, objects_meta_,
# # Align Axis with vkitti axis
poses = np.matmul(kitti2vkitti, cam_poses_tracking).astype(np.float32)
poses[..., :3, 3] *= self.scale_factor
instances_obj.centers[:, :, 2] *= -1
instances_obj.centers[:, :, [0, 1, 2]] = instances_obj.centers[:, :, [0, 2, 1]]
instances_obj.centers *= self.scale_factor
instances_obj.width *= (
self.config.box_scale * self.scale_factor
) # scale up the bbox to include shadows, etc.; scale bbox according to the scene scale factor
instances_obj.length *= self.config.box_scale * self.scale_factor
instances_obj.height *= self.scale_factor # height is not multiplied by box scale
N_obj = instances_obj.num_instances
counts = np.arange(instances_obj.len_tracklet * 2).reshape(2, -1)
frame_timestamps = np.concatenate(
[np.arange(instances_obj.len_tracklet), np.arange(instances_obj.len_tracklet)]
)
i_test = np.array([(idx + 1) % 4 == 0 for idx in counts[0]])
i_test = np.concatenate((i_test, i_test))
if self.config.split_setting == "reconstruction":
i_train = np.ones(instances_obj.len_tracklet * 2, dtype=bool)
elif self.config.split_setting == "nvs-75":
i_train = ~i_test
elif self.config.split_setting == "nvs-50":
i_train = np.array([(idx + 1) % 4 > 1 for idx in counts[0]])
i_train = np.concatenate((i_train, i_train))
elif self.config.split_setting == "nvs-25":
i_train = np.array([idx % 4 == 0 for idx in counts[0]])
i_train = np.concatenate((i_train, i_train))
else:
raise ValueError("No such split method")
counts = counts.reshape(-1)
i_train = counts[i_train]
train_timestamps = frame_timestamps[i_train]
i_test = counts[i_test]
test_timestamps = frame_timestamps[i_test]
test_load_image = imageio.imread(image_filenames[0])
image_height, image_width = test_load_image.shape[:2]
cx, cy = image_width / 2.0, image_height / 2.0
if split == "train":
indices = i_train
timestamps = train_timestamps
elif split == "val":
indices = i_test
timestamps = test_timestamps
elif split == "test":
indices = i_test
timestamps = test_timestamps
else:
raise ValueError(f"Unknown dataparser split {split}")
image_filenames = [image_filenames[i] for i in indices]
depth_filenames = [depth_name[i] for i in indices] if self.config.use_depth else None
normal_filenames = [normal_name[i] for i in indices] if self.pred_normals else None
if self.use_semantic:
semantic_meta.filenames = [semantic_name[i] for i in indices]
poses = poses[indices]
if self.config.use_car_latents:
if not self.config.car_object_latents_path.exists():
CONSOLE.print("[yello]Error: latents not exist")
exit()
car_latents = torch.load(str(self.config.car_object_latents_path))
track_car_latents = {}
track_car_latents_mean = {}
for k, idx in enumerate(car_latents["indices"]):
if sequ_frames[0] <= idx["fid"] <= sequ_frames[1]:
if idx["oid"] in track_car_latents.keys():
track_car_latents[idx["oid"]] = torch.cat(
[track_car_latents[idx["oid"]], car_latents["latents"][k].unsqueeze(-1)], dim=-1
)
else:
track_car_latents[idx["oid"]] = car_latents["latents"][k].unsqueeze(-1)
for k in track_car_latents.keys():
track_car_latents_mean[k] = track_car_latents[k][..., -1]
else:
car_latents = None
aabb_scale = self.config.scene_scale
scene_box = SceneBox(
aabb=torch.tensor(
[[-aabb_scale, -aabb_scale, -aabb_scale], [aabb_scale, aabb_scale, aabb_scale]], dtype=torch.float32
)
)
cameras = Cameras(
camera_to_worlds=torch.from_numpy(poses[:, :3, :4]),
fx=focal_X,
fy=focal_Y,
cx=cx,
cy=cy,
camera_type=CameraType.PERSPECTIVE,
height=image_height,
width=image_width,
)
dataparser_outputs = DataparserOutputs(
image_filenames=image_filenames,
cameras=cameras,
alpha_color=alpha_color_tensor,
scene_box=scene_box,
mask_filenames=None,
dataparser_scale=self.scale_factor,
metadata={
"depth_filenames": depth_filenames,
"obj_class": instances_obj.unique_category_ids,
"scale_factor": self.scale_factor,
"semantics": semantic_meta,
"normal_filenames": normal_filenames,
"instances": instances_obj,
"frame_timestamps": timestamps,
},
)
if self.config.use_car_latents:
dataparser_outputs.metadata.update(
{
"car_latents": track_car_latents_mean,
"car_nerf_state_dict_path": self.config.car_nerf_state_dict_path,
}
)
print("finished data parsing")
return dataparser_outputs
KittiParserSpec = DataParserSpecification(config=MarsKittiDataParserConfig())
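If you adapt the pose math in this script, a quick standalone sanity check of `invert_transformation` (used above to go from `Tr_imu2velo` to `tr_velo2imu`) with plain NumPy can catch axis or sign mistakes early. This is just an illustrative sketch with made-up test values, not part of the release script:

```python
import numpy as np

def invert_transformation(rot, t):
    # Inverse of a rigid transform [R | t] is [R^T | -R^T t]
    t = np.matmul(-rot.T, t)
    inv_translation = np.concatenate([rot.T, t[:, None]], axis=1)
    return np.concatenate([inv_translation, np.array([[0.0, 0.0, 0.0, 1.0]])])

# Build an arbitrary rigid transform from a yaw rotation and a translation
yaw = 0.3
R = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
              [np.sin(yaw), np.cos(yaw), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([1.0, -2.0, 0.5])
T = np.eye(4)
T[:3, :3] = R
T[:3, 3] = t

T_inv = invert_transformation(T[:3, :3], T[:3, 3])
# Composing the transform with its inverse should give the identity
assert np.allclose(T @ T_inv, np.eye(4), atol=1e-12)
```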
self.config.storage_config.datapath_dict["KITTI-MOT-home"] / "training" / "image_02" / self.config.sequence_id ) self.config.data = self.data assert self.config.data.exists() if "CarNeRF-latents" in self.config.storage_config.datapath_dict: self.config.car_object_latents_path = ( self.config.storage_config.datapath_dict["CarNeRF-latents"] / "kitti_mot" / "latents" / f"latent_codes{self.config.sequence_id[-2:]}.pt" ) assert self.config.car_object_latents_path.exists() if "CarNeRF-pretrained-model" in self.config.storage_config.datapath_dict: self.config.car_nerf_state_dict_path = ( self.config.storage_config.datapath_dict["CarNeRF-pretrained-model"] / "epoch_670.ckpt" ) assert self.config.car_nerf_state_dict_path.exists() def _generate_dataparser_outputs(self, split="train"): visible_objects_ls = [] objects_meta_ls = [] semantic_meta = [] kitti2vkitti = np.array( [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]] ) if self.alpha_color is not None: alpha_color_tensor = get_color(self.alpha_color) else: alpha_color_tensor = None basedir = str(self.data) scene_id = basedir[-4:] # check kitti_scene_no = int(scene_id) tracking_path = basedir[:-13] # check calibration_path = os.path.join(os.path.join(tracking_path, "calib"), scene_id + ".txt") oxts_path_tracking = os.path.join(os.path.join(tracking_path, "oxts"), scene_id + ".txt") tracklet_path = os.path.join(os.path.join(tracking_path, "label_02"), scene_id + ".txt") tracking_calibration = tracking_calib_from_txt(calibration_path) focal_X = tracking_calibration["P2"][0, 0] focal_Y = tracking_calibration["P2"][1, 1] poses_imu_w_tracking, _, _ = get_poses_calibration(basedir, oxts_path_tracking) # (n_frames, 4, 4) imu pose tr_imu2velo = tracking_calibration["Tr_imu2velo"] tr_velo2imu = invert_transformation(tr_imu2velo[:3, :3], tr_imu2velo[:3, 3]) poses_velo_w_tracking = np.matmul(poses_imu_w_tracking, tr_velo2imu) # (n_frames, 4, 4) velodyne pose if self.use_semantic: semantics = 
pd.read_csv( os.path.join(self.semantic_path, "colors", scene_id + ".txt"), sep=" ", index_col=False, ) if self.use_semantic: semantics = semantics.loc[~semantics["Category"].isin(self.config.semantic_mask_classes)] semantic_meta = Semantics( filenames=[], classes=semantics["Category"].tolist(), colors=torch.tensor(semantics.iloc[:, 1:].values), mask_classes=self.config.semantic_mask_classes, ) # Get camera Poses camare id: 02, 03 for cam_i in range(2): transformation = np.eye(4) projection = tracking_calibration["P" + str(cam_i + 2)] # rectified camera coordinate system -> image K_inv = np.linalg.inv(projection[:3, :3]) R_t = projection[:3, 3] t_crect2c = np.matmul(K_inv, R_t) # t_crect2c = 1./projection[[0, 1, 2],[0, 1, 2]] * projection[:, 3] transformation[:3, 3] = t_crect2c tracking_calibration["Tr_camrect2cam0" + str(cam_i + 2)] = transformation sequ_frames = self.selected_frames cam_poses_tracking = get_camera_poses_tracking( poses_velo_w_tracking, tracking_calibration, sequ_frames, kitti_scene_no ) # cam_poses_tracking[..., :3, 3] *= self.scale_factor # Orients and centers the poses oriented = torch.from_numpy(np.array(cam_poses_tracking).astype(np.float32)) # (n_frames, 3, 4) oriented, transform_matrix = camera_utils.auto_orient_and_center_poses( oriented ) # oriented (n_frames, 3, 4), transform_matrix (3, 4) row = torch.tensor([0, 0, 0, 1], dtype=torch.float32) zeros = torch.zeros(oriented.shape[0], 1, 4) oriented = torch.cat([oriented, zeros], dim=1) oriented[:, -1] = row # (n_frames, 4, 4) transform_matrix = torch.cat([transform_matrix, row[None, :]], dim=0) # (4, 4) cam_poses_tracking = oriented.numpy() transform_matrix = transform_matrix.numpy() image_filenames, depth_name, semantic_name, normal_name = get_scene_images_tracking( tracking_path, scene_id, sequ_frames, self.config.use_depth, self.use_semantic, self.semantic_path, self.pred_normals, ) # Get Object poses instances_obj = get_obj_pose_tracking( tracklet_path, poses_imu_w_tracking, 
tracking_calibration, sequ_frames, transform_matrix ) # deleted visible_objects_, objects_meta_, # # Align Axis with vkitti axis poses = np.matmul(kitti2vkitti, cam_poses_tracking).astype(np.float32) poses[..., :3, 3] *= self.scale_factor instances_obj.centers[:, :, 2] *= -1 instances_obj.centers[:, :, [0, 1, 2]] = instances_obj.centers[:, :, [0, 2, 1]] instances_obj.centers *= self.scale_factor instances_obj.width *= ( self.config.box_scale * self.scale_factor ) # scale up the bbox to include shadows, etc.; scale bbox according to the scene scale factor instances_obj.length *= self.config.box_scale * self.scale_factor instances_obj.height *= self.scale_factor # height is not multiplied by box scale N_obj = instances_obj.num_instances counts = np.arange(instances_obj.len_tracklet * 2).reshape(2, -1) frame_timestamps = np.concatenate( [np.arange(instances_obj.len_tracklet), np.arange(instances_obj.len_tracklet)] ) i_test = np.array([(idx + 1) % 4 == 0 for idx in counts[0]]) i_test = np.concatenate((i_test, i_test)) if self.config.split_setting == "reconstruction": i_train = np.ones(instances_obj.len_tracklet * 2, dtype=bool) elif self.config.split_setting == "nvs-75": i_train = ~i_test elif self.config.split_setting == "nvs-50": i_train = np.array([(idx + 1) % 4 > 1 for idx in counts[0]]) i_train = np.concatenate((i_train, i_train)) elif self.config.split_setting == "nvs-25": i_train = np.array([idx % 4 == 0 for idx in counts[0]]) i_train = np.concatenate((i_train, i_train)) else: raise ValueError("No such split method") counts = counts.reshape(-1) i_train = counts[i_train] train_timestamps = frame_timestamps[i_train] i_test = counts[i_test] test_timestamps = frame_timestamps[i_test] test_load_image = imageio.imread(image_filenames[0]) image_height, image_width = test_load_image.shape[:2] cx, cy = image_width / 2.0, image_height / 2.0 if split == "train": indices = i_train timestamps = train_timestamps elif split == "val": indices = i_test timestamps = 
test_timestamps elif split == "test": indices = i_test timestamps = test_timestamps else: raise ValueError(f"Unknown dataparser split {split}") image_filenames = [image_filenames[i] for i in indices] depth_filenames = [depth_name[i] for i in indices] if self.config.use_depth else None normal_filenames = [normal_name[i] for i in indices] if self.pred_normals else None if self.use_semantic: semantic_meta.filenames = [semantic_name[i] for i in indices] poses = poses[indices] if self.config.use_car_latents: if not self.config.car_object_latents_path.exists(): CONSOLE.print("[yello]Error: latents not exist") exit() car_latents = torch.load(str(self.config.car_object_latents_path)) track_car_latents = {} track_car_latents_mean = {} for k, idx in enumerate(car_latents["indices"]): if sequ_frames[0] <= idx["fid"] <= sequ_frames[1]: if idx["oid"] in track_car_latents.keys(): track_car_latents[idx["oid"]] = torch.cat( [track_car_latents[idx["oid"]], car_latents["latents"][k].unsqueeze(-1)], dim=-1 ) else: track_car_latents[idx["oid"]] = car_latents["latents"][k].unsqueeze(-1) for k in track_car_latents.keys(): track_car_latents_mean[k] = track_car_latents[k][..., -1] else: car_latents = None aabb_scale = self.config.scene_scale scene_box = SceneBox( aabb=torch.tensor( [[-aabb_scale, -aabb_scale, -aabb_scale], [aabb_scale, aabb_scale, aabb_scale]], dtype=torch.float32 ) ) cameras = Cameras( camera_to_worlds=torch.from_numpy(poses[:, :3, :4]), fx=focal_X, fy=focal_Y, cx=cx, cy=cy, camera_type=CameraType.PERSPECTIVE, height=image_height, width=image_width, ) dataparser_outputs = DataparserOutputs( image_filenames=image_filenames, cameras=cameras, alpha_color=alpha_color_tensor, scene_box=scene_box, mask_filenames=None, dataparser_scale=self.scale_factor, metadata={ "depth_filenames": depth_filenames, "obj_class": instances_obj.unique_category_ids, "scale_factor": self.scale_factor, "semantics": semantic_meta, "normal_filenames": normal_filenames, "instances": instances_obj, 
"frame_timestamps": timestamps, }, ) if self.config.use_car_latents: dataparser_outputs.metadata.update( { "car_latents": track_car_latents_mean, "car_nerf_state_dict_path": self.config.car_nerf_state_dict_path, } ) print("finished data parsing") return dataparser_outputs KittiParserSpec = DataParserSpecification(config=MarsKittiDataParserConfig())
Thank you! @wuzirui IIUC, the most important refactor here is using an Object Instance module instead of a dense object-nodes tensor, which is not memory efficient? But the implementation and usage of Object Instance are unknown, so there is nothing we can really do until that version is released. And if "minor changes should be made to plug it into the current mars release version", would it fall back to the high-memory-usage version? Correct me if I misunderstood. Thanks a lot.
Hi @Nplace-su! Thank you for your interest. The previous version is not memory efficient because it creates a huge tensor of [N, H, W, C] that contains exactly the same object trajectory info in each pixel. In this version, we instead use a unique instance trajectory module, which is just a PyTorch dataclass. At inference time, you just index the needed object info from the instance trajectory module with the frame timestamps. We will release the code, but there's still a lot of work before we can upload it. Hope this short description helps. If there's any problem, feel free to contact me.
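The memory difference described here can be sketched with plain numpy. The scene dimensions below (50 frames, 375×1242 images, 10 objects, 6 values per object) are hypothetical, chosen only to be KITTI-like: the dense layout repeats the same per-frame trajectory row once per pixel, while an instance table stores it once per frame and is indexed by timestamp.

```python
import numpy as np

n_frames, H, W, n_obj, obj_dim = 50, 375, 1242, 10, 6

# Dense layout (old): object info duplicated into every pixel.
dense_bytes = n_frames * H * W * n_obj * obj_dim * 4  # float32

# Instance-table layout (new): one trajectory row per object per frame.
table = np.zeros((n_frames, n_obj, obj_dim), dtype=np.float32)
table_bytes = table.nbytes

print(f"dense: {dense_bytes / 1e9:.1f} GB")   # several GB for this toy scene
print(f"table: {table_bytes / 1e6:.3f} MB")   # a few KB

def obj_info_for_frame(t):
    # Indexing by frame timestamp recovers what the dense tensor held per pixel.
    return table[t]  # shape (n_obj, obj_dim), broadcastable over pixels

assert obj_info_for_frame(3).shape == (n_obj, obj_dim)
```

The dense tensor is larger by a factor of H×W (about 465,000 here), which is why the flattened layout dominates RAM even for a 16 GB dataset.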
Hi @wuzirui and @Nplace-su! Thanks for your contributions and the updated code. So basically, in the current version on the mars/refactor branches, all the data is stored and loaded into RAM at the same time, while it is more or less generator-based in the updated version? Is this correct? Also, are you actively working on this? Thanks for your help!
Thanks @wuzirui for your quick reply and the detailed information! I will try to implement this optimization according to your description. And I hope the official code will be uploaded ASAP, XD.
Hi, the updated version will still store all the trajectory info in RAM; the difference is that the current version keeps a LOT of redundant data, while the new version removes the redundancy. Also, interpolation over time will be supported.
Hi, in nsg_vkitti_dataparser.py the "obj_info" tensor stores all object information for the whole training dataset, which costs too much RAM. How can I split it into batches, like torch's DataLoader does?
Hi, please see the updated parser I listed in this issue. It should help solve the RAM issue.
For anyone who met the same issue: you can move the repeat_interleave and permute ops to every batch instead of applying them to the whole data. I could bypass the memory peak with this workaround.
Hi~ I'm a little confused about how to move these ops to every batch. Could you share the code you modified? Thanks a lot. @Nplace-su
Just remove the repeat_interleave and permute in the *_dataparser.py. Then, in mars_datamanager.py and mars_pipeline.py, change object_rays_info = self.train_dataset.metadata["obj_info"][c, y, x] to object_rays_info = self.train_dataset.metadata["obj_info"][c], and apply the repeat_interleave and permute ops to object_rays_info in functions like next_eval_image(), etc., every time before you use object_rays_info. Feel free to ask if you meet any problems. @Yurri-hub
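The idea of this workaround can be sketched with numpy (np.repeat plays the role of torch.repeat_interleave; the shapes and names below are illustrative, not the actual mars tensors): keep the compact per-camera table in RAM for the whole run, and expand rows only for the rays the current batch touches, instead of materializing per-ray copies for the entire dataset up front.

```python
import numpy as np

n_cams, rays_per_cam, n_obj, obj_dim = 4, 8, 3, 6

# Compact per-camera object info, kept in RAM for the whole run.
obj_info = np.random.rand(n_cams, n_obj, obj_dim).astype(np.float32)

# Old approach (memory peak): expand to one copy per ray for ALL cameras.
dense = np.repeat(obj_info[:, None], rays_per_cam, axis=1)  # (n_cams, rays, n_obj, obj_dim)

def obj_info_for_batch(cam_indices):
    # Workaround: gather one row per sampled ray; the expansion happens
    # on a tiny slice, per batch, not on the whole dataset.
    return obj_info[cam_indices]  # (batch, n_obj, obj_dim)

batch_cams = np.array([0, 0, 2])  # camera index of each sampled ray
per_ray = obj_info_for_batch(batch_cams)
assert per_ray.shape == (3, n_obj, obj_dim)
assert np.array_equal(per_ray[0], dense[0, 0])  # same values as the dense layout
```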
Thanks a lot. I'll try it and feed back soon.
Thanks to your help, I have successfully optimized the memory usage, and both the training and evaluation processes are functioning normally now.
However, I haven't used any permute ops for the object operations, probably because the order itself is already correct. Just like this: (code screenshot omitted)
Once again, thank you very much for your assistance! @Nplace-su
@Yurri-hub Nice, thanks for your information. Actually, I had no idea what the permute op was for, so I just kept it for safety LOL. (After a more careful look: you didn't need the permute ops because you repeat_interleave on dim 0, which is equivalent to the original repeat_interleave on dim 1 followed by the permute.)
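The equivalence noted here can be checked with a small numpy example (np.repeat stands in for torch.repeat_interleave; the shapes are illustrative): repeating each row k times along dim 0 gives exactly the same array as inserting a new axis, repeating along it, and flattening it back.

```python
import numpy as np

x = np.arange(6).reshape(3, 2)  # (rows, features)
k = 4

# "repeat_interleave on dim 0": each row duplicated k times in place.
a = np.repeat(x, k, axis=0)  # shape (12, 2)

# "repeat_interleave on dim 1, then rearrange": repeat along a new axis,
# then collapse it — the permute/reshape puts the copies back in row order.
b = np.repeat(x[:, None, :], k, axis=1).reshape(-1, 2)

assert np.array_equal(a, b)
```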
Hi, while loading the vkitti2 dataset I'm experiencing huge RAM loads of up to 200 GB, on both the master and refactor branches. Since the dataset itself is only around 16 GB, I was wondering whether this is intended/necessary. Do you have any ideas why this occurs?