fuenwang / 3D-BoundingBox

PyTorch implementation for 3D Bounding Box Estimation Using Deep Learning and Geometry

Visualize the 3D coordinate #3

Open herleeyandi opened 6 years ago

herleeyandi commented 6 years ago

Hello, I am very new to this research area. How can we draw the 3D bounding boxes? I have read the paper and still don't understand how to map from our 3D coordinates to the 2D image. From your code we can get the dimensions, the center point, and the angle, but how can we draw the box on the image using OpenCV? -Thank You-

fuenwang commented 6 years ago

If you want to project 3D points onto the 2D image, you will need the camera intrinsics, which are provided by KITTI, but I didn't include that information in my code. The calibration files you download from KITTI give you K[R|T]. You can use the KITTI parser of PyDriver to help you parse the label files.
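
For anyone looking for the projection step itself, here is a minimal sketch (assuming you have already parsed the 3x4 P2 matrix from the KITTI calib file; the function name is just illustrative):

import numpy as np

def project_to_image(point_3d, P2):
    # point_3d: (3,) [X, Y, Z] in camera coordinates; P2: (3, 4) KITTI projection matrix
    point_h = np.append(point_3d, 1.0)   # homogeneous coordinates
    pixel = P2.dot(point_h)              # (3,) homogeneous image point
    return pixel[:2] / pixel[2]          # perspective divide -> (u, v)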

herleeyandi commented 6 years ago

Hello @fuenwang, sorry, after several days I am still confused about how to get the 8 corner coordinates. First, we have the orientation from the model; what does this variable contain? We also have many angles, like Ray, ThetaRay, and LocalAngle, so what does our model predict exactly? Suppose we have an orientation; what is the relation between our predicted orientation and the R in K[R|T]? As far as I know, R is a geometric rotation matrix, which is [[cos(x), -sin(x)], [sin(x), cos(x)]], so what is x here? Is it the LocalAngle? Also, where can I find what KITTI's label fields mean? They have 15 elements, where the first is the class, and then there are 14 other numbers, and I can't find any reference about what those numbers are exactly. -Thank you-

fuenwang commented 5 years ago
  1. According to the paper, the model makes two predictions. The first is the angle offset for each bin of the circle (orient); the second is the confidence of each bin. So if the offset is 20 degrees and the corresponding bin center is 120 degrees, then the local angle (theta_l in Fig. 3) will be 120 + 20 = 140 degrees (see the sketch below).

  2. K[R|T] is for the camera itself, not the object. K is the intrinsics; R and T are the rotation (3 DoF) and translation (3 DoF) of the camera, which are provided by KITTI. To get the translation of the object, you need to solve all the corner combinations using the relation in Section 3.2, but I haven't finished that part.

  3. For the field descriptions, you can download the development kit; its readme file has a full description.
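
To make item 1 concrete, here is a minimal sketch of decoding the local angle from the MultiBin outputs (the names orient, conf, and bin_centers are illustrative, not the exact ones used in this repo):

import numpy as np

def decode_local_angle(orient, conf, bin_centers):
    # orient:      (num_bins, 2) predicted (sin, cos) offset per bin
    # conf:        (num_bins,)   bin confidences
    # bin_centers: (num_bins,)   bin center angles in radians
    best = int(np.argmax(conf))                        # most confident bin
    offset = np.arctan2(orient[best, 0], orient[best, 1])
    return bin_centers[best] + offset                  # theta_l = bin center + offset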

herleeyandi commented 5 years ago

Thank you so much @fuenwang, that is a very clear explanation. Another question: as we know, we need the ROI of the object, the camera intrinsics, the dimensions, and the location to get the final 3D bounding box. The model itself just takes a cropped picture as input and outputs the dimensions, orientation, and confidence. For the camera intrinsics in this implementation I get them from the KITTI calib folder, but where can we get the center location (T in the equation) from the model's predicted result?

fuenwang commented 5 years ago

According to the KITTI documentation here, P2 in the files under calib/ is the projection matrix of the left color camera, whose dimension is 3x4. This projection matrix is K[R|T], and I think it's what you are looking for.

herleeyandi commented 5 years ago

Hello @fuenwang, thank you for your explanation. I have read the KITTI documentation carefully and also looked at code from other GitHub repositories, but I am still confused about many things. Let me summarize here:

*OUR GOAL: Only detect the 3D bounding box; let me ignore the 2D bounding box and the class.

*WHAT WE HAVE: K, the camera intrinsics, which can be found in the P2 entry of the calib file; the 2D bounding box in the labels (index 4 to 7); the image in the image folder.

*WANT TO COMPUTE: the 8 points of the 3D bounding box.

*WHAT WE NEED: calibration -> provided; 2D bounding box -> provided; cropped image before VGG19 -> provided; rotation (R) -> unknown; dimension -> unknown; location/center/T in K[R|T] -> unknown.

*WHAT THE MODEL PREDICTS: orientation -> gives theta; dimension -> exactly what we need; confidence -> not sure whether it is the class of the object or the MultiBin confidence, and I'm confused about what it looks like. So where is the location/center/T in K[R|T]?

*WHAT I SEE: most people ignore this; they just use the labels at index 11 to 13. Another parameter is Ry (index 14 in the label), but we can get it from the orientation predicted by the model.

*THEREFORE: suppose I want a program to regress the 3D bbox and I have an object detection method, say YOLO, Faster R-CNN, etc. Then I need to crop the image in the 2D bbox area. I also have the camera calibration, so the last thing I need is the location/center/T in K[R|T]. Where can I find it and how?

*ACCORDING TO KITTI: we need to find the dimension and also the location.

Sorry for the long post, and thank you so much for teaching me. Please correct me if something is wrong in there.

fuenwang commented 5 years ago
  1. The confidence our model predicts tells us which bin the theta is actually located in. So its dimension is [batch_size x num_of_bins], and softmax is applied to get a probability for each bin.

  2. I think the location is the labels at index 11 to 13, which are already in camera coordinates. Because our model provides the rotation, we now have R. To get the translation, we can consider the equation [0 0 0]^T = R location^T + T, so T = -R location^T (see the sketch below).
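
In code, item 2 amounts to something like this (a sketch, assuming R is the 3x3 object rotation and location is the label's 3D location in camera coordinates):

import numpy as np

def translation_from_location(R, location):
    # Solve [0, 0, 0]^T = R @ location + T for T, i.e. T = -R @ location.
    return -np.asarray(R) @ np.asarray(location)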

superZ678 commented 5 years ago

@fuenwang Hello, I recently ran your code and I have a few questions. First, the final output of the code only shows the "Angle error" and "Dimension error" results; where should I look to get the coordinates of the points of the generated 3D bounding box in the 2D image? I saw in the article that the 3D bounding box is drawn directly on the 2D image, so how do I visualize the 3D box like that? This problem has been bothering me for a long time. What I mainly want are the image coordinates of the 3D bounding box points, so that I can draw the 3D box on the image. Since I am a novice, I would like to ask you to help me out. Thank you very much.

superZ678 commented 5 years ago

@herleeyandi Hello, I would like to ask you a couple of questions. Does the theta angle output by the code refer to the global orientation angle in the article? And is the index-14 parameter of the label in the dataset the ground truth of the global orientation angle that the paper predicts?

fuenwang commented 5 years ago

@tingZ123 I just saw this issue now, haha. In my eval code, I only evaluate the angle error and the dimension error. If you want to obtain the 3D box, you have to calculate the object translation as described in Sec. 3.2 of the paper, but I didn't implement that part in this repo.
The theta here (https://github.com/fuenwang/3D-BoundingBox/blob/master/Eval.py#L76) should be the global angle. The index-14 parameter is the ground-truth global direction.

superZ678 commented 5 years ago

@fuenwang I am honored to receive your reply. However, I have a few questions.
1. The angle error output by the code is the difference between theta and Ry, and Ry is the value at index 14 of the label, isn't it? And theta is the angle between the direction of the object and the X-axis of the camera coordinate system, right?
2. In the code: theta = np.arctan2(sin, cos) / np.pi * 180; theta = theta + centerAngle[argmax] / np.pi * 180; theta = 360 - info['ThetaRay'] - theta. Is the second theta the local orientation angle in the article, and the third theta the global orientation angle in the article, right?
3. Is the physical meaning of ThetaRay in the code the same as θ_ray in the article?
4. In your eval code, if I want to input one image and evaluate the angle error and dimension error of each object in the picture, what should I do?

fuenwang commented 5 years ago

For Q1,2,3, that is all correct.

For Q4: you can print the variable "batch" (https://github.com/fuenwang/3D-BoundingBox/blob/master/Eval.py#L56); its dimension is [bs, ch, 224, 224]. Just change it to the image you want to input.

superZ678 commented 5 years ago

@fuenwang 1. In the paper, θ_ray refers to the angle between the observation direction and the X-axis; does it have the same physical meaning in the code (https://github.com/fuenwang/3D-BoundingBox/blob/master/Library/Dataset.py#L32)? 2. I actually want to input a specific picture and then output the global orientation angle (theta) and dimension (dim) of each object in it. How should I modify the code?

fuenwang commented 5 years ago
  1. Yes, that is the same meaning.
  2. Change the batch input image (https://github.com/fuenwang/3D-BoundingBox/blob/master/Eval.py#L56), and then the theta in this line is the global angle (https://github.com/fuenwang/3D-BoundingBox/blob/master/Eval.py#L76).
superZ678 commented 5 years ago

@fuenwang Thank you very much for your answer. I have a general idea of how to do it now, but I also need to ask: is the fourth parameter "alpha" in the 2D detection labels the same angle as ThetaRay in the code?

tonysy commented 5 years ago

I use the following code to plot the 3D box:

import numpy as np
import cv2

def plot_3d_bbox(img, label_info):

    alpha = label_info['alpha']
    # theta_ray = label_info['theta_ray']
    box_3d = []
    center = label_info['location']      # 3D location from the label (camera coordinates)
    dims = label_info['dimension']       # (height, width, length)
    calib = label_info['calib']
    cam_to_img = calib['P2']             # 3x4 projection matrix
    rot_y = alpha / 180 * np.pi + np.arctan(center[0] / center[2])
    # import pdb; pdb.set_trace()

    # Generate the 8 corners of the 3D box and project each onto the image.
    for i in [1, -1]:
        for j in [1, -1]:
            for k in [0, 1]:
                point = np.copy(center)
                point[0] = center[0] + i * dims[1]/2 * np.cos(-rot_y + np.pi/2) + (j*i) * dims[2]/2 * np.cos(-rot_y)
                point[2] = center[2] + i * dims[1]/2 * np.sin(-rot_y + np.pi/2) + (j*i) * dims[2]/2 * np.sin(-rot_y)
                point[1] = center[1] - k * dims[0]

                point = np.append(point, 1)        # homogeneous coordinates
                point = np.dot(cam_to_img, point)  # project with P2
                point = point[:2] / point[2]
                point = point.astype(np.int16)
                box_3d.append(point)

    # Vertical edges; also remember the corners of the front face.
    front_mark = []
    for i in range(4):
        point_1_ = box_3d[2*i]
        point_2_ = box_3d[2*i + 1]
        cv2.line(img, (point_1_[0], point_1_[1]), (point_2_[0], point_2_[1]), (255, 0, 0), 2)
        if i == 0 or i == 3:
            front_mark.append((point_1_[0], point_1_[1]))
            front_mark.append((point_2_[0], point_2_[1]))

    # Cross marking the front face.
    cv2.line(img, front_mark[0], front_mark[-1], (255, 0, 0), 2)
    cv2.line(img, front_mark[1], front_mark[2], (255, 0, 0), 2)

    # Edges of the top and bottom faces.
    for i in range(8):
        point_1_ = box_3d[i]
        point_2_ = box_3d[(i+2) % 8]
        cv2.line(img, (point_1_[0], point_1_[1]), (point_2_[0], point_2_[1]), (255, 0, 255), 2)

    return img

Hope it helps!
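
For reference, a usage sketch for the function above. The file paths are hypothetical, and the label values are taken from the KITTI ground-truth "Car" line quoted further down in this thread (note the function above expects alpha in degrees, so the radian value from the label is converted):

import numpy as np
import cv2

image = cv2.imread('data/image_2/000500.png')

# Read P2 (3x4) from the calib file, same as elsewhere in this thread.
for line in open('data/calib/000500.txt'):
    if 'P2:' in line:
        P2 = np.array(line.strip().split(' ')[1:], dtype=float).reshape(3, 4)

line = 'Car 0.00 0 1.85 387.63 181.54 423.81 203.12 1.67 1.87 3.69 -16.53 2.39 58.49 1.57'.split(' ')
label_info = {
    'alpha': float(line[3]) / np.pi * 180,             # radians -> degrees
    'dimension': np.array(line[8:11], dtype=float),    # height, width, length
    'location': np.array(line[11:14], dtype=float),    # 3D center in camera coordinates
    'calib': {'P2': P2},
}
image = plot_3d_bbox(image, label_info)
cv2.imwrite('out.png', image)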

lucasjinreal commented 5 years ago

@tonysy Setting aside what the model predicts: if I provide the ground-truth label and calib file, can you draw the 3D box? I don't think you can with this script.

Truck 0.00 0 -1.57 599.41 156.40 629.75 189.25 2.85 2.63 12.34 0.47 1.49 69.44 -1.56
Car 0.00 0 1.85 387.63 181.54 423.81 203.12 1.67 1.87 3.69 -16.53 2.39 58.49 1.57

These are ground-truth labels in KITTI. So, what are alpha and theta_ray? Which one is your center? If you use the centers from the ground truth, you already know where the 3D box is.

As a brief conclusion, you just cannot predict a 3D bounding box from a 2D predicted box and this network's prediction alone.

lucasjinreal commented 5 years ago

This is what I got using the ground truth (screenshot). What is wrong?

fuenwang commented 5 years ago

Alpha is the observation angle, but we don't need to use it. For theta_ray, you can take the center location (x, y) of the cropped 2D box and infer theta_ray from it; this is also illustrated in this figure: (screenshot)

But during my training, I directly use the ground truth to infer theta_ray.
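
A minimal sketch of that idea (the names and the zero-angle convention are assumptions; this repo's Dataset.py may measure the angle from a different axis):

import numpy as np

def estimate_theta_ray(box_2d, P2):
    # Angle of the camera ray through the horizontal center of the 2D box,
    # measured from the principal axis.
    xmin, _, xmax, _ = box_2d
    u_center = (xmin + xmax) / 2.0
    fx, cx = P2[0, 0], P2[0, 2]     # focal length and principal point (pixels)
    return np.arctan2(u_center - cx, fx)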

lucasjinreal commented 5 years ago

@fuenwang Thanks so much. After editing, I directly use rotation_y from the label rather than alpha, and now I get the right result:

(screenshot)

But would like to let me ask a few questions?

  1. In the ground truth, we use rotation_y, the center, the dimensions, and the calib; those 4 params are enough to get a 3D box and project it onto the image. How does the model infer the rotation value? And what is this rotation_y actually?

  2. The centers used directly from the labels are 3D coordinates in the camera frame, but the model only gives us the dimensions, not the centers. How do we get the centers?

fuenwang commented 5 years ago
  1. rotation_y is the theta (red), which is the global angle of the object. And the theta in this line https://github.com/fuenwang/3D-BoundingBox/blob/master/Eval.py#L72 is the predicted rotation_y.

  2. The ray through the center of the cropped image may not pass exactly through the center of the 3D box; it is just an estimation and carries some small error, but we still need this assumption to infer the translation. After obtaining the translation, we can convert it to the object's 3D center. So for now we have the object rotation, the dimensions, and the 2D bounding box of the cropped image, and we can optimize the object location from these three pieces of information as described in Sec. 3.2 of the paper (sketched in symbols below).
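
For readers following along, the constraint from Sec. 3.2 can be sketched in symbols like this (loose notation, not the paper's exact equations). Each side of the 2D detection box pins the projection of one 3D box corner, e.g.

x_{\min} = \left( K\,[R\;T] \begin{bmatrix} X_0 \\ 1 \end{bmatrix} \right)_x

where X_0 is a corner of the dimension-sized box in the object frame and (\cdot)_x denotes the perspective-projected x coordinate. The four constraints (x_min, y_min, x_max, y_max) leave T as the only unknown, which is solved by least squares over the possible corner-to-side assignments.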

lucasjinreal commented 5 years ago

@fuenwang thanks.

What I mean is:

the center we use to calculate rotation_y is 3-dimensional. If you use the center of the bounding box, that is 2D. Where do we get the needed 3D center?

fuenwang commented 5 years ago

Yes, we only have the 2D location of the 2D bounding box. So we have to estimate the 3D location of the object by projecting it onto the image, constrained by the coordinates of the 2D box. Because we have a 3D box of fixed dimensions and its rotation, the object location can theoretically be inferred as well. The whole of Sec. 3.2 tells us how to do it.

lucasjinreal commented 5 years ago

@fuenwang I looked at the code. To get the rotation you must know the 3D center in the camera coordinate system, but you only get the dimensions and a value of alpha. What confuses me is:

  1. What is the other value used for? I never see it used.
  2. The 3D center can be inferred by mapping the 2D center to 3D using the image-to-camera calibration, so then we don't need the other value at all.
yuyijie1995 commented 5 years ago

@herleeyandi Hi, how can I get the center point? I can only find the orient, conf, and dim in Eval.py.

cryax commented 5 years ago

@yuyijie1995 "3D centers can be inferred by mapping the 2D center to 3D using the image-to-camera calibration" — since the 2D center and the camera calibration info are provided, I guess you can retrieve this directly.

rebelzion commented 5 years ago

How can we get the 3D center? Sure, we can assume the center of the 2D bbox is roughly the projection of the center of the 3D bbox, but what depth value do we use to backproject the 2D bbox center to a 3D coordinate?

cryax commented 5 years ago

The author of this repo told us that part 3.2 of the paper explains how to calculate this. As I understand it, to retrieve the 3D center coordinate we need to calculate T based on R, the 2D box coordinates, and K (the intrinsic matrix); the formula is provided in the Supplementary Material. However, I'm still not sure how to do this.

tjufan commented 5 years ago

@jinfagang I am also confused about drawing the 3D bounding box from the ground truth. Could you describe your method in detail or share your code for it?

cryax commented 5 years ago

@tjufan You need to download the calibration file, the label file, and the left image file. (screenshot)

tjufan commented 5 years ago

@cryax, thanks, I have got it.

cryax commented 5 years ago

@tjufan You can use the code below for drawing the boxes. I got this code mostly from @jinfagang:

import numpy as np
import cv2

colors = {
    'green': (0, 255, 0),
    'pink': (255, 0, 255),
    'blue': (0, 0, 255)
}

def get_calibration_cam_to_image(cab_f):
    # Read the 3x4 P2 projection matrix (left color camera) from a KITTI calib file.
    for line in open(cab_f):
        if 'P2:' in line:
            cam_to_img = line.strip().split(' ')
            cam_to_img = np.asarray([float(number) for number in cam_to_img[1:]])
            cam_to_img = np.reshape(cam_to_img, (3, 4))
    return cam_to_img

def plot_3d_bbox(img, label_info, cam_to_img, is_gt=True):
    # print('current label info: ', label_info)
    alpha = label_info['alpha']
    # theta_ray = label_info['theta_ray']
    box_3d = []
    center = label_info['location']      # 3D location from the label (camera coordinates)
    dims = label_info['dimension']       # (height, width, length)

    if is_gt:
        rot_y = label_info['rot_y']
    else:
        rot_y = alpha / 180 * np.pi + np.arctan(center[0] / center[2])
        # import pdb; pdb.set_trace()

    # Generate the 8 corners of the 3D box and project them with P2.
    for i in [1, -1]:
        for j in [1, -1]:
            for k in [0, 1]:
                point = np.copy(center)
                point[0] = center[0] + i * dims[1] / 2 * np.cos(-rot_y + np.pi / 2) + (j * i) * dims[2] / 2 * np.cos(
                    -rot_y)
                point[2] = center[2] + i * dims[1] / 2 * np.sin(-rot_y + np.pi / 2) + (j * i) * dims[2] / 2 * np.sin(
                    -rot_y)
                point[1] = center[1] - k * dims[0]

                point = np.append(point, 1)
                point = np.dot(cam_to_img, point)
                point = point[:2] / point[2]
                point = point.astype(np.int16)
                box_3d.append(point)

    # Vertical edges.
    for i in range(4):
        point_1_ = box_3d[2 * i]
        point_2_ = box_3d[2 * i + 1]
        cv2.line(img, (point_1_[0], point_1_[1]), (point_2_[0], point_2_[1]), colors['pink'], 1)

    # Edges of the top and bottom faces.
    for i in range(8):
        point_1_ = box_3d[i]
        point_2_ = box_3d[(i + 2) % 8]
        cv2.line(img, (point_1_[0], point_1_[1]), (point_2_[0], point_2_[1]), colors['pink'], 1)

    return img

def main():
    img_file = './data/image_2/007091.png'
    label_file = './data/label_2/007091.txt'
    calib_file = './data/calib_2/007091.txt'
    image = cv2.imread(img_file)

    cam_to_img = get_calibration_cam_to_image(calib_file)
    with open(label_file, 'r') as f:
        for l in f.readlines():
            l = l.strip().split(' ')

            label_info = dict()
            # this angle needs a preprocess before it can be used directly
            label_info['alpha'] = float(l[3])
            label_info['box_2d'] = np.asarray(l[4:8], dtype=float)
            label_info['location'] = np.asarray(l[11:14], dtype=float)
            label_info['dimension'] = np.asarray(l[8:11], dtype=float)
            label_info['rot_y'] = float(l[14])

            image = plot_3d_bbox(image, label_info, cam_to_img)

    return image

if __name__ == '__main__':
    img = main()
    cv2.imshow('cc', img)
    cv2.waitKey()
yuyijie1995 commented 5 years ago

@cryax Hi, thanks for your suggestion. Indeed, I read part 3.2 of the paper and I still don't know how to use the method mentioned there to calculate T. And like you said, I can use the code provided by @jinfagang to draw the 3D box, but the box I drew doesn't look right. (screenshot)

cryax commented 5 years ago

@yuyijie1995 Use my attached code for drawing, and use the left color image, not the right one.

skhadem commented 5 years ago

@yuyijie1995 I used 'ThetaRay' instead of alpha for plotting the 3D ground truth and the boxes fit

skhadem commented 5 years ago

Also, this code does not use purely the net output to calculate the location. I have been working for about a week to solve the following for Tx, Ty, Tz, and I think I am close to understanding how to do it. If anyone has any suggestions, that would help too. (screenshot)

cryax commented 5 years ago

@skhadem I'm still confused about how to calculate T. We have the 4 points of the 2D box, the rotation R, the dimensions D, and the intrinsic matrix K. But what about the i-th X_o?

cryax commented 5 years ago

I found a great note here: http://ywpkwon.github.io/pdf/bbox3d-study.pdf Hope this will help us to reproduce the paper's work.

rebelzion commented 5 years ago
(screenshot from the note)

Here he assumes (xmin, ymin, xmax, ymax) = (Xa, Xb, Xc, Xd) ?

rebelzion commented 5 years ago
(screenshot from the note)

Did any of you understand this part? (It's not very readable, imo.)

skhadem commented 5 years ago

That link is pretty helpful. If I understand correctly, the equation he is setting up is Ax = b, where x is the translation, A contains values derived from the regressed dimensions, orientation, and camera intrinsics, and b is a 4x1 vector with the 2D constraints. @whoiscris my understanding is that the whole green section is expanding one of the rows of A, based on the first 2D constraint, so you would repeat that green section 4 times (once for each 2D constraint). I could be wrong; that's just my first take on it.
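
Here is one way that linear system could look in code. This is only a sketch under stated assumptions (the corner-to-side assignment is chosen in advance, and P2 is used directly as the camera's K[R|T]), not this repo's implementation:

import numpy as np

def solve_translation(P2, R, corners_obj, box_2d):
    # P2:          (3, 4) KITTI projection matrix (cam_to_img)
    # R:           (3, 3) object rotation built from the global angle rot_y
    # corners_obj: 4 corners (3-vectors, object frame) assumed to touch the
    #              left, top, right and bottom side of the 2D box, in that order
    # box_2d:      (xmin, ymin, xmax, ymax)
    M, p4 = P2[:, :3], P2[:, 3]
    xmin, ymin, xmax, ymax = box_2d
    sides = [(xmin, 0), (ymin, 1), (xmax, 0), (ymax, 1)]   # (pixel value, image axis)

    A = np.zeros((4, 3))
    b = np.zeros(4)
    for i, ((value, axis), corner) in enumerate(zip(sides, corners_obj)):
        q = R @ corner                                     # rotated corner, center at origin
        # value = (M[axis]·(q+T) + p4[axis]) / (M[2]·(q+T) + p4[2]), rearranged linearly in T:
        A[i] = M[axis] - value * M[2]
        b[i] = value * (M[2] @ q + p4[2]) - (M[axis] @ q + p4[axis])

    T, residual, _, _ = np.linalg.lstsq(A, b, rcond=None)
    return T, residual

Running this once per candidate corner assignment and keeping the T with the smallest residual would be the brute-force version of "minimize the reprojection error".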

skhadem commented 5 years ago

Still trying to figure out which corners to use for X_0, but the paper mentions: "we can solve for the translation T that minimizes the reprojection error with respect to the initial 2D detection box constraints", so my guess is that since we have a least-squares problem, there will always be some error, and we try each possibility (there are 64) and take the one with the least error.

rebelzion commented 5 years ago
(screenshot from the note)

So here he means (xmin, ymin, xmax, ymax) = (Xa, Xb, Xc, Xd), i.e. the 2D xmin, ymin, xmax, ymax are projections of some 4 of those 8 3D points, right?

skhadem commented 5 years ago

That is my understanding, yeah. So there are 64 combinations of X_0 that could potentially give that same xmin, ymin, xmax, ymax. I believe we want to try all of them and use the one with least error.

skhadem commented 5 years ago

I set up the code to be ready for the math on my fork in Run.py: https://github.com/skhadem/3D-BoundingBox I will try to get it working this weekend and will keep you guys updated.

cryax commented 5 years ago

@skhadem How's your progress?

skhadem commented 5 years ago

@cryax Really close. The last thing I need is a way to loop through the 8 3D corners and get the 64 combinations of corners to use as constraints. It shouldn't be too hard, but if anyone can think of a good way to do it, let me know. Also, I did the math from that link you sent on paper and wrote out more details if anyone is interested in seeing that. It makes complete sense now; the scaling factor was key (the paper never seems to mention it).
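
One possible way to enumerate the candidates (a sketch with itertools; the bottom-center origin convention is the KITTI one, and nothing here is taken from this repo):

import numpy as np
from itertools import product

def box_corners(dims):
    # The 8 corners of the object-frame box (KITTI: origin at the bottom-face center).
    h, w, l = dims
    xs = [l / 2, -l / 2]
    ys = [0, -h]                 # y points down, so the box extends from 0 up to -h
    zs = [w / 2, -w / 2]
    return [np.array(c) for c in product(xs, ys, zs)]

def candidate_assignments(corners):
    # All raw corner-to-side assignments (8^4 = 4096). The paper prunes these to 64
    # using the upright-object assumption; brute-forcing all of them and keeping the
    # assignment with the smallest least-squares residual also works, just more slowly.
    return product(corners, repeat=4)

Each candidate 4-tuple could then be fed into a least-squares solve like the sketch earlier in the thread, keeping the assignment with the smallest residual.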

rebelzion commented 5 years ago

@skhadem Can you post your notes so far? I will work on it too.

skhadem commented 5 years ago

I think Slack is not a bad place to do all of this. It might be overkill, but we can discuss this more there. For everyone, I guess mostly those interested: @cryax @whoiscris @yuyijie1995, click the link to join: https://join.slack.com/t/3dboundingbox-oun9186/shared_invite/enQtNDk4Njg2NzYyNzY5LWVlZWRlMjNhZmZlYjVmNGY3NWVlNDA4MmY2ZWQ3ZmUyY2Q4OWIxMmY4NzU4YmViM2ViZWI5YjgxOTIyOTI4ZjI

herleeyandi commented 5 years ago

Hello everyone, @tingZ123 and @yuyijie1995, I am sorry for my late reply. Currently I am not continuing with this since I moved my research in a different direction, but soon I will try to understand this more deeply. Personally, I can't answer all of the questions since I have forgotten most of the concepts in this paper :cry:. However, my last checkpoint was just plotting the 3D bounding boxes from the ground-truth file. Big sorry for my late reply. Here is my code.

import numpy as np
import cv2
import matplotlib.pyplot as plt

# Adjust these to your local KITTI object dataset layout.
image_dir = './data/image_2/'
calib_dir = './data/calib/'
label_dir = './data/label_2/'

f = '000500.png'
image_file = image_dir + f
calib_file = calib_dir + f.replace('png', 'txt')
predi_file = label_dir + f.replace('png', 'txt')

# read calibration data (P2 is the 3x4 projection matrix of the left color camera)
for line in open(calib_file):
    if 'P2:' in line:
        cam_to_img = line.strip().split(' ')
        cam_to_img = np.asarray([float(number) for number in cam_to_img[1:]])
        cam_to_img = np.reshape(cam_to_img, (3,4))

image = cv2.imread(image_file)
cars = []

# Draw 3D Bounding Box
for line in open(predi_file):
    line = line.strip().split(' ')

    dims   = np.asarray([float(number) for number in line[8:11]])
    center = np.asarray([float(number) for number in line[11:14]])
    rot_y  = float(line[3]) + np.arctan(center[0]/center[2])#float(line[14])

    box_3d = []

    for i in [1,-1]:
        for j in [1,-1]:
            for k in [0,1]:
                point = np.copy(center)
                point[0] = center[0] + i * dims[1]/2 * np.cos(-rot_y+np.pi/2) + (j*i) * dims[2]/2 * np.cos(-rot_y)
                point[2] = center[2] + i * dims[1]/2 * np.sin(-rot_y+np.pi/2) + (j*i) * dims[2]/2 * np.sin(-rot_y)                  
                point[1] = center[1] - k * dims[0]

                point = np.append(point, 1)
                point = np.dot(cam_to_img, point)
                point = point[:2]/point[2]
                point = point.astype(np.int16)
                box_3d.append(point)

    for i in range(4):
        point_1_ = box_3d[2*i]
        point_2_ = box_3d[2*i+1]
        cv2.line(image, (point_1_[0], point_1_[1]), (point_2_[0], point_2_[1]), (0,255,0), 2)

    for i in range(8):
        point_1_ = box_3d[i]
        point_2_ = box_3d[(i+2)%8]
        cv2.line(image, (point_1_[0], point_1_[1]), (point_2_[0], point_2_[1]), (0,255,0), 2)

image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
plt.imshow(image)

Here is the result: (screenshot)