NVIDIA / DIGITS

Deep Learning GPU Training System
https://developer.nvidia.com/digits
BSD 3-Clause "New" or "Revised" License

Clarify the names of the files used in the Object Detection Tutorial #866

Closed Greendogo closed 8 years ago

Greendogo commented 8 years ago

The specific files used in the Object Detection Tutorial from the KITTI Vision Benchmark aren't very well defined in examples/object-detection/README.md. A few comments were added at the bottom of #803.

Please specify the names and sizes of those files (in case the names change) to improve the readability of the tutorial documentation.

Greendogo commented 8 years ago

Links would be helpful.

Greendogo commented 8 years ago

Created a pull request https://github.com/NVIDIA/DIGITS/pull/868

kumarabhinavgupta commented 8 years ago

Also, please clarify the format of the label text file. Thanks.

kumarabhinavgupta commented 8 years ago

We are not sure what the right format is for the DIGITS training label file for the new object detection example.

From the KITTI dataset we figured out the following.

Two folders, images and labels, for each of training and validation. The labels folder contains text files (with the same names as the images) with data in the following format:

" Car 0.88 3 -0.69 0.00 192.37 402.31 374.00 1.60 1.57 3.23 -2.70 1.74 3.68 -1.29 Car 0.00 1 2.04 334.85 178.94 624.50 372.04 1.57 1.50 3.68 -1.17 1.65 7.86 1.90 Car 0.34 3 -1.84 937.29 197.39 1241.00 374.00 1.39 1.44 3.08 3.81 1.64 6.15 -1.31 Car 0.00 1 -1.33 597.59 176.18 720.90 261.14 1.47 1.60 3.66 1.07 1.55 14.44 -1.25 Car 0.00 0 1.74 741.18 168.83 792.25 208.43 1.70 1.63 4.08 7.24 1.55 33.20 1.95 Car 0.00 0 -1.65 884.52 178.31 956.41 240.18 1.59 1.59 2.47 8.48 1.75 19.96 -1.25 DontCare -1 -1 -10 800.38 163.67 825.45 184.07 -1 -1 -1 -1000 -1000 -1000 -10 DontCare -1 -1 -10 859.58 172.34 886.26 194.51 -1 -1 -1 -1000 -1000 -1000 -10 DontCare -1 -1 -10 801.81 163.96 825.20 183.59 -1 -1 -1 -1000 -1000 -1000 -10 DontCare -1 -1 -10 826.87 162.28 845.84 178.86 -1 -1 -1 -1000 -1000 -1000 -10

"

This is not working, so can someone please help us figure out what we are doing wrong?

Abhinav

lukeyeager commented 8 years ago

@kumarabhinavgupta here is the relevant info from devkit_object.zip > readme.txt:

Data Format Description
=======================

The data for training and testing can be found in the corresponding folders.
The sub-folders are structured as follows:

  - image_02/ contains the left color camera images (png)
  - label_02/ contains the left color camera label files (plain text files)
  - calib/ contains the calibration for all four cameras (plain text file)

The label files contain the following information, which can be read and
written using the matlab tools (readLabels.m, writeLabels.m) provided within
this devkit. All values (numerical or strings) are separated via spaces,
each row corresponds to one object. The 15 columns represent:

#Values    Name      Description
----------------------------------------------------------------------------
   1    type         Describes the type of object: 'Car', 'Van', 'Truck',
                     'Pedestrian', 'Person_sitting', 'Cyclist', 'Tram',
                     'Misc' or 'DontCare'
   1    truncated    Float from 0 (non-truncated) to 1 (truncated), where
                     truncated refers to the object leaving image boundaries
   1    occluded     Integer (0,1,2,3) indicating occlusion state:
                     0 = fully visible, 1 = partly occluded
                     2 = largely occluded, 3 = unknown
   1    alpha        Observation angle of object, ranging [-pi..pi]
   4    bbox         2D bounding box of object in the image (0-based index):
                     contains left, top, right, bottom pixel coordinates
   3    dimensions   3D object dimensions: height, width, length (in meters)
   3    location     3D object location x,y,z in camera coordinates (in meters)
   1    rotation_y   Rotation ry around Y-axis in camera coordinates [-pi..pi]
   1    score        Only for results: Float, indicating confidence in
                     detection, needed for p/r curves, higher is better.

Here, 'DontCare' labels denote regions in which objects have not been labeled,
for example because they have been too far away from the laser scanner. To
prevent such objects from being counted as false positives our evaluation
script will ignore objects detected in don't care regions of the test set.
You can use the don't care labels in the training set to avoid that your object
detector is harvesting hard negatives from those areas, in case you consider
non-object regions from the training images as negative examples.
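For reference, a minimal Python sketch (not from the devkit; the label file name used here is hypothetical) that parses one label line into the 15 fields named in the table above:

```python
# Minimal sketch: parse one line of a KITTI label file into the 15 fields
# described above. Field names follow the table; 'example.txt' is hypothetical.
def parse_kitti_label_line(line):
    v = line.split()
    return {
        'type': v[0],
        'truncated': float(v[1]),
        'occluded': int(v[2]),
        'alpha': float(v[3]),
        'bbox': [float(x) for x in v[4:8]],         # left, top, right, bottom
        'dimensions': [float(x) for x in v[8:11]],  # height, width, length (m)
        'location': [float(x) for x in v[11:14]],   # x, y, z in camera coords (m)
        'rotation_y': float(v[14]),
        # A 16th column (score) is only present in result files.
        'score': float(v[15]) if len(v) > 15 else None,
    }

with open('example.txt') as f:  # hypothetical label file
    objects = [parse_kitti_label_line(l) for l in f if l.strip()]
```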

The coordinates in the camera coordinate system can be projected in the image
by using the 3x4 projection matrix in the calib folder, where for the left
color camera for which the images are provided, P2 must be used. The
difference between rotation_y and alpha is, that rotation_y is directly
given in camera coordinates, while alpha also considers the vector from the
camera center to the object center, to compute the relative orientation of
the object with respect to the camera. For example, a car which is facing
along the X-axis of the camera coordinate system corresponds to rotation_y=0,
no matter where it is located in the X/Z plane (bird's eye view), while
alpha is zero only, when this object is located along the Z-axis of the
camera. When moving the car away from the Z-axis, the observation angle
will change.

To project a point from Velodyne coordinates into the left color image,
you can use this formula: x = P2 * R0_rect * Tr_velo_to_cam * y
For the right color image: x = P3 * R0_rect * Tr_velo_to_cam * y

Note: All matrices are stored row-major, i.e., the first values correspond
to the first row. R0_rect contains a 3x3 matrix which you need to extend to
a 4x4 matrix by adding a 1 as the bottom-right element and 0's elsewhere.
Tr_xxx is a 3x4 matrix (R|t), which you need to extend to a 4x4 matrix 
in the same way!
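As a sketch of the projection above (again not from the devkit; it assumes P2, R0_rect and Tr_velo_to_cam have already been read from a calib file, which is not shown here):

```python
# Minimal sketch of x = P2 * R0_rect * Tr_velo_to_cam * y for one Velodyne point.
# P2 is 3x4, R0_rect is 3x3, Tr_velo_to_cam is 3x4, all assumed to come from a
# KITTI calib file; reading that file is not part of this sketch.
import numpy as np

def project_velo_to_image(pt_velo, P2, R0_rect, Tr_velo_to_cam):
    # Extend R0_rect (3x3) to 4x4 with a 1 in the bottom-right corner, 0s elsewhere.
    R = np.eye(4)
    R[:3, :3] = R0_rect
    # Extend Tr_velo_to_cam (3x4) to 4x4 in the same way.
    T = np.eye(4)
    T[:3, :4] = Tr_velo_to_cam
    # Homogeneous Velodyne point y = (x, y, z, 1).
    y = np.append(pt_velo, 1.0)
    x = P2 @ R @ T @ y      # 3-vector in homogeneous image coordinates
    return x[:2] / x[2]     # pixel coordinates (u, v)
```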

Note, that while all this information is available for the training data,
only the data which is actually needed for the particular benchmark must
be provided to the evaluation server. However, all 15 values must be provided
at all times, with the unused ones set to their default values (=invalid) as
specified in writeLabels.m. Additionally a 16'th value must be provided
with a floating value of the score for a particular detection, where higher
indicates higher confidence in the detection. The range of your scores will
be automatically determined by our evaluation server, you don't have to
normalize it, but it should be roughly linear. If you use writeLabels.m for
writing your results, this function will take care of storing all required
data correctly.

Do you find that helpful? If so, I'll find a way to embed it in the UI and/or the example's README.

kumarabhinavgupta commented 8 years ago

Thanks Luke. This was extremely helpful..!

lukeyeager commented 8 years ago

Closed by #876

AbhinavDS commented 7 years ago

My question is regarding alpha and rotation_y. Given the camera matrix, is it possible to convert one to another? If yes, could you please tell how? Thanks.

freelist commented 3 years ago

My question is regarding alpha and rotation_y. Given the camera matrix, is it possible to convert one to another? If yes, could you please tell how? Thanks.

Did you get any answer on this? I would like to find a way to convert a transformation matrix, e.g. [R|t; 0 0 0 1], to alpha and rotation_y directly. Do you have any hint?

EddSB commented 1 year ago

Looking at a few images and labels, I have come to the following conclusions (not sure if they are correct):

The rotation_y is the object's rotation in camera coordinates, independent of the object's position, as seen in the first picture. (Please excuse the crude programmer art.)

[figure: Camera_Coords]

The alpha (observation angle) is rotation_y minus the viewing angle theta of the ray from the camera to the object. That is why both values are similar for far-away objects, which usually also appear closer to the center of the picture.

[figure: alpha]

That is what fits my observations, at least. If someone finds observations that say otherwise, please correct me.
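For what it's worth, this matches the relation commonly used with KITTI labels: alpha is rotation_y minus the viewing angle of the ray from the camera to the object, i.e. alpha ≈ rotation_y - atan2(x, z), with x and z taken from the location field in camera coordinates. A minimal sketch of the conversion both ways (an assumption based on the readme above, not devkit code):

```python
# Sketch of the commonly used KITTI relation between rotation_y and alpha.
# x, z are the object's location in camera coordinates (from the label file).
# This is an assumption based on the readme above, not devkit-provided code.
import math

def _wrap(angle):
    # Wrap an angle to [-pi, pi).
    return (angle + math.pi) % (2 * math.pi) - math.pi

def rotation_y_to_alpha(rotation_y, x, z):
    return _wrap(rotation_y - math.atan2(x, z))

def alpha_to_rotation_y(alpha, x, z):
    return _wrap(alpha + math.atan2(x, z))
```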