aushani / tsf

Temporal Scene Flow (ICRA 2017)
http://www.aushani.com/pdfs/aushani-2017a.pdf

[P0] Problems in your implementation and experiments #4

Open yiakwy opened 3 years ago

yiakwy commented 3 years ago

Background

I compiled your program successfully, with a few notes:

# build
1. OSG depends on GCC 7.5 (> 4.8), and its libraries need to be exported to the global library search path
2. CUDA 10.0 is sufficient

# algorithm
Building the occupancy grid costs about 500 ms, which is not reported in the paper, and I also noticed that the CUDA implementation is not optimized:
- no grid search on the GPU, which recent architectures make practical
- the Velodyne point coordinates are copied repeatedly between host and device (see the sketch after this list)
- the pyramid occupancy grid interpolation is done on the CPU
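
To make the second bullet concrete, here is a minimal sketch of the kind of buffer reuse I have in mind: a pinned host staging buffer and a persistent device buffer that are allocated once and reused for every scan. The names and the size bound are hypothetical, not taken from your code:

```cpp
#include <cuda_runtime.h>

// Hypothetical upper bound on returns per Velodyne scan.
constexpr size_t kMaxPoints = 150000;

struct ScanBuffers {
  float3* h_points = nullptr;  // pinned host staging buffer, reused every scan
  float3* d_points = nullptr;  // device buffer, allocated once
  cudaStream_t stream;
};

void InitBuffers(ScanBuffers* b) {
  cudaMallocHost(reinterpret_cast<void**>(&b->h_points),
                 kMaxPoints * sizeof(float3));
  cudaMalloc(reinterpret_cast<void**>(&b->d_points),
             kMaxPoints * sizeof(float3));
  cudaStreamCreate(&b->stream);
}

// Upload one scan without reallocating anything; the async copy from pinned
// memory can overlap with other work queued on the same stream.
void UploadScan(ScanBuffers* b, size_t num_points) {
  cudaMemcpyAsync(b->d_points, b->h_points, num_points * sizeof(float3),
                  cudaMemcpyHostToDevice, b->stream);
}

void FreeBuffers(ScanBuffers* b) {
  cudaFreeHost(b->h_points);
  cudaFree(b->d_points);
  cudaStreamDestroy(b->stream);
}
```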

Problems

There are three major problems with the implementation. @aushani

Prob No.1: BG Filter

Usually, scene flow itself serves as the background filter before the marked dynamic points are sent to a classifier. In this project, however, as defined in the callback "update_scan" in "app/flow/flow_app.cpp", you use a regression model as a background filter between the basic ROI filtering (ground extraction, removing points that are not on the road surface) and the flow computation.

Prob No.2: Raytracing with occupancy grid

You claim to use ray tracing to build the occupancy grid. However, when I dug into the code and debugged it with GDB, I found that you do not use a structured point cloud -- which is not usually the case with the KITTI dataset.

By unstructured I mean that the functions velodyne_decode_packed (defined in "library/sensors/velodyne.cc") and load_velodyne_file (defined in "library/utils/utils.cc") are never called in the program, so the points are not traversed by beam column, i.e. no yaw and pitch are available for each 3D point.

This implies that your GPU algorithm works per point rather than per beam column, costing 500 ms, which is no faster than using a hash table on the CPU.
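
To be concrete about the terminology (the structs below are hypothetical illustrations of mine, not types from your repo): an unstructured cloud is just a flat list of XYZ points, while a structured cloud preserves the beam/column layout so that yaw and pitch are implicit for every return:

```cpp
#include <vector>

// Unstructured: a flat list of 3D points with no sensor geometry attached.
struct PointXYZ {
  float x, y, z;
};
using UnstructuredCloud = std::vector<PointXYZ>;

// Structured (hypothetical layout): the returns are kept in a beams-by-columns
// grid, so the laser index (pitch) and firing column (yaw) of every return are
// implicit in its position, and neighbors along a beam or a column are known.
struct StructuredCloud {
  int num_beams;                 // e.g. 64 lasers on an HDL-64E
  int num_columns;               // firings per revolution
  std::vector<PointXYZ> points;  // size num_beams * num_columns, row-major

  const PointXYZ& at(int beam, int column) const {
    return points[beam * num_columns + column];
  }
};
```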

Prob No.3: Solver

It seems that you are actually using a Lucas-Kanade-style pyramid iteration scheme with a single layer, i.e. no scales and no velocity interpolation between layers at different scales.

Since the flow at each iteration is as simple as a regression problem, I cannot follow your main point of view: why did you choose an EM scheme? It doesn't make sense to me.

aushani commented 3 years ago

Hi, and thanks for your interest! Unfortunately this is fairly old work and hasn't been actively maintained for a long time, but I'll do my best.

Background

I compiled your program successfully, with a few notes:


# build
1. OSG depends on GCC 7.5 (> 4.8), and its libraries need to be exported to the global library search path
2. CUDA 10.0 is sufficient

I wouldn't be surprised if the build dependencies are out of date by now. When I was writing this code I believe only CUDA 7.5 was out (or maybe even older).

algorithm

Building the occupancy grid costs about 500 ms, which is not reported in the paper, and I also noticed that the CUDA implementation is not optimized:

  • no grid search on the GPU, which recent architectures make practical
  • the Velodyne point coordinates are copied repeatedly between host and device
  • the pyramid occupancy grid interpolation is done on the CPU

Please refer to the paper (in particular section VI) for the full timing statistics, compute hardware used, and algorithms. I was using state-of-the-art computer hardware as of 2016. I wouldn't be surprised at all if newer GPU architectures over the past 4 years enabled new algorithms that improved performance even further. I encourage you to see what you can do!

Velodyne point coordinates have to be copied for each new scan. I'm not sure what pyramid interpolation you're referring to - maybe my memory is dusty but the word pyramid doesn't show up once in the codebase.

Problems

There are three major problems with the implementation. @aushani

Prob No.1: BG Filter

Usually, scene flow itself serves as the background filter before the marked dynamic points are sent to a classifier. In this project, however, as defined in the callback "update_scan" in "app/flow/flow_app.cpp", you use a regression model as a background filter between the basic ROI filtering (ground extraction, removing points that are not on the road surface) and the flow computation.

I'm not sure why this is an issue. Having a cheap background filter to quickly identify static areas with no motion was very effective, especially for the targeted application of self-driving cars that are driving on roads with a lot of fixed infrastructure. I'm sure you could do it the other way (compute dense flow for everything, and then use that to help classify background/foreground) if your particular application warranted it, but each method has its own tradeoffs.
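
As a rough sketch of the ordering I mean -- hypothetical names and trivial placeholder bodies, not the code in this repo:

```cpp
#include <vector>

// Minimal stand-in types; the real code has its own.
struct Point { float x, y, z; };
struct FlowVec { float dx, dy; bool valid; };
using Cloud = std::vector<Point>;

// Crude height cut as a stand-in for ground extraction / road-surface ROI
// filtering.
Cloud RoiFilter(const Cloud& in) {
  Cloud out;
  for (const auto& p : in)
    if (p.z > 0.2f && p.z < 2.5f) out.push_back(p);
  return out;
}

// Cheap background filter: in the real system this is a learned model that
// flags static areas; here it is just a placeholder that flags nothing.
std::vector<bool> BackgroundMask(const Cloud& in) {
  return std::vector<bool>(in.size(), false);
}

// Expensive flow estimation, run only where the background filter did not
// already decide the area is static.
std::vector<FlowVec> EstimateFlow(const Cloud& prev, const Cloud& curr,
                                  const std::vector<bool>& is_static) {
  std::vector<FlowVec> flow(curr.size(), FlowVec{0.f, 0.f, false});
  for (size_t i = 0; i < curr.size(); ++i) {
    if (is_static[i]) continue;  // skip the expensive solve for static areas
    // ... the actual flow solve for this location would go here ...
    flow[i].valid = true;
  }
  (void)prev;
  return flow;
}
```

The point is simply that the cheap filter runs first so the expensive solve can be skipped wherever the scene is clearly static.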

Prob No.2: Raytracing with occupancy grid

You claim to use ray tracing to build the occupancy grid. However, when I dug into the code and debugged it with GDB, I found that you do not use a structured point cloud -- which is not usually the case with the KITTI dataset.

By unstructured I mean that the functions velodyne_decode_packed (defined in "library/sensors/velodyne.cc") and load_velodyne_file (defined in "library/utils/utils.cc") are never called in the program, so the points are not traversed by beam column, i.e. no yaw and pitch are available for each 3D point.

This implies that your GPU algorithm works per point rather than per beam column, costing 500 ms, which is no faster than using a hash table on the CPU.

Please see gpu_occ_grid_builder.cu.cc for the ray tracing implementation. In particular, the CUDA kernel is called ray_trace_rw_gpu.
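
For anyone reading along, the general idea of occupancy-grid ray tracing on the GPU looks roughly like the sketch below. To be clear, this is a simplified illustration with made-up grid parameters, not the actual ray_trace_rw_gpu kernel; it just marches each ray one voxel step at a time, marking free space along the way and the endpoint as occupied:

```cpp
#include <cuda_runtime.h>

// Simplified occupancy-grid ray tracing: one thread per Velodyne return.
// Cells between the sensor origin and the hit get a "free" log-odds update,
// and the cell containing the hit gets an "occupied" update. The grid
// parameters here are hypothetical, not the ones used in the paper.
struct GridParams {
  float resolution;                    // meters per voxel
  int   nx, ny, nz;                    // voxels per axis
  float origin_x, origin_y, origin_z;  // world coords of voxel (0,0,0)
};

__device__ int VoxelIndex(const GridParams& g, float x, float y, float z) {
  int ix = __float2int_rd((x - g.origin_x) / g.resolution);
  int iy = __float2int_rd((y - g.origin_y) / g.resolution);
  int iz = __float2int_rd((z - g.origin_z) / g.resolution);
  if (ix < 0 || iy < 0 || iz < 0 || ix >= g.nx || iy >= g.ny || iz >= g.nz)
    return -1;
  return (ix * g.ny + iy) * g.nz + iz;
}

__global__ void RayTraceKernel(const float3* hits, int num_hits,
                               float3 sensor_origin, GridParams grid,
                               float* log_odds,  // nx*ny*nz accumulator
                               float lo_free, float lo_occ) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= num_hits) return;

  float3 p = hits[i];
  float dx = p.x - sensor_origin.x;
  float dy = p.y - sensor_origin.y;
  float dz = p.z - sensor_origin.z;
  float len = sqrtf(dx * dx + dy * dy + dz * dz);
  if (len < 1e-3f) return;

  // March along the ray in steps of one voxel, marking free space.
  int steps = static_cast<int>(len / grid.resolution);
  for (int s = 0; s < steps; ++s) {
    float t = (s * grid.resolution) / len;
    int idx = VoxelIndex(grid, sensor_origin.x + t * dx,
                               sensor_origin.y + t * dy,
                               sensor_origin.z + t * dz);
    if (idx >= 0) atomicAdd(&log_odds[idx], lo_free);  // lo_free is negative
  }

  // Mark the endpoint as occupied.
  int hit_idx = VoxelIndex(grid, p.x, p.y, p.z);
  if (hit_idx >= 0) atomicAdd(&log_odds[hit_idx], lo_occ);
}
```

Atomics are used because many rays can touch the same voxel in parallel; a real implementation would be more careful about duplicate free-space updates and step size, but this captures the basic pattern.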

There are some additional functions in the code that are never called here. The velodyne decoding functions you mentioned were used in the full self-driving car codebase to interact with the sensors, but aren't necessary here because we're just loading point clouds directly from KITTI.

Prob No.3: Solver

It seems that you are actually using a Lucas-Kanade-style pyramid iteration scheme with a single layer, i.e. no scales and no velocity interpolation between layers at different scales.

Since the flow at each iteration is as simple as a regression problem, I cannot follow your main point of view: why did you choose an EM scheme? It doesn't make sense to me.

Please see the paper (in particular, section V-B) for the full details on the algorithm. While the literature contains many different methods for computing flow, there is no pyramid scheme in this algorithm. Also see flow_solver.cu.cc, in particular the functions flow_solver_cookie1 and flow_solver_cookie2 (named that way because I was eating a very delicious and memorable cookie when I wrote that section of code).
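
As some general intuition for EM-style iterations in this setting (not a description of the algorithm in the paper or in flow_solver.cu.cc): the point correspondences are latent, so an E-step soft-assigns correspondences given the current motion estimate, and an M-step re-estimates the motion given those soft assignments. The toy example below illustrates that pattern for a single 2D translation between two tiny point sets:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Toy EM-style estimate of a 2D translation between two small point sets.
// E-step: soft correspondence weights given the current translation.
// M-step: weighted re-estimate of the translation from those matches.
struct Pt { float x, y; };

int main() {
  std::vector<Pt> src = {{0, 0}, {1, 0}, {0, 1}};
  std::vector<Pt> dst = {{2, 1}, {3, 1}, {2, 2}};  // src shifted by (2, 1)

  float tx = 0.f, ty = 0.f;   // current flow (translation) estimate
  const float sigma2 = 1.0f;  // correspondence noise variance

  for (int iter = 0; iter < 20; ++iter) {
    float sum_w = 0.f, sum_dx = 0.f, sum_dy = 0.f;
    for (const Pt& s : src) {
      for (const Pt& d : dst) {
        // E-step: weight each candidate match by how well it agrees with
        // the current translation estimate.
        float ex = d.x - (s.x + tx);
        float ey = d.y - (s.y + ty);
        float w = std::exp(-(ex * ex + ey * ey) / (2.f * sigma2));
        sum_w  += w;
        sum_dx += w * (d.x - s.x);
        sum_dy += w * (d.y - s.y);
      }
    }
    // M-step: weighted least-squares update of the translation.
    if (sum_w > 1e-9f) {
      tx = sum_dx / sum_w;
      ty = sum_dy / sum_w;
    }
  }
  std::printf("estimated flow: (%.2f, %.2f)\n", tx, ty);  // roughly (2, 1)
  return 0;
}
```

Iterating the two steps lets the correspondences and the motion estimate refine each other, which is the usual reason to prefer an EM-style scheme over a one-shot regression when the matches are ambiguous.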