CUDA support for registration

fengjim commented 7 years ago

Overview Description:

I'm using libfreenect2 to collect RGB and Depth Data from Kinect2 devices on Linux (Ubuntu 14.04) and generate PCL point cloud based on that.

The steps are generally: 1) libfreenect2::SyncMultiFrameListener::waitForNewFrame() to get the RGB and Depth frames, 2) libfreenect2::Registration::apply() to align them, 3) loop through 512x424 matrix and call libfreenect2::Registration::getPointXYZRGB() to fill all the matrix elements.

According to the performance testing result, step 3# was the one taking most of time in the whole pipeline. I was thinking to use parallel programming (either CUDA with GPU or multiple thread on CPU) in Step 3# to improve the efficiency. However, considering libfreenect2 has already providing CUDA/OpenGL options of pipelines etc. , it might looks helpful if libfreenect2 could provide one more function along aside with getPointXYZRGB() using CUAD etc. to generate all the points, i.e. adding libfreenect2::Registration::getPointXYZRGB(const Frame* undistorted, const Frame* registered, float* depth, uint_8* color), where 'depth' point to a 3x512x424 array of float representing point (X, Y, Z) matrix and 'color' point to a 3x512x424 array of uinit_8 representing color for related point.

Would you please kindly share your comments/thoughts about this?

Thanks in advance!

xlz commented 7 years ago

It's worth trying.

You probably want to create a class CudaRegistration in cuda_registration.cpp and registration.h.

To interoperate with PCL, you'll need to look at the byte format of PCL. It looks like a pointcloud in PCL stores data in vector<pcl::PointXYZRGB> or vector<pcl::PointXYZ>. pcl::PointXYZ is 4 bytes. pcl::PointXYZRGB is 8 bytes.

fengjim commented 7 years ago

@xlz, thanks a lot for your reply and suggestions.

I will have a try then:)

hanshammel1337 commented 7 years ago

@fengjim Did you make any progress on this topic? I am also interested in it.

fengjim commented 7 years ago

Hi @hanshammel1337 , umm, I haven't done any real progress other than checking GPU related materials.

You may go ahead to start it over:) would be appreciated that you can share your branch later after you start working on it.

blackzafiro commented 7 years ago

Hi, are there any news about this? I will need this feature, so I will be working on something of the sort for the next weeks. I don't have much experience with CUDA, so I'll be a bit slow, but if I can help just let me know.

aosewski commented 7 years ago

I'm trying to use libfreenect library on Jetson Tk1, however I've got terrible performance when receiving both rgb and depth data with protonect, it's simply unusable. So when I encountered this thread I thought I could give it a try to port some part on GPU. My first step to do this was an attempt to obtain an application profile with callgrind. However that ended with failure. Every time I run callgrind I receive no information because application hangs down. Here is an output I get:

[ubuntu@tegra-ubuntu bin]$ valgrind --tool=callgrind ./bin/Protonect -noviewer cuda
==5048== Callgrind, a call-graph generating cache profiler
==5048== Copyright (C) 2002-2013, and GNU GPL'd, by Josef Weidendorfer et al.
==5048== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==5048== Command: ./bin/Protonect -noviewer cuda
==5048== 
==5048== For interactive control, run 'callgrind_control -h'.
^C==5048== 
==5048== Events    : Ir
==5048== Collected : 581231764
==5048== 
==5048== I   refs:      581,231,764

[ubuntu@tegra-ubuntu build]$ valgrind --tool=callgrind ./bin/Protonect -noviewer     
==5077== Callgrind, a call-graph generating cache profiler
==5077== Copyright (C) 2002-2013, and GNU GPL'd, by Josef Weidendorfer et al.
==5077== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==5077== Command: ./bin/Protonect -noviewer
==5077== 
==5077== For interactive control, run 'callgrind_control -h'.
Version: 0.2.0
Environment variables: LOGFILE=<protonect.log>
Usage: ./bin/Protonect [-gpu=<id>] [gl | cl | clkde | cuda | cudakde | cpu] [<device serial>]
        [-noviewer] [-norgb | -nodepth] [-help] [-version]
        [-frames <number of frames to process>]
To pause and unpause: pkill -USR1 Protonect
^C==5077== 
==5077== Events    : Ir
==5077== Collected : 367982872
==5077== 
==5077== I   refs:      367,982,872
Killed
[ubuntu@tegra-ubuntu build]$ valgrind --tool=callgrind ./bin/Protonect -noviewer cuda -nodepth
==5079== Callgrind, a call-graph generating cache profiler
==5079== Copyright (C) 2002-2013, and GNU GPL'd, by Josef Weidendorfer et al.
==5079== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==5079== Command: ./bin/Protonect -noviewer cuda -nodepth
==5079== 
==5079== For interactive control, run 'callgrind_control -h'.
Version: 0.2.0
Environment variables: LOGFILE=<protonect.log>
Usage: ./bin/Protonect [-gpu=<id>] [gl | cl | clkde | cuda | cudakde | cpu] [<device serial>]
        [-noviewer] [-norgb | -nodepth] [-help] [-version]
        [-frames <number of frames to process>]
To pause and unpause: pkill -USR1 Protonect
^C==5079== 
==5079== Events    : Ir
==5079== Collected : 355734782
==5079== 
==5079== I   refs:      355,734,782
Killed
[ubuntu@tegra-ubuntu build]$ valgrind --tool=callgrind ./bin/Protonect -noviewer cuda -norgb  
==5082== Callgrind, a call-graph generating cache profiler
==5082== Copyright (C) 2002-2013, and GNU GPL'd, by Josef Weidendorfer et al.
==5082== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==5082== Command: ./bin/Protonect -noviewer cuda -norgb
==5082== 
==5082== For interactive control, run 'callgrind_control -h'.
Version: 0.2.0
Environment variables: LOGFILE=<protonect.log>
Usage: ./bin/Protonect [-gpu=<id>] [gl | cl | clkde | cuda | cudakde | cpu] [<device serial>]
        [-noviewer] [-norgb | -nodepth] [-help] [-version]
        [-frames <number of frames to process>]
To pause and unpause: pkill -USR1 Protonect
^C==5082== 
==5082== Events    : Ir
==5082== Collected : 244213827
==5082== 
==5082== I   refs:      244,213,827

When I ran Protonect with ./Protonect -noviewer cuda -norgb or ./Protonect -noviewer cuda -nodepth then program works as expected, yielding quite good performance about 80-90 fps. Whereas when running ./Protonect -noviewer cuda I get information only about skipped packets:

...
[Debug] [DepthPacketStreamParser] skipping depth packet
[Debug] [RgbPacketStreamParser] skipping rgb packet!
...

I compile library with following command:

cmake .. -DENABLE_CXX11=ON -DCMAKE_INSTALL_PREFIX=/usr/local/lib/freenect2 && make -j2 && sudo make install

I've also tried to use gprof by passing -DCMAKE_CXX_FLAGS=-pg, also with no luck.

I'd be very grateful for any information that could help me to obtain application profile.

xlz commented 7 years ago

Jetson TK1's CPU is slow. Protonect -noviewer cuda still does registration which is slow on CPU.

If you really want you can use perf tool but you have to build it from source and there isn't much useful information. The most useful indicator would be CPU usage per thread and I expect the main thread has the highest because it does on CPU registration.

So there isn't much you can do except commenting out registration in Protonect.cpp. Jetson TK1 is barely capable enough of handling Kinect and it takes careful optimization. If Cuda registration is done this might get better.

blackzafiro commented 7 years ago

I cloned this repository and finished a first implementation of registration apply method, depth looks good but I haven't had the chance to check the color registered frame. I'll let you know as soon as it looks presentable.

blackzafiro commented 7 years ago

It works!

cudaregistration_1

xlz commented 7 years ago

Make it a PR.

blackzafiro commented 7 years ago

Ok, I only implemented the function apply. Should I finish the other ones that work in parallel or PR now?

xlz commented 7 years ago

You can create a PR for us to see and amend it with new commits later.

mlaz commented 7 years ago

Hi, I am also interested in this, I am currently running a cuda kernel which does the registration and I am looking to pass the rgb buffer (rgb->data) to this kernel using zerocopy. This memory region apparently is allocated by the gst-jpeg library for tegra provided by nvidia (in my case using Jetson TK1) on NVMM. Is there any way to do this without having to copy the whole memory region to a pinned memory region?

xlz commented 7 years ago

Is there any way to do this

Okay, this is fairly complicated. In terms of TK1 the ideal way is zero-copy, i.e. not even cudaMemcpy(). There is something some unified virtual addressing supported by TK1 but I haven't figured out how to make this paradigm portable on platforms without such support without making a mess of code.

One-copy is also possible. The memory jpegTegraMgr->buff[0] allocated by Nvidia internal libraries is probably page-locked but I'm not sure. cudaHostRegister() can turn a host memory pointer into "pinned memory", which is what you're asking for, but I'm not sure if it works or what will happen if you page lock the same address twice.

In short, try cudaHostRegister() first. You want to cudaHostRegister() just once and see what happens ~~~https://github.com/OpenKinect/libfreenect2/blob/master/src/tegra_jpeg_rgb_packet_processor.cpp#L147~~~

Sorry, you need another way to enforce cudaHostRegister() being called on the same address just once.

mlaz commented 7 years ago

I already tried to cudaHostRegister() the memory region but it looks like it is not supported on ARM platforms, according to this thread: https://devtalk.nvidia.com/default/topic/998962/cuda-zero-copy-on-tx1/

xlz commented 7 years ago

We don't have any control on how it is allocated. The part is not open source.

Have you tried AastaLLL's example? Just start with cudaSetDeviceFlags(cudaDeviceMapHost); and cudaHostGetDevicePointer() on jpegTegraMgr->buff[0] without the malloc part.

mlaz commented 7 years ago

I am already doing cudaSetDeviceFlags(cudaDeviceMapHost) and also or'ed cudaHostAllocMapped on the CudaAllocator's cudaHostAlloc() flags, which (unexpectedly) allowed me to use the allocated data regions (i.e. depth->c_map) without needing to cudaHostGetDevicePointer(). I will do some more testing, just to make sure but I recall not being able to do this with rgb->data, which means this might be page-locked memory but not mapped on the device.

To clarify: I only tried cudaHostGetDevicePointer() after cudaHostRegister() on rgb->data, which returned an error.

xlz commented 7 years ago

unexpectedly

It's exactly the unified virtual addressing on TK1. But I can't make this portable yet.

page-locked memory but not mapped on the device

I guess the secret sauce is how to map it to the device.

But if it's already page locked then the "the caching attribute of an existing allocation can't be changed on the fly" issue of cudaHostRegister() not being supported is no longer relevant because it only needs to do the mapping part.

blackzafiro commented 7 years ago

PR #822

OpenKinect / libfreenect2

CUDA support for registration #744