Improve performance guo hall algorithm

gdenn commented 8 years ago

Currently the most time consuming component in nefi's algorithm section is guo hall. This algorithm already got some great optimizations through to the fact that the thinning module is written in c++. Although we could think about improving the guo hall further.

For me i think there would be two options: 1) we make extensive use of optimized data structures from numpy like narray, that would improve the performance since the memory allocation is done using c++ code.

2) we move the complete guo hall implementation in c++ and make it a blackbox which receives a image and a networkx object as container and renders the resulting graph using native code into the container. This option would be more time consuming (in the development process), but since we have to have a compiled resource in our nefi project either way (thinning module), it wouldn't change much for the user in regards to the installation process but yield a far more efficient implementation.

To give you an idea about how time consuming the guo hall thinning can be, take the wing.jpg in large version. A.junius takes for me seconds to compute the preprocessing and segmentation but around ~100 sec to compute guo hall.

There might be other components especially in the graph filtering section which need more optimization too. Ill look further to analyse the computing times and provide a table with the computation relative to my system.

adrianN commented 8 years ago

Guo-Hall can in principle be parallelized. You can split the image in chunks and run the algorithm. You only have to be careful with the pixels on the chunk boundary. This seems to me to be the easiest optimization. (It says "Parallel thinning" right in the title of the corresponding paper)

Constructing the graph from the skeleton is relatively expensive because NetworkX graphs are just not particularly performant data structures. They're build for flexibility. But there might be some potential in translating the white-pixel-bfs to c++ and returning reduced data to Python.

gdenn commented 8 years ago

If we aim for parallel computing we should more than before think about a complete c++ guo hall module since multi threading can be difficult in python.

adrianN commented 8 years ago

Giving it some more thought, Guo Hall is very easy to parallelize because you can decide for each white pixel whether it's a removable pixel independently from all other pixels. So split the image into n horizontal bars for n processors and have them work independently. No boundary conditions.

It should be easy to parallelize the existing thinning module without changing the Python interface, although I suggest piping the number of processors in as an additional parameter. This should definitely be the first step if you want to speed up graph detection by something other than micro-optimizations. The repository for the module is over at bitbucket https://bitbucket.org/adrian_n/thinning/, I accept pull requests.

To my knowledge, portable, fast multithreading in C is not exactly trivial. If possible you should make it work without a library.

phreic commented 8 years ago

I've already tried exactly this parallelization thing with different chunks in Python but it even decreased the performance. It may improve the performance if you don't run different threads but run the application itself on different cores but it's not that easy since we need to refactor the whole project structure. But even then it may not improve the performance on Python.

adrianN commented 8 years ago

Yeah, I don't think parallelizing the Python code will yield a lot of improvement. But parallelizing the C code probably will.

gdenn commented 8 years ago

Witht he c code you could theoretically also use cuda and be extremly efficient there.

adrianN commented 8 years ago

That's true, but it requires a compatible graphics card and would make the build more complicated if you want to support normal machines.

Another thing one could do is inspect the assembly generated from the C code to see whether the compiler properly vectorizes the loops.

gdenn commented 8 years ago

Ill take a closer look into it as soon as we are finished with the stuff needed to make a clean nefi 2 release.

tastyminerals commented 8 years ago

Before modifying guo_hall.py benchmark it with line profiler to see the problem regions in the code.

gdenn commented 8 years ago

So we found out that the graph edge detection is the critical part. I want to move the complete guo hall graph detection in c++ using boots.python.

I am also thinking about creating a c++ library for this purpose were we can gather other c++ implementations for e.g. graph filtering algorithms later.

If the performance is still not as good as we want it, we can think about dividing the image in parts and computing the result concurrent. Anyways, we need to take into account that a parallel solution also comes with a overhead...

05dirnbe / nefi

Improve performance guo hall algorithm #14