Recursive version of multithreading

Our current implementation is rather simplistic and naive: it just splits the matrix into equal chunks and hands them to different threads. But this is not how multithreading in Julia was intended to be used: https://julialang.org/blog/2019/07/multithreading/

The general idea is to write a recursive function that splits the matrix in half and calls itself on each half. Once a chunk falls below some limit (fewer than 1000 columns, for example?), the actual procedure should run on that chunk.
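A minimal sketch of what this could look like, using `Threads.@spawn` as described in the linked blog post. The names `recursive_chunks!` and `leaf_kernel!` are hypothetical, the leaf kernel (squared norm of each column) is only a stand-in for the real per-chunk work, and the `cutoff` of 1000 is the guess from above, not a tuned value.

```julia
using Base.Threads: @spawn

# Hypothetical stand-in for the real per-chunk work:
# writes the squared norm of every column in `cols` into `out`.
function leaf_kernel!(out, X, cols)
    for j in cols
        s = zero(eltype(out))
        for i in axes(X, 1)
            s += X[i, j]^2
        end
        out[j] = s
    end
    return out
end

# Split the column range in half until a chunk is smaller than `cutoff`,
# then run the kernel on that chunk directly.
function recursive_chunks!(out, X, cols = axes(X, 2); cutoff = 1000)
    # Small chunk (or nothing left to split): no task is spawned at all.
    if length(cols) < cutoff || length(cols) < 2
        return leaf_kernel!(out, X, cols)
    end
    mid = (first(cols) + last(cols)) ÷ 2
    # The left half runs as a separate task; the right half runs in the current one.
    left = @spawn recursive_chunks!(out, X, first(cols):mid; cutoff = cutoff)
    recursive_chunks!(out, X, (mid + 1):last(cols); cutoff = cutoff)
    wait(left)  # both halves write to disjoint columns of `out`
    return out
end

X = ones(3, 5000)
out = zeros(size(X, 2))
recursive_chunks!(out, X)
```

With fewer than `cutoff` columns the call goes straight to `leaf_kernel!`, so the same entry point covers both the small and the large case. The number of threads available to the scheduler is set when Julia starts, for example through the `JULIA_NUM_THREADS` environment variable.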

This approach has its benefits. For example, there would be no multithreading penalty for small matrices, since the algorithm wouldn't start new threads in that case. It would also let us remove the MultiThreading/SingleThread modes completely. We should implement this approach and benchmark it properly.

PyDataBlog / ParallelKMeans.jl

Recursive version of multithreading #23