autumnai / collenchyma

Extendable HPC-Framework for CUDA, OpenCL and common CPU
http://autumnai.github.io/collenchyma
Apache License 2.0
475 stars, 35 forks

Reduce overhead of using libraries #13

Open hobofan opened 8 years ago

hobofan commented 8 years ago

We currently still have a significant overhead compared to directly calling a library implementation. As far as I can tell from profiling, most of that overhead is due to dynamic dispatch, which in some cases might only be removable with a bigger restructuring of the library.

Any input on where/how performance can be improved is highly appreciated! :)

bklooste commented 8 years ago

I don't know how Rust does it, but virtual dynamic dispatch into a shared/dynamic lib in C++ does a lookup on each call. Good JIT runtimes use polymorphic inline caches to avoid this at runtime, but obviously that's not an option here. Your only options are to ensure the calls happen rarely (e.g. chunky calls that send multiple commands at once), to change the library architecture, or possibly to statically link the lib.

hobofan commented 8 years ago

@bklooste: I am also not too sure how Rust handles that, but I think LTO (link-time optimization) may already take care of it; at least I haven't seen any significant overhead from that. As for statically linking the lib, that should generally be possible in the relevant plugins (cudnn in -nn and cublas in -blas should support static linking).
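For reference, Cargo exposes LTO as a profile setting, so it can be enabled explicitly rather than relying on defaults:

```toml
# Cargo.toml: enable link-time optimization for release builds,
# which lets the optimizer see across crate boundaries.
[profile.release]
lto = true
```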

What I originally meant by dynamic dispatch was the Rust kind, as explained in this part of the Rust book.

bklooste commented 8 years ago

Looking at the benchmark: 1000x dot product of two vectors of size 100 | 48,870 ns (+/- 499) | 15,226 ns (+/- 244)

The cost of dynamic dispatch is not typically huge; it's just a static indexed lookup for the right method, and you can certainly make 100M virtual calls in a second. However, it can hurt in a few cases: 1) an extremely tight method, like a micro-benchmark, can consist of many small calls. Here is the typical cost; as you can see it's not high (2 moves and an indexed call):

```asm
this.v1();
00000012 8B CE      mov  ecx,esi
00000014 8B 01      mov  eax,dword ptr [ecx]    ; fetch method table address
00000016 FF 50 38   call dword ptr [eax+38h]    ; fetch/call method address
```

2) Neither inlining nor whole-program optimization is possible. 3) You can't use link-time optimization for virtual calls; in fact shared libraries need an extra lookup (sometimes a hash), in C++ as well as Rust: http://eli.thegreenplace.net/2013/12/05/the-cost-of-dynamic-virtual-calls-vs-static-crtp-dispatch-in-c. This can be 20% of program execution. However, I was wrong about Rust: Rust libs tend to be compiled whole-program, i.e. they are not shared libs.

  1. Is IMHO not the case, but it seems the only candidate; maybe it's something Rust-specific. I will have a look at the assembly for the build.
  2. Should not be a factor, since the loops should be in the called library.
  3. Static linking should improve things, but both tests should see the same overhead from this.