gat3way / hashkill

hashkill password recovery tool
www.gat3way.eu/hashkill

Compiling kernels at runtime (with caching) #31

Open katyo opened 11 years ago

katyo commented 11 years ago

In the common case I can't build a hashkill that is compatible with both nvidia and amd on my buildhost, which has no nvidia hardware. Building all kernels for all hardware platforms from source is too long a process on my laptop (over three hours). I would like to implement runtime compilation of the OpenCL programs, with caching of the compiled code, as a configure option, but I see a few problems here. First, can the compile flags be moved into the kernel source?
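Roughly what I have in mind (a minimal sketch, not hashkill code; it assumes a single device and a hypothetical cache path, with error checking mostly omitted):

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Build a program from .cl source once and store the device binary, so
 * later runs can load it with clCreateProgramWithBinary() instead of
 * recompiling. */
static cl_program build_and_cache(cl_context ctx, cl_device_id dev,
                                  const char *src, const char *cache_path)
{
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    if (err != CL_SUCCESS)
        return NULL;

    if (clBuildProgram(prog, 1, &dev, "", NULL, NULL) != CL_SUCCESS) {
        clReleaseProgram(prog);
        return NULL;
    }

    /* Query the compiled binary for our single device and write it out. */
    size_t bin_size;
    clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(bin_size),
                     &bin_size, NULL);
    unsigned char *bin = malloc(bin_size);
    clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(bin), &bin, NULL);

    FILE *f = fopen(cache_path, "wb");
    if (f) {
        fwrite(bin, 1, bin_size, f);
        fclose(f);
    }
    free(bin);
    return prog;
}

On the next run, the cached file would be loaded and passed to clCreateProgramWithBinary() for the same device, skipping the compiler entirely.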

gat3way commented 11 years ago

No, but that's not the biggest problem. Such a change would be a relatively long effort, because it would involve changing all the ocl_*.c sources. But that's probably for the good. What I am really concerned about is that it would become a problem when I start working on distributed attacks. My idea is to implement something like a lightweight version of VirtualCL, where only the needed subset of the OpenCL functionality is implemented, using a custom networking protocol that is optimized for hashkill's needs (mostly rule attack optimizations).

In that case, managing cached precompiled kernels becomes problematic. Where do we compile the source? The correct answer would be on each node. That, however, would involve transferring the .cl sources, which are much larger than the precompiled, compressed binaries, and the time needed to build them on each host would be high. Another approach would be to build them on the master host and then transfer the binaries to the slaves, but that is harder to implement and still involves some issues.

That's probably the most significant reason I am not yet eager to switch to cached binaries.

katyo commented 11 years ago

Today we have many types of OpenCL architectures and platforms, so we must compile all programs for each platform. I made some estimates and concluded that the total overhead is too big.

Source (uncompressed):

$ ls src/kernels/*.cl | wc -l
198
$ wc -c src/kernels/*.cl | grep 'total$'
10736247 total

198 sources, ~11 MB

Source (compressed with xz):

$ tar -cJf - src/kernels/*.cl | wc -c
118212

~118 KB

Binary (all types of platforms):

$ ls src/kernels/compiler/*.{bin,ptx} | wc -l
2409
$ wc -c src/kernels/compiler/*.{bin,ptx} | grep 'total$'
28532285 total

2409 binaries, ~29 MB

Besides, we usually don't need to compile for all CL platforms on the real workstation where the program will run. In the common case it has 1-3 different OpenCL platforms, depending on the hardware.
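For illustration, this is all it takes to see what a host actually exposes (a sketch using the standard OpenCL host API, error checking omitted):

#include <stdio.h>
#include <CL/cl.h>

/* List the OpenCL platforms present on this host and count their devices.
 * On a typical workstation the loop runs 1-3 times, so compiling kernels
 * for every supported platform is mostly wasted work. */
int main(void)
{
    cl_platform_id plats[8];
    cl_uint nplats = 0;
    clGetPlatformIDs(8, plats, &nplats);

    for (cl_uint i = 0; i < nplats; i++) {
        char name[256];
        cl_uint ndevs = 0;
        clGetPlatformInfo(plats[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        clGetDeviceIDs(plats[i], CL_DEVICE_TYPE_ALL, 0, NULL, &ndevs);
        printf("%s: %u device(s)\n", name, ndevs);
    }
    return 0;
}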

gat3way commented 11 years ago

Are you talking about the disk space overhead, or the eventual networking overhead when clustering?

katyo commented 11 years ago

It depends on what clustering means for us. Maybe we are talking about different things…

As I see it, each machine in a cluster has its own CL hardware, which in the common case differs from the others'. I think each cluster node may want to compile only the OpenCL programs needed for the hardware platforms it actually has, and only the first time the corresponding program is required. That is not a big runtime overhead if caching is used.
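For example, the cache could be keyed on the device and driver, so each node caches only the binaries for its own hardware and a driver update invalidates them automatically (a sketch; the naming scheme is just my assumption):

#include <stdio.h>
#include <CL/cl.h>

/* Derive a per-device cache file name from the kernel name plus the
 * device name and driver version. Real code would have to sanitize the
 * spaces in device names. */
static void cache_key(cl_device_id dev, const char *kernel,
                      char *out, size_t outlen)
{
    char devname[128], drvver[64];
    clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(devname), devname, NULL);
    clGetDeviceInfo(dev, CL_DRIVER_VERSION, sizeof(drvver), drvver, NULL);
    snprintf(out, outlen, "%s-%s-%s.bin", kernel, devname, drvver);
}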

gat3way commented 11 years ago

Correct, and there goes the problem. The VCL approach works by queueing OpenCL commands to remote hosts and then receiving the function output over the net. The master host "sees" the remote hosts' devices as local GPUs, and to hashkill that's transparent (there is a "translation" layer which queues requests to remote hosts).
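Purely to illustrate that translation layer (nothing like this exists yet; the opcodes and layout here are invented for the sketch):

#include <stdint.h>

/* Hypothetical wire format: the master serializes one message per OpenCL
 * call, the slave replays it against its local runtime and sends back the
 * status plus any output data. */
enum rpc_op {
    RPC_CREATE_BUFFER,   /* mirrors clCreateBuffer         */
    RPC_ENQUEUE_WRITE,   /* mirrors clEnqueueWriteBuffer   */
    RPC_ENQUEUE_NDRANGE, /* mirrors clEnqueueNDRangeKernel */
    RPC_ENQUEUE_READ     /* mirrors clEnqueueReadBuffer    */
};

struct rpc_header {
    uint32_t op;      /* one of enum rpc_op                    */
    uint32_t handle;  /* id of the remote buffer/kernel object */
    uint64_t payload; /* byte length of the data that follows  */
};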

Now the first problem is that remote hosts do not share common storage with the master host, so the master host does not "know" whether the kernel binary is cached on the remote host. The protocol needs to be extended to account for that, but whatever you do, hashkill itself would need to be changed to accommodate such a change.

The other problem is related to proper error handling in the context of clustered hashkill. Building from source is always more prone to different OS- and driver-related issues than loading precompiled kernels. Nodes are likely to run different driver versions, Linux flavors, etc. What strategy should we take if a node fails to compile a kernel from source? This question is hard even if precompiled binaries were sent, though.

Having prebuilt binaries on the master node also guarantees all the nodes will execute one and the same code, which eliminates some issues originating from differing OpenCL runtime implementations. The OpenCL compiler frontend itself (at least with AMD) is notably buggy, with crashes and wrong binaries produced quite often. It is hard to target a stable version because there is no stable version. What we can do is focus on the best one and try to work around compiler bugs (which sometimes even requires disabling compiler optimizations or changing the code in funny ways).
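As one example of such a workaround, the build call can retry with optimizations disabled when the frontend chokes (a sketch; -cl-opt-disable is a standard clBuildProgram option, the retry policy is just an idea):

#include <CL/cl.h>

/* Try a normal optimized build first; if the (often buggy) compiler
 * frontend fails, retry once with optimizations disabled. */
static cl_int build_with_fallback(cl_program prog, cl_device_id dev)
{
    cl_int err = clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    if (err == CL_BUILD_PROGRAM_FAILURE)
        err = clBuildProgram(prog, 1, &dev, "-cl-opt-disable", NULL, NULL);
    return err;
}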

Well, unfortunately it gets quite complicated when you try to distribute that. In fact I still have a lot of unresolved design dilemmas regarding that :(

r3mbr4ndt commented 11 years ago

There is no gain in "on-demand" compiling. Compiling once and distributing the kernels is always faster than compiling on demand, even if you have a caching system. Compiling some kernels can take quite a while, and gat3way also explained the drawbacks in the SDKs. I would vote against on-demand compiling; it was removed a long time ago. You don't want to wait minutes before hashkill starts just because you'd like to crack a rar file... (not knowing whether the compiler is stuck or your computer crashed...).

peterclemenko commented 11 years ago

You can apparently support offline compiling for all AMD cards, including ones that aren't plugged in, with a build change.

http://devgurus.amd.com/thread/153189
http://devgurus.amd.com/thread/166543
http://developer.amd.com/resources/documentation-articles/knowledge-base/?ID=115

gat3way commented 11 years ago

Offline device compilation is used already - this is how we build for AMD. For NVidia, unfortunately, it's not that easy: you can only compile for architectures up to the one already available on the system (e.g. if you have an sm_20 GPU, you can build sm_1x and sm_20, but obviously not sm_21 or sm_30).

peterclemenko commented 11 years ago

Maybe this will help: https://github.com/ljbade/clcc

gat3way commented 11 years ago

Nope, it does the same (the limitations are in the NV OpenCL runtime).

peterclemenko commented 11 years ago

Damn, that sucks. Is there a way to do it at runtime?