katyo opened this issue 11 years ago
No, but that's not the biggest problem. Such a change would be a relatively long-term effort, because it would involve changing all the ocl_*.c sources. But that's probably for the better. What I am really concerned about is that it would become a problem when I start working on distributed attacks. My idea is to implement something like a lightweight version of VirtualCL, where only the needed subset of the OpenCL functionality is implemented, using a custom networking protocol optimized for hashkill's needs (mostly rule attack optimizations).
In that case, management of cached precompiled kernels becomes problematic. Where do we compile the source? The correct answer would be on each node. That, however, would involve transferring the .cl sources, which are much larger than precompiled, compressed binaries, and the time needed to build them on each host would be high. Another approach would be to build them on the master host and then transfer the binaries to the slaves, but that is harder to implement and still has its issues.
That's probably the most significant reason I am not yet eager to switch to cached binaries.
Today we have many types of OpenCL architectures and platforms, so we would have to compile every program for each platform. I made some estimates and concluded that the total overhead is too big.
Source (uncompressed):
$ ls src/kernels/*.cl | wc -l
198
$ wc -c src/kernels/*.cl | grep 'total$'
10736247 total
198 sources ~11Mb
Source (compressed with xz):
$ tar -cJf - src/kernels/*.cl | wc -c
118212
~118Kb
Binary (all types of platforms):
$ ls src/kernels/compiler/*.{bin,ptx} | wc -l
2409
$ wc -c src/kernels/compiler/*.{bin,ptx} | grep 'total$'
28532285 total
2409 binaries ~29Mb
Besides, we usually don't need to compile for all CL platforms on a real workstation where the program is run. In the common case there are 1-3 different OpenCL platforms, depending on the hardware.
Are you talking about the disk space overhead or the eventual networking overhead when clustering?
That depends on what clustering means for us. Maybe we are talking about different things…
As I see it, each machine in the cluster has its own CL hardware, which in the common case differs from the others'. I think each cluster node would want to compile only the OpenCL programs needed for the hardware platforms it actually has, and only the first time the corresponding programs are required. That isn't a big runtime overhead if caching is used.
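For illustration, here is a minimal sketch of what such per-node compile-and-cache logic could look like with the standard OpenCL API. It assumes a single-device context; load_file/save_file are hypothetical helpers, and cache-key handling (driver version, build options) is left out.

/* Per-node "compile once, cache the device binary" sketch. */
#include <stdlib.h>
#include <CL/cl.h>

/* Hypothetical helpers: read/write a whole file into/from memory. */
static unsigned char *load_file(const char *path, size_t *len);
static int save_file(const char *path, const unsigned char *data, size_t len);

cl_program get_cached_program(cl_context ctx, cl_device_id dev,
                              const char *cl_path, const char *cache_path,
                              const char *build_opts)
{
    cl_int err;
    size_t bin_len = 0;
    unsigned char *bin = load_file(cache_path, &bin_len);
    cl_program prog;

    if (bin) {
        /* Cache hit: recreate the program from the stored device binary. */
        cl_int bin_status;
        prog = clCreateProgramWithBinary(ctx, 1, &dev, &bin_len,
                                         (const unsigned char **)&bin,
                                         &bin_status, &err);
        free(bin);
        if (err == CL_SUCCESS && bin_status == CL_SUCCESS &&
            clBuildProgram(prog, 1, &dev, build_opts, NULL, NULL) == CL_SUCCESS)
            return prog;
        /* Stale or incompatible binary (e.g. after a driver upgrade):
         * fall through and rebuild from source. */
    }

    /* Cache miss: compile from source, then store the binary for next time. */
    size_t src_len = 0;
    char *src = (char *)load_file(cl_path, &src_len);
    if (!src)
        return NULL;
    prog = clCreateProgramWithSource(ctx, 1, (const char **)&src, &src_len, &err);
    free(src);
    if (err != CL_SUCCESS ||
        clBuildProgram(prog, 1, &dev, build_opts, NULL, NULL) != CL_SUCCESS)
        return NULL;

    /* Retrieve the device binary and write it to the cache file. */
    if (clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES,
                         sizeof(bin_len), &bin_len, NULL) == CL_SUCCESS &&
        (bin = malloc(bin_len)) != NULL) {
        if (clGetProgramInfo(prog, CL_PROGRAM_BINARIES,
                             sizeof(bin), &bin, NULL) == CL_SUCCESS)
            save_file(cache_path, bin, bin_len);
        free(bin);
    }
    return prog;
}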
Correct, and that's where the problem lies. The VCL approach works by queueing OpenCL commands to remote hosts and then receiving the function output over the network. The master host "sees" the remote hosts' devices as local GPUs, and to hashkill that's transparent (there is a "translation" layer which queues requests to remote hosts).
Now the first problem is that remote hosts do not share common storage with the master host, thus the master host does not "know" whether the kernel binary is cached on a remote host. The protocol needs to be extended to account for that, but whatever you do, hashkill itself would need to be changed to accommodate such a change.
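To make that concrete, such a protocol extension could look roughly like the sketch below. None of these message types or field names exist in hashkill or VCL; the idea is just that the master keys the query on hashes of the kernel source and build options and lets the node answer whether it already has a matching cached binary.

/* Hypothetical cache-query extension to the master<->node protocol. */
#include <stdint.h>

enum cache_msg_type {
    MSG_KERNEL_QUERY = 1,  /* master -> node: "do you have this build?"    */
    MSG_KERNEL_HAVE  = 2,  /* node -> master: cached binary is present     */
    MSG_KERNEL_NEED  = 3,  /* node -> master: please send source or binary */
    MSG_KERNEL_BLOB  = 4,  /* master -> node: kernel payload follows       */
    MSG_BUILD_FAILED = 5   /* node -> master: local build/load failed      */
};

struct kernel_query {
    uint8_t type;             /* MSG_KERNEL_QUERY                            */
    uint8_t source_sha1[20];  /* hash of the .cl source                      */
    uint8_t opts_sha1[20];    /* hash of the build options (-D defines etc.) */
    char    platform[32];     /* node-reported platform/driver identifier    */
};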
The other problem is related to proper error handling in the context of a clustered hashkill. Building from source is always more prone to OS- and driver-related issues than loading precompiled kernels. Nodes are likely to run different driver versions, Linux flavors, etc. What strategy should we take if a node fails to compile a kernel from source? That question is hard even if precompiled binaries were sent, though.
Having prebuilt binaries on the master node also guarantees that all nodes execute exactly the same code, which eliminates some issues originating from differing OpenCL runtime implementations. The OpenCL compiler frontend itself (at least with AMD) is notably buggy; crashes and wrong binaries occur quite often. It is hard to target a stable version because there is no stable version. What we can do is focus on the best one and try to work around compiler bugs (which sometimes even requires disabling compiler optimizations or changing the code in funny ways).
Well, unfortunately it gets quite complicated when you try to distribute that. In fact, I still have a lot of unresolved design dilemmas regarding that :(
There is no gain in "on-demand" compiling. Compiling once and distributing the kernels is always faster than "on-demand" compiling, even if you have a caching system. Compiling some kernels can take quite a while, and gat3way also explained the SDK-related drawbacks. I would vote against "on-demand" compiling; it was removed a long time ago. You don't want to wait minutes before hashkill starts just because you'd like to crack a rar file... (not knowing whether the compiler is stuck or your computer has crashed...).
You can apparently support offline compiling for all AMD cards, including ones that aren't plugged in, with a build change.
http://devgurus.amd.com/thread/153189 http://devgurus.amd.com/thread/166543 http://developer.amd.com/resources/documentation-articles/knowledge-base/?ID=115
Offline device compilation is already used - that is how we build for AMD. For NVidia, unfortunately, it's not that easy: you can only compile for architectures up to the one that is already available on the system (e.g. if you have an sm_20 GPU, you can build for sm_1x and sm_20, but obviously not for sm_21 or sm_30).
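For reference, the AMD offline path relies on the cl_amd_offline_devices extension; below is a rough sketch of creating such a context. The exact setup hashkill uses may differ, and the 0x403F fallback define is only needed for older headers.

/* Create an AMD context that exposes every device the compiler knows about,
 * plugged in or not, so binaries for unplugged GPUs can be produced on one
 * build host. */
#include <CL/cl.h>
#ifndef CL_CONTEXT_OFFLINE_DEVICES_AMD
#define CL_CONTEXT_OFFLINE_DEVICES_AMD 0x403F
#endif

cl_context create_offline_context(cl_platform_id amd_platform)
{
    cl_int err;
    cl_context_properties props[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)amd_platform,
        CL_CONTEXT_OFFLINE_DEVICES_AMD, (cl_context_properties)1,
        0
    };
    /* The offline devices then show up via clGetContextInfo(ctx,
     * CL_CONTEXT_DEVICES, ...) and can be used as clBuildProgram targets. */
    return clCreateContextFromType(props, CL_DEVICE_TYPE_ALL, NULL, NULL, &err);
}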
Maybe this will help: https://github.com/ljbade/clcc
Nope, it does the same thing (the limitations are in the NV OpenCL runtime).
Damn, that sucks. Is there a way to do it at runtime?
In the common case I can't build a hashkill that is compatible with both NVidia and AMD on my build host, since it has no NVidia hardware. Building all kernels for all hardware platforms from source is too long a process on my laptop (over three hours). I would like to implement runtime compilation of OpenCL programs, with caching of the compiled code, as a configure option, but I see a few problems here. First, can compile flags be moved into the kernel sources?
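One possible partial answer to that last question, assuming "compile flags" means the -D macro definitions: those can be moved into the .cl source itself with #ifndef-guarded defaults, whereas options that affect the compiler (-cl-mad-enable, -cl-opt-disable, -cl-fast-relaxed-math, ...) cannot be embedded and would still have to be recorded alongside any cached binary. A made-up kernel fragment, not one of hashkill's:

/* Instead of building with "-DHASH_LEN=16 -DUNROLL=1": */
#ifndef HASH_LEN
#define HASH_LEN 16     /* default used when no -D option is passed */
#endif
#ifndef UNROLL
#define UNROLL 1
#endif

__kernel void example(__global const uint *in, __global uint *out)
{
    size_t i = get_global_id(0);
    out[i] = in[i] ^ (uint)HASH_LEN;   /* dummy body, just to compile */
}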