CNugteren / CLBlast

Tuned OpenCL BLAS
Apache License 2.0

Re-thinking the auto-tuning #169

Open CNugteren opened 7 years ago

CNugteren commented 7 years ago

Hereby I want to open a discussion on improving and changing the current tuning infrastructure and methods for CLBlast. There are a couple of drawbacks pointed out by various users of the library:

  1. Using the OpenCL device name can sometimes be too generic. For example, in the case of AMD GPUs there are many different varieties of e.g. the Hawaii architecture. The tuner currently does capture other properties (e.g. clock speed and number of compute units), so it is possible to further sub-divide the tuning parameters for specific devices. See also the remarks made by @ulfgard and @blueberry in issue #1.

  2. Using the OpenCL device name can sometimes be too specific. For example, in the case of NVIDIA GPUs or Intel CPUs it would be beneficial to share tuning results across devices from the same architecture, e.g. Maxwell or Skylake, in case there are no tuning results yet for a specific device (e.g. a GTX 970 should have optimal parameters similar to a GTX 980's). This should be possible with the current infrastructure; it would just require a look-up table of device names to architectures (see the sketch after this list). This was also pointed out earlier on by @psyhtest, if I recall correctly.

  3. The current tuner CLTune cannot tune multi-kernel routines such as DOT/MIN/MAX/SUM. Nor can it tune host parameters, such as deciding which GEMM kernel to choose (direct versus indirect); this is currently done manually. Another tuner, written in Python and much simpler to use, can actually do both.

  4. The tuning data is for a single input size with fixed parameters (e.g. 1024x1024x1024 for GEMM). There are possibilities to tune across sizes by learning a model, as done in ISAAC (see the PDF or video of a presentation), but that is quite involved. Of course we could also tune for 2 or 3 cases, but is that sufficient? What about rectangular matrices?

  5. GEMM is nowadays used a lot in deep-learning applications to express convolutions. CLBlast now also supports a batched GEMM, see #168 and #95. However, this might require tuning for more specific deep-learning use-cases. Perhaps it is possible to create a specific deep-learning interface to GEMM or batched GEMM and have separately tuned parameters for it? Not sure. An alternative would be to implement & tune a dedicated convolution implementation, as done by @ulfgard in Remora.
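
For point 2, a minimal sketch of such a look-up table: a plain map from OpenCL device names to an architecture name under which tuning results could be shared. The device and architecture names are illustrative examples only, not entries from the actual tuning database.

    // Hypothetical look-up table from OpenCL device names to architectures;
    // a real table would be generated from the tuning database.
    #include <string>
    #include <unordered_map>

    const std::unordered_map<std::string, std::string> device_to_architecture = {
        {"GeForce GTX 970", "Maxwell"},
        {"GeForce GTX 980", "Maxwell"},
        {"Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz", "Skylake"},
    };

    // Falls back to the device name itself when the architecture is unknown,
    // preserving the current per-device behaviour.
    std::string ArchitectureFor(const std::string &device_name) {
      const auto entry = device_to_architecture.find(device_name);
      return (entry != device_to_architecture.end()) ? entry->second : device_name;
    }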

Any feedback and ideas are welcome!

fvella commented 7 years ago

Hi @CNugteren, we are already working on a couple of the points you mentioned. Our solution extends CLBlast to address points 3, 4, and 5.

Ulfgard commented 7 years ago

Hi,

I would not start too much action based on my words. I first have to figure out what is wrong with my setup before I can proceed. I will play around with tuning of the copy-matrix kernels, and if this does not solve my performance problems, I will consider my device/setup/tuning broken.

Personally I would opt for an easy way to store a tuning in a tuning file which can then be used to initialize the library at the start of the program, e.g.:

clblast::loadTuning("tuning.json")

This gives every user the ability to create an optimal device/application-specific tuning. This would solve 1.
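
As a rough sketch (this is not CLBlast's actual API): such a loadTuning helper could be layered on top of the library's existing OverrideParameters entry point. The per-line "kernel parameter value" text format below is a hypothetical stand-in for JSON, just to keep the sketch dependency-free.

    // Hypothetical loadTuning() helper built on clblast::OverrideParameters().
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <unordered_map>

    #include <clblast.h>

    clblast::StatusCode loadTuning(const std::string &file_name,
                                   const cl_device_id device,
                                   const clblast::Precision precision) {
      std::ifstream file(file_name);
      std::string line;
      // Group the parameters per kernel first: OverrideParameters expects the
      // complete parameter set for a kernel in a single call.
      std::unordered_map<std::string, std::unordered_map<std::string, size_t>> tuning;
      while (std::getline(file, line)) {
        std::istringstream words(line);
        std::string kernel, parameter;
        size_t value;
        if (words >> kernel >> parameter >> value) {
          tuning[kernel][parameter] = value;
        }
      }
      for (const auto &kernel_entry : tuning) {
        const auto status = clblast::OverrideParameters(
            device, kernel_entry.first, precision, kernel_entry.second);
        if (status != clblast::StatusCode::kSuccess) { return status; }
      }
      return clblast::StatusCode::kSuccess;
    }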

CNugteren commented 7 years ago

@Ulfgard OK, fair enough. Do keep me updated on any potential issues though.

About your suggestion: that could indeed make things a bit easier to use. Another improvement, in my opinion, would be to either integrate the CLTune code (e.g. as a git submodule), or to replace it with the earlier-mentioned Python tuner. In both cases this removes an external dependency and the need for a separate compilation & installation of a library.

Ulfgard commented 7 years ago

Sorry for my late reply. I am currently working on a new project and devote all my time to it.

I could not get my setup to give decent results. I will try my benchmark/software on an NVIDIA platform next; this should give an indication of whether it is my software/benchmark or my setup. Sorry for wasting some of your time!

CNugteren commented 6 years ago

Just a status update here on the above points:

  1. Solved: it is possible to use the actual AMD board name.
  2. Solved: there is now a device & architecture hierarchy.
  3. Partially solved: there is now a GEMM routine tuner to choose between the two methods.
  4. Not implemented yet.
  5. Not implemented yet.

kodonnell commented 6 years ago

Regarding tuning, have you considered doing this in Python? Performance shouldn't (?) be a problem, as the underlying kernels are the same. It's arguably going to reach a wider audience, especially as Python is the language of choice for the popular AI platforms these days. Of course, that may not be the target market, etc., but that's where I'd put my vote.

kodonnell commented 6 years ago

Following up on that comment: you could also consider not doing the tuning directly, but allowing users to implement their own searches (though you'd probably have some default ones implemented). That is, outsource the parameter-space searching (which is quite a complex problem for large search spaces), and keep CLBlast focused just on BLAS kernels (that are tuning-friendly). For example, the kernel_tuner project could probably be used to do nearly all of the tuning. That'd leave a smaller/simpler code base, and more time to work on a more optimised library, on adding new kernels, etc. Then again, maybe the part people enjoy is the tuning. However, for users like me, all I really care about is the underlying kernels and their performance - I don't mind how the tuning happens.

Anyway, just a thought.

CNugteren commented 6 years ago

Thanks for sharing your feedback. A few responses:

have you considered doing this in python

Yes, that is possible as well. But it is in C++ now and it works; I'd rather not re-write everything and have to use multiple languages in one project.

It's arguably going to reach a wider audience

Not sure how that will work. People still have to compile the library with C++, right? So while they are doing that, the tuner binaries are being compiled as well, so no extra effort is required.

That'd leave a smaller/simpler code base, and more time to work on a more optimised library

Actually, I've done the reverse recently. The tuner was using the external CLTune project, but by integrating everything we don't have an extra dependency and don't need to maintain two code-bases. Think of things like the way kernels are compiled: that should be a 100% match between library and tuner. This way it uses the same code, which is quite convenient. But people are free to write their own tuners if they want, of course :-)

but allowing users to implement their own searches

CLTune had search strategies implemented but they proved not so useful. But yeah, that flexibility is missing now.

But all in all, the main issue currently is tuning for different sizes on one device. There was some interesting work done on this recently, and I'm also collaborating with dividiti on an attempt to solve it.

kodonnell commented 6 years ago

Cool - no problem.

Not sure how that will work.

I was more referring to a suspicion that most users interested in OpenCL will be in machine learning etc., and that most of those are Python users, since that's what most platforms support. TensorFlow, for example, is C++ underneath, but nearly all users use the Python API (and in most cases users don't have to compile anything, given most things are distributed pre-compiled via conda).

But yeah, that flexibility is missing now.

Suits me - I'm happy to run a heap of combinations even if it takes a few days.

But all in all currently the main issue is with tuning for different sizes on one device

Naively, I'm planning to tune the size for every layer in my net, and use OverrideParameters to set it each time.
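
For what it's worth, a sketch of that per-layer idea (only OverrideParameters itself is CLBlast API here; the helper function, parameter names, and values are illustrative):

    #include <string>
    #include <unordered_map>

    #include <clblast.h>

    // Applies previously-tuned GEMM parameters for one layer's matrix sizes.
    // "Xgemm" is the name of CLBlast's (indirect) GEMM kernel; the map must
    // contain the kernel's full parameter set, tuned offline per size.
    clblast::StatusCode SetGemmParametersForLayer(
        const cl_device_id device,
        const std::unordered_map<std::string, size_t> &tuned_parameters) {
      return clblast::OverrideParameters(device, "Xgemm",
                                         clblast::Precision::kSingle,
                                         tuned_parameters);
    }

    // Example call (illustrative values, not a complete parameter set):
    //   SetGemmParametersForLayer(device, {{"MWG", 64}, {"NWG", 64}, ...});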

kodonnell commented 6 years ago

@CNugteren - possibly related to this issue: how do you make the tuning database useful across releases? E.g. everyone might report their results for GEMM, and then after a big code update they're no longer valid (or are still valid, but give slow results), etc.

CNugteren commented 6 years ago

Well, the best tuning results themselves are integrated as C++ code into the library, so they automatically correspond to a certain release. When tuning for older/newer releases, there might indeed be some inconsistencies in theory. However, in practice I don't think that's the case: 1) the kernel code is pretty stable, and 2) new optimisations typically come with new parameters, thus remaining backwards compatible.

Worst case, people have to re-tune in a new release if they want to get better performance, but they shouldn't have to re-tune to keep the performance they already had: a certain parameter combination shouldn't lead to a performance decrease in a new version.