TUD-UCB-Boda / boda

clone/fork of public boda repo

First step to a Vulkan-backend #3

Open johannesschulte opened 7 years ago

johannesschulte commented 7 years ago

This is the first running version of the Vulkan backend. We are able to run the SGEMM kernel with it and seem to be producing correct results (or at least "boda cnn_op_info --cnn-func-sigs-fn='%(boda_test_dir)'/sgemm-ops-debug.txt --gen-data='(str_vals=(type=gen_data),nda_vals=(vi=(tn=float,v=0.0),mode=(tn=uint32_t,v=600)))' --rtc='(be=vk)' --rtc-comp='(be=ocl)' " returns without complaining).

The patch itself is quite messy; I hope the comments give an idea of what needs to be done to improve it. I've tried to keep changes to files other than vk_util.cc to a minimum and implemented the backend in the most naive way for now. This results in quite a lot of overhead, because we can't really leverage the Vulkan API and have to do many things sub-optimally (e.g. we need to create a command buffer for every kernel execution as well as for every buffer copy between the host and the device). There is a lot of room for optimization here, but it would require more invasive changes to the rest of the rtc component, so I haven't done that yet.
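For illustration, the naive per-run pattern described above looks roughly like the following sketch (not code from the patch; all handles are assumed to be set up elsewhere, and error checking is omitted):

```cpp
#include <vulkan/vulkan.h>

// Sketch: one freshly recorded command buffer per kernel execution,
// submitted and synchronously waited on -- the main overhead source.
void run_kernel_once( VkDevice dev, VkQueue queue, VkCommandPool pool,
                      VkPipeline pipeline, VkPipelineLayout layout,
                      VkDescriptorSet desc_set,
                      uint32_t gx, uint32_t gy, uint32_t gz ) {
  VkCommandBufferAllocateInfo ai = {};
  ai.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
  ai.commandPool = pool;
  ai.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
  ai.commandBufferCount = 1;
  VkCommandBuffer cb;
  vkAllocateCommandBuffers( dev, &ai, &cb ); // fresh buffer for every run()

  VkCommandBufferBeginInfo bi = {};
  bi.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
  bi.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
  vkBeginCommandBuffer( cb, &bi );
  vkCmdBindPipeline( cb, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline );
  vkCmdBindDescriptorSets( cb, VK_PIPELINE_BIND_POINT_COMPUTE, layout,
                           0, 1, &desc_set, 0, nullptr );
  vkCmdDispatch( cb, gx, gy, gz );           // one kernel launch
  vkEndCommandBuffer( cb );

  VkSubmitInfo si = {};
  si.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
  si.commandBufferCount = 1;
  si.pCommandBuffers = &cb;
  vkQueueSubmit( queue, 1, &si, VK_NULL_HANDLE );
  vkQueueWaitIdle( queue );                  // fully synchronous
  vkFreeCommandBuffers( dev, pool, 1, &cb );
}
```

Each call pays for command buffer allocation, recording, a submit, and a full queue drain, which is exactly the overhead an optimized backend would amortize.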

johannesschulte commented 7 years ago

With all these commits, the picture has changed quite a bit. Most of the work focused on doing more in the compile() function of the backend and less on every call to run(). Note that this partly requires that we know the local workgroup size at compile time. That is the case for SGEMM, but I don't know whether it's always the case.

For SGEMM I get the following runtimes: https://pastebin.com/zbZpAhMS . So compilation just seems to be slower compared to OpenCL. I've profiled the compile function on my laptop, and it spends virtually all of its time in two functions: the one compiling GLSL to SPIR-V (from the shaderc library) and vkCreateComputePipelines. I'm not sure exactly what the second one does, but a plausible explanation is that this is where the SPIR-V is compiled and optimized into architecture-specific assembly. Either way, we can't really do anything about these functions, so long compilation times (in this case twice as long as with the OpenCL backend) seem unavoidable with the Vulkan backend for now. On the other hand, it's very promising that actually running the kernels is significantly faster in all cases.
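To illustrate the compile-time constraint mentioned above (again, not code from the patch): in GLSL the workgroup size is normally a literal in the shader, but SPIR-V specialization constants allow deferring it to pipeline-creation time, which might relax the requirement for kernels whose workgroup size isn't known when the shader is compiled.

```glsl
#version 450

// Normally the local workgroup size is baked in as a compile-time literal:
//   layout( local_size_x = 16, local_size_y = 16 ) in;
//
// Alternative: specialization constants. The actual values are supplied
// via VkSpecializationInfo when the compute pipeline is created, i.e.
// inside vkCreateComputePipelines(), without recompiling the GLSL.
layout( local_size_x_id = 0, local_size_y_id = 1 ) in;

void main() {
  // gl_WorkGroupSize reflects whichever values were specialized in.
}
```

This would move the workgroup-size decision from shaderc compilation to pipeline creation; whether that helps overall compile time here is untested.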

moskewcz commented 7 years ago

i'll need to review this in more detail to say more, but i can answer a few quick questions here:

johannesschulte commented 7 years ago

We've decided to first try running full nets with the convolution kernel before trying the im2col+SGEMM-based approach. With this commit, I'm now able to run run_cnet with nin_imagenet. Some performance numbers are here: https://pastebin.com/5HDD7nmj . I'll be busy with exams until the end of September, but after that the next step is to profile this again and try to optimize the overall runtime, e.g. by recording multiple kernel invocations into a single command buffer.
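The batching idea at the end can be sketched as follows (a hypothetical illustration, not code from the repo; `launch_t` and `launches` are made-up names):

```cpp
#include <vulkan/vulkan.h>
#include <vector>

// Sketch: record a whole net's kernel launches into one command buffer
// and submit once, instead of one submit + wait per kernel.
struct launch_t {
  VkPipeline pipeline;
  VkPipelineLayout layout;
  VkDescriptorSet desc_set;
  uint32_t gx, gy, gz;
};

void record_net( VkCommandBuffer cb, std::vector<launch_t> const & launches ) {
  VkCommandBufferBeginInfo bi = {};
  bi.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
  vkBeginCommandBuffer( cb, &bi );
  for( auto const & l : launches ) {
    vkCmdBindPipeline( cb, VK_PIPELINE_BIND_POINT_COMPUTE, l.pipeline );
    vkCmdBindDescriptorSets( cb, VK_PIPELINE_BIND_POINT_COMPUTE, l.layout,
                             0, 1, &l.desc_set, 0, nullptr );
    vkCmdDispatch( cb, l.gx, l.gy, l.gz );
    // make this layer's writes visible to the next layer's reads
    VkMemoryBarrier mb = {};
    mb.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    mb.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    mb.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
    vkCmdPipelineBarrier( cb, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                          VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, 0,
                          1, &mb, 0, nullptr, 0, nullptr );
  }
  vkEndCommandBuffer( cb );
  // a single vkQueueSubmit() + fence wait for the whole net follows here
}
```

The per-dispatch submit and queue-drain costs are then paid once per net instead of once per kernel, at the cost of needing explicit barriers between dependent dispatches.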

johannesschulte commented 7 years ago

Yeah, that's just a lazy hack to be able to estimate kernel runtime with timer_t (I did the same for OpenCL and Vulkan). But you're right, I should finally implement GPU timers in Vulkan and then use the event-based system for measuring runtime.
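For reference, GPU-side timing in Vulkan is done with timestamp queries; a minimal sketch (not from the repo, handles assumed set up elsewhere):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Sketch: bracket a dispatch with two device-side timestamps.
double time_dispatch_ns( VkDevice dev, VkCommandBuffer cb,
                         uint32_t gx, uint32_t gy, uint32_t gz,
                         float timestamp_period /* VkPhysicalDeviceLimits::timestampPeriod */ ) {
  VkQueryPoolCreateInfo qi = {};
  qi.sType = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO;
  qi.queryType = VK_QUERY_TYPE_TIMESTAMP;
  qi.queryCount = 2;
  VkQueryPool qp;
  vkCreateQueryPool( dev, &qi, nullptr, &qp );

  // during command buffer recording (pipeline/descriptors bound elsewhere):
  vkCmdResetQueryPool( cb, qp, 0, 2 );
  vkCmdWriteTimestamp( cb, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, qp, 0 );
  vkCmdDispatch( cb, gx, gy, gz );
  vkCmdWriteTimestamp( cb, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, qp, 1 );

  // ... submit the command buffer and wait, then read back the ticks:
  uint64_t ts[2];
  vkGetQueryPoolResults( dev, qp, 0, 2, sizeof(ts), ts, sizeof(uint64_t),
                         VK_QUERY_RESULT_64_BIT | VK_QUERY_RESULT_WAIT_BIT );
  vkDestroyQueryPool( dev, qp, nullptr );
  // ticks -> nanoseconds via the device's timestampPeriod
  return double( ts[1] - ts[0] ) * timestamp_period;
}
```

Unlike the timer_t hack, this measures time on the device's own clock and excludes host-side submit and wait overhead.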