TUD-UCB-Boda / boda

clone/fork of public boda repo

First step to a Vulkan-backend #3

Open johannesschulte opened 7 years ago

johannesschulte commented 7 years ago

This is the first running version of the Vulkan backend. We are able to run the SGEMM kernel with it and seem to be producing correct results (or at least "boda cnn_op_info --cnn-func-sigs-fn='%(boda_test_dir)'/sgemm-ops-debug.txt --gen-data='(str_vals=(type=gen_data),nda_vals=(vi=(tn=float,v=0.0),mode=(tn=uint32_t,v=600)))' --rtc='(be=vk)' --rtc-comp='(be=ocl)' " returns without complaining).

The patch itself is quite messy; I hope the comments give an idea of what needs to be done to improve it. I've tried to keep changes to files other than vk_util.cc to a minimum and implemented the backend in the most naive way for now. This results in quite a lot of overhead, because we can't really leverage the Vulkan API and have to do many things sub-optimally (e.g. we need to create a command buffer for every kernel execution as well as for every buffer copy between the host and the device). There is a lot of room for optimization here, but it would require more invasive changes to the rest of the rtc component, so I haven't done that yet.
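For illustration, the naive per-run pattern described above looks roughly like the following sketch (not code from the patch; all handles are assumed to be set up elsewhere, and error checking is omitted):

```cpp
#include <vulkan/vulkan.h>

// Sketch: one freshly recorded command buffer per kernel execution,
// submitted and synchronously waited on -- the main overhead source.
void run_kernel_once( VkDevice dev, VkQueue queue, VkCommandPool pool,
                      VkPipeline pipeline, VkPipelineLayout layout,
                      VkDescriptorSet desc_set,
                      uint32_t gx, uint32_t gy, uint32_t gz ) {
  VkCommandBufferAllocateInfo ai = {};
  ai.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
  ai.commandPool = pool;
  ai.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
  ai.commandBufferCount = 1;
  VkCommandBuffer cb;
  vkAllocateCommandBuffers( dev, &ai, &cb ); // fresh buffer for every run()

  VkCommandBufferBeginInfo bi = {};
  bi.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
  bi.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
  vkBeginCommandBuffer( cb, &bi );
  vkCmdBindPipeline( cb, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline );
  vkCmdBindDescriptorSets( cb, VK_PIPELINE_BIND_POINT_COMPUTE, layout,
                           0, 1, &desc_set, 0, nullptr );
  vkCmdDispatch( cb, gx, gy, gz );           // one kernel launch
  vkEndCommandBuffer( cb );

  VkSubmitInfo si = {};
  si.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
  si.commandBufferCount = 1;
  si.pCommandBuffers = &cb;
  vkQueueSubmit( queue, 1, &si, VK_NULL_HANDLE );
  vkQueueWaitIdle( queue );                  // fully synchronous
  vkFreeCommandBuffers( dev, pool, 1, &cb );
}
```

Each call pays for command buffer allocation, recording, a submit, and a full queue drain, which is exactly the overhead an optimized backend would amortize.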

johannesschulte commented 7 years ago

With all these commits, the picture has changed quite a bit. Most of the work focused on doing more in the compile() function of the backend and less on every call to run(). Note that this partly requires that we know the local workgroup size at compile time. That is the case for SGEMM, but I don't know whether it's always the case.

For SGEMM I get the following runtimes: https://pastebin.com/zbZpAhMS . So compilation just seems to be slower compared to OpenCL. I've profiled the compile function on my laptop, and it spends virtually all of its time in two functions: the one compiling GLSL to SPIR-V (from the shaderc library) and vkCreateComputePipelines. I'm not sure exactly what the second one does, but a plausible explanation is that this is where the SPIR-V is compiled and optimized into architecture-specific assembly. Either way, we can't really do anything about these functions, so long compilation times (in this case twice as long as with the OpenCL backend) seem unavoidable with the Vulkan backend for now. On the other hand, it's very promising that actually running the kernels is significantly faster in all cases.
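To illustrate the compile-time constraint mentioned above (again, not code from the patch): in GLSL the workgroup size is normally a literal in the shader, but SPIR-V specialization constants allow deferring it to pipeline-creation time, which might relax the requirement for kernels whose workgroup size isn't known when the shader is compiled.

```glsl
#version 450

// Normally the local workgroup size is baked in as a compile-time literal:
//   layout( local_size_x = 16, local_size_y = 16 ) in;
//
// Alternative: specialization constants. The actual values are supplied
// via VkSpecializationInfo when the compute pipeline is created, i.e.
// inside vkCreateComputePipelines(), without recompiling the GLSL.
layout( local_size_x_id = 0, local_size_y_id = 1 ) in;

void main() {
  // gl_WorkGroupSize reflects whichever values were specialized in.
}
```

This would move the workgroup-size decision from shaderc compilation to pipeline creation; whether that helps overall compile time here is untested.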

moskewcz commented 7 years ago

i'll need to review this in more detail to say more, but i can answer a few quick questions here:

johannesschulte commented 7 years ago

We've decided to first try running full nets with the convolution kernel before trying the im2col+SGEMM-based approach. With this commit, I'm now able to run run_cnet with nin_imagenet. Some performance numbers are here: https://pastebin.com/5HDD7nmj . I'll be busy with exams until the end of September, but after that the next step is to profile this again and try to optimize the overall runtime, e.g. by recording multiple kernel invocations into a single command buffer.
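The batching idea at the end can be sketched as follows (a hypothetical illustration, not code from the repo; `launch_t` and `launches` are made-up names):

```cpp
#include <vulkan/vulkan.h>
#include <vector>

// Sketch: record a whole net's kernel launches into one command buffer
// and submit once, instead of one submit + wait per kernel.
struct launch_t {
  VkPipeline pipeline;
  VkPipelineLayout layout;
  VkDescriptorSet desc_set;
  uint32_t gx, gy, gz;
};

void record_net( VkCommandBuffer cb, std::vector<launch_t> const & launches ) {
  VkCommandBufferBeginInfo bi = {};
  bi.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
  vkBeginCommandBuffer( cb, &bi );
  for( auto const & l : launches ) {
    vkCmdBindPipeline( cb, VK_PIPELINE_BIND_POINT_COMPUTE, l.pipeline );
    vkCmdBindDescriptorSets( cb, VK_PIPELINE_BIND_POINT_COMPUTE, l.layout,
                             0, 1, &l.desc_set, 0, nullptr );
    vkCmdDispatch( cb, l.gx, l.gy, l.gz );
    // make this layer's writes visible to the next layer's reads
    VkMemoryBarrier mb = {};
    mb.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    mb.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    mb.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
    vkCmdPipelineBarrier( cb, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                          VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, 0,
                          1, &mb, 0, nullptr, 0, nullptr );
  }
  vkEndCommandBuffer( cb );
  // a single vkQueueSubmit() + fence wait for the whole net follows here
}
```

The per-dispatch submit and queue-drain costs are then paid once per net instead of once per kernel, at the cost of needing explicit barriers between dependent dispatches.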

johannesschulte commented 7 years ago

Yeah, that's just a lazy hack to be able to estimate kernel runtime with timer_t (I did the same for OpenCL and Vulkan). But you're right, I should finally implement GPU timers in Vulkan and then use the event-based system for measuring runtime.
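For reference, GPU-side timing in Vulkan is done with timestamp queries; a minimal sketch (not from the repo, handles assumed set up elsewhere):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Sketch: bracket a dispatch with two device-side timestamps.
double time_dispatch_ns( VkDevice dev, VkCommandBuffer cb,
                         uint32_t gx, uint32_t gy, uint32_t gz,
                         float timestamp_period /* VkPhysicalDeviceLimits::timestampPeriod */ ) {
  VkQueryPoolCreateInfo qi = {};
  qi.sType = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO;
  qi.queryType = VK_QUERY_TYPE_TIMESTAMP;
  qi.queryCount = 2;
  VkQueryPool qp;
  vkCreateQueryPool( dev, &qi, nullptr, &qp );

  // during command buffer recording (pipeline/descriptors bound elsewhere):
  vkCmdResetQueryPool( cb, qp, 0, 2 );
  vkCmdWriteTimestamp( cb, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, qp, 0 );
  vkCmdDispatch( cb, gx, gy, gz );
  vkCmdWriteTimestamp( cb, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, qp, 1 );

  // ... submit the command buffer and wait, then read back the ticks:
  uint64_t ts[2];
  vkGetQueryPoolResults( dev, qp, 0, 2, sizeof(ts), ts, sizeof(uint64_t),
                         VK_QUERY_RESULT_64_BIT | VK_QUERY_RESULT_WAIT_BIT );
  vkDestroyQueryPool( dev, qp, nullptr );
  // ticks -> nanoseconds via the device's timestampPeriod
  return double( ts[1] - ts[0] ) * timestamp_period;
}
```

Unlike the timer_t hack, this measures time on the device's own clock and excludes host-side submit and wait overhead.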