avik-pal opened 1 month ago
Does #436 speed things up for you? I haven't merged it since I haven't seen an impact on benchmarks.
Let me run the benchmarks with that branch and check.
EDIT: doesn't seem to help
So the question is where the overheads are coming from. Maybe you can run a profile and compare where the time is being spent. You could also try static scheduling.
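For comparing where time is spent, a minimal profiling sketch with Julia's built-in `Profile` stdlib might look like the following (the function `f` here is a hypothetical stand-in for the kernel call being benchmarked, not the actual LuxLib code):

```julia
using Profile

# Hypothetical workload standing in for the KA kernel / loop version
# being compared; swap in the real call being benchmarked.
f(x) = sum(abs2, x)

x = rand(Float32, 1024, 1024)
f(x)                     # warm up so compilation is not profiled

Profile.clear()
@profile for _ in 1:100
    f(x)
end
Profile.print(format = :flat, sortedby = :count)
```

Running this once for the KA version and once for the loop version and diffing the flat profiles should show whether the extra time is in the kernel body or in launch/scheduling overhead.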
Could you isolate a particular case where you are seeing these overheads?
An MWE would be very useful. Also, the upcoming POCL CPU back-end may be interesting for performance, but it hasn't been benchmarked yet.
Oops this fell off my radar, I will create a self contained example this week
(Independent of this) once POCL is ready and I can access a CI server with it, I can trigger the benchmark suite for NN primitives in LuxLib to generate results in https://luxdl.github.io/LuxLib.jl/benchmarks/
See https://github.com/LuxDL/LuxLib.jl/pull/136 for some background context. The main motivation for me is to avoid code duplication between CPU and GPU versions. However, if you take a look at the benchmark comment on the PR (for `batchnorm` and `groupnorm`), you see somewhere between a 10x-40x slowdown between KA and the equivalent optimized loop version (note that it is simply using `@simd` or `@simd ivdep`, and nothing like LoopVectorization).

I think there are a couple of reasons for the slowdown:

- `@simd` annotations are missing (which causes a slowdown even in the loop version if I remove the annotations)

Potential solutions:

- `@simd` annotations (#436 seems to do this; not sure what the status of that is)