JuliaGPU / KernelAbstractions.jl

Heterogeneous programming in Julia
MIT License

How can we make KA fast on CPUs? #509

Open avik-pal opened 1 month ago

avik-pal commented 1 month ago

See https://github.com/LuxDL/LuxLib.jl/pull/136 for some background context. The main motivation for me is to avoid code duplication between CPU and GPU versions. However, if you look at the benchmark comment on that PR (for batchnorm and groupnorm), you will see a 10x-40x slowdown of KA relative to the equivalent optimized loop version (note that the loop version simply uses @simd or @simd ivdep, nothing like LoopVectorization).
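For concreteness, the kind of comparison described above can be sketched with a toy elementwise operation. This is a minimal sketch, not the actual LuxLib kernels; the function names (`scale_ka!`, `scale_loop!`) are hypothetical, but the KernelAbstractions calls (`@kernel`, `@index`, `get_backend`, `ndrange`) are the standard API:

```julia
using KernelAbstractions

# KA version: the same kernel runs on CPU and GPU backends.
@kernel function scale_kernel!(y, x, α)
    i = @index(Global, Linear)
    @inbounds y[i] = α * x[i]
end

function scale_ka!(y, x, α)
    backend = get_backend(x)
    scale_kernel!(backend)(y, x, α; ndrange = length(x))
    KernelAbstractions.synchronize(backend)
    return y
end

# Hand-written CPU baseline: a plain loop annotated with @simd,
# which is the "optimized loop version" being compared against.
function scale_loop!(y, x, α)
    @inbounds @simd for i in eachindex(y, x)
        y[i] = α * x[i]
    end
    return y
end
```

Benchmarking these two with BenchmarkTools.jl on small arrays is where the threading and missing-@simd overheads show up.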

I think there are a couple of reasons for the slowdown:

  1. @simd annotations are missing (which causes slowdown even in the loop version if I remove the annotations)
  2. threading has overhead for some of the smaller problems

Potential solutions:

  1. Allow users to control threading (#507). For smaller problems, I want to opt out of threading manually.
  2. @simd annotations (#436 seems to do this; I am not sure of its status).
  3. Alternate threading: KA is being used inside "core" operations, so it is unlikely (if not impossible) that we will call other operations that themselves use threading. Hence, an option to use "cheaper threads" (Polyester.jl) would be a great addition.
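To illustrate point 3, the "cheaper threads" idea refers to Polyester.jl's lightweight `@batch` loop threading, which avoids spawning full Julia tasks. A minimal sketch of what a Polyester-backed CPU path could look like (the function name is hypothetical; `@batch` is Polyester's actual macro):

```julia
using Polyester

# Polyester's @batch partitions the iteration space across a pool of
# lightweight threads with much lower launch overhead than Threads.@spawn,
# which matters for the small problem sizes mentioned above.
function scale_polyester!(y, x, α)
    @batch for i in eachindex(y, x)
        @inbounds y[i] = α * x[i]
    end
    return y
end
```

The caveat (which motivates restricting this to "core" operations) is that Polyester's thread pool does not compose with nested task-based threading the way Julia's scheduler does.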
vchuravy commented 4 weeks ago

Does #436 speed things up for you? I haven't merged it since I haven't seen an impact on benchmarks.

avik-pal commented 3 weeks ago

Let me run the benchmarks with that branch and check.

EDIT: doesn't seem to help

vchuravy commented 3 weeks ago

So the question would be where the overheads are coming from; maybe you can run a profile and compare where the time is being spent. You could also use static scheduling.
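For reference, static scheduling on the CPU backend can be requested via the backend constructor. A minimal sketch, assuming the `static` keyword of `CPU` (present in recent KernelAbstractions releases); it partitions the `ndrange` up front instead of going through Julia's dynamic task scheduler, which may reduce per-launch overhead for small kernels:

```julia
using KernelAbstractions

@kernel function add_one!(x)
    i = @index(Global, Linear)
    @inbounds x[i] += 1
end

x = zeros(Float32, 1024)
# static = true: fixed work partitioning, no dynamic task scheduling.
backend = CPU(; static = true)
add_one!(backend)(x; ndrange = length(x))
KernelAbstractions.synchronize(backend)
```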

Could you isolate a particular case where you are seeing these overheads?

maleadt commented 1 day ago

A MWE would be very useful, as the upcoming POCL CPU back-end may be interesting for performance, but hasn't been benchmarked.

avik-pal commented 1 day ago

Oops, this fell off my radar; I will create a self-contained example this week.

avik-pal commented 23 hours ago

(Independent of this) once POCL is ready and I can access a CI server with it, I can trigger the benchmark suite for NN primitives in LuxLib to generate results at https://luxdl.github.io/LuxLib.jl/benchmarks/