UCL-RITS / pi_examples

A lot of ways to run the same way of calculating pi. Some of them are dumb.
Creative Commons Zero v1.0 Universal

[julia-ipu] Do manual loop unrolling for better performance in some cases #18

Closed: giordano closed this 1 year ago

giordano commented 1 year ago

On Mandelbrot we have a znver2 CPU, and Julia isn't able to do aggressive loop unrolling when targeting the host. Luckily, we can do manual loop unrolling quite easily with the Base.Cartesian.@nexprs macro:

[cceamgi@mandelbrot julia_ipu_pi_dir]$ julia pi.jl 1
[ Info: Trying to attach to device 0...
[ Info: Successfully attached to device 0
✓ Compiling codelet Pi:          Time: 0:00:04
Calculating PI using:
  4294966272 slices
  1472 IPU tiles
  loop unroll factor 1
Obtained value of PI: 3.1499734
Time taken: 0.1325 seconds (245093526 cycles at 1.85 GHz)
[cceamgi@mandelbrot julia_ipu_pi_dir]$ julia pi.jl 2
[ Info: Trying to attach to device 0...
[ Info: Successfully attached to device 0
✓ Compiling codelet Pi:          Time: 0:00:04
Calculating PI using:
  4294966272 slices
  1472 IPU tiles
  loop unroll factor 2
Obtained value of PI: 3.1499734
Time taken: 0.0899 seconds (166313574 cycles at 1.85 GHz)
[cceamgi@mandelbrot julia_ipu_pi_dir]$ julia pi.jl 4
[ Info: Trying to attach to device 0...
[ Info: Successfully attached to device 0
✓ Compiling codelet Pi:          Time: 0:00:04
Calculating PI using:
  4294966272 slices
  1472 IPU tiles
  loop unroll factor 4
Obtained value of PI: 3.1499734
Time taken: 0.08517 seconds (157560252 cycles at 1.85 GHz)
[cceamgi@mandelbrot julia_ipu_pi_dir]$ julia pi.jl 8
[ Info: Trying to attach to device 0...
[ Info: Successfully attached to device 0
✓ Compiling codelet Pi:          Time: 0:00:04
Calculating PI using:
  4294966272 slices
  1472 IPU tiles
  loop unroll factor 8
Obtained value of PI: 3.1499734
Time taken: 0.0828 seconds (153183588 cycles at 1.85 GHz)
[cceamgi@mandelbrot julia_ipu_pi_dir]$ julia pi.jl 12
[ Info: Trying to attach to device 0...
[ Info: Successfully attached to device 0
✓ Compiling codelet Pi:          Time: 0:00:04
Calculating PI using:
  4294966272 slices
  1472 IPU tiles
  loop unroll factor 12
Obtained value of PI: 3.1499734
Time taken: 0.07492 seconds (138594708 cycles at 1.85 GHz)
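
For readers unfamiliar with Base.Cartesian.@nexprs, the manual unrolling idea can be sketched on the CPU roughly as follows. This is not the IPU codelet from this PR: the function name pi_unrolled, the Float64 midpoint rule, and the fixed unroll factor of 4 are assumptions made for illustration (@nexprs needs a literal integer at macro-expansion time, so the factor is hard-coded here).

```julia
using Base.Cartesian: @nexprs

# Hypothetical CPU-only sketch, not the PR's codelet:
# midpoint-rule integration of 4 / (1 + x^2) over [0, 1],
# with the loop body manually unrolled 4 times via @nexprs.
function pi_unrolled(n::Int)
    # For simplicity, require n to be a multiple of the unroll factor.
    n % 4 == 0 || throw(ArgumentError("n must be a multiple of 4"))
    h = 1.0 / n
    s = 0.0
    for i in 1:4:n
        # @nexprs pastes the body 4 times, replacing j with 1, 2, 3, 4.
        @nexprs 4 j -> begin
            x = (i + j - 1 - 0.5) * h   # midpoint of slice i + j - 1
            s += 4.0 / (1.0 + x * x)
        end
    end
    return s * h
end
```

Calling pi_unrolled(1_000_000) should return a value close to π. Whether unrolling like this actually helps depends on the target and on what the compiler already does, which is why the PR measures several factors.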

Note: this requires https://github.com/JuliaRegistries/General/pull/87432 to be merged first, which should happen in a couple of days.

giordano commented 1 year ago

IPUToolkit.jl is now registered, so this is ready to go.