I added a directory with two multithreaded examples for Julia:
one uses the low-level built-in Threads.@threads macro;
the other uses the FLoops.jl package, which makes writing multi-threaded reductions simpler and less error-prone (you don't have to do the index arithmetic by hand, and enabling SIMD instructions is much simpler).
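To illustrate the point about reductions, here is a minimal sketch of what a FLoops.jl kernel for this problem could look like. It is not necessarily the code in this PR: the function name `pi_floops` and the midpoint-rule formulation are assumptions made for the example, and it requires the FLoops.jl package to be installed. The number of threads is whatever Julia was started with (for example `julia --threads 4`).

```julia
using FLoops

# Hypothetical kernel: midpoint-rule estimate of the integral of 4/(1+x^2)
# over [0, 1], which equals pi. The name and layout are illustrative only.
function pi_floops(n::Int)
    step = 1.0 / n
    # @floop distributes the iterations over the available threads;
    # @reduce declares `s` as the reduction variable, so there is no
    # manual chunking and no per-thread accumulator to manage.
    @floop for i in 1:n
        x = (i - 0.5) * step
        @reduce(s += 4.0 / (1.0 + x * x))
    end
    return s * step
end

println("Obtained value of PI: ", pi_floops(10_000_000))
```

The loop body reads like the serial version; only the `@floop` and `@reduce` annotations mark it as a parallel reduction.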
Before timing, I make a dummy call to the kernel with a small number of steps to warm up, so that Julia's JIT compilation is not included in the measurement.
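As a rough sketch of the warm-up pattern combined with the Threads.@threads approach, something along these lines would work. The function name `pi_threads`, the chunking scheme and the use of `time()` for timing are assumptions for illustration, not necessarily what the scripts in this PR do.

```julia
using Base.Threads

# Hypothetical kernel: midpoint-rule estimate of the integral of 4/(1+x^2)
# over [0, 1], split into one chunk of slices per thread.
function pi_threads(n::Int)
    step = 1.0 / n
    nchunks = nthreads()
    partial = zeros(Float64, nchunks)   # one slot per chunk, so no data race
    @threads for c in 1:nchunks
        # Manual index arithmetic: chunk c covers slices lo:hi.
        lo = div((c - 1) * n, nchunks) + 1
        hi = div(c * n, nchunks)
        s = 0.0
        @simd for i in lo:hi
            x = (i - 0.5) * step
            s += 4.0 / (1.0 + x * x)
        end
        partial[c] = s
    end
    return sum(partial) * step
end

pi_threads(10)                          # warm-up call: triggers JIT compilation

t0 = time()
estimate = pi_threads(10_000_000)
elapsed = time() - t0

println("Obtained value of PI: ", estimate)
println("Time taken: ", elapsed, " seconds")
```

Each chunk writes only to its own slot of `partial`, and the partial sums are combined after the threaded loop has finished.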
Benchmarks on Myriad:
$ OMP_NUM_THREADS=1 ./run.sh
Calculating PI using:
1000000000 slices
1 thread(s)
Obtained value of PI: 3.1415926535898455
Time taken: 0.8806970119476318 seconds
$ OMP_NUM_THREADS=18 ./run.sh
Calculating PI using:
1000000000 slices
18 thread(s)
Obtained value of PI: 3.141592653589797
Time taken: 0.05102109909057617 seconds
$ OMP_NUM_THREADS=1 ./run_floops.sh
# Wait for installation of packages.......
Calculating PI using:
1000000000 slices
1 thread(s)
Obtained value of PI: 3.1415926535898437
Time taken: 0.8845429420471191 seconds
$ OMP_NUM_THREADS=18 ./run_floops.sh
# Go and brew another cup of coffee........
Calculating PI using:
1000000000 slices
18 thread(s)
Obtained value of PI: 3.14159265358979
Time taken: 0.05775904655456543 seconds
For comparison, this is the benchmark of the Fortran+OpenMP example compiled with ifort:
$ OMP_NUM_THREADS=1 ./run.sh
rm -f *.o pi
make -f Makefile.intel
make[1]: Entering directory `/lustre/home/cceamgi/repo/pi_examples/fortran_omp_pi_dir'
ifort -O2 -xHost -o pi -fopenmp pi.f90
make[1]: Leaving directory `/lustre/home/cceamgi/repo/pi_examples/fortran_omp_pi_dir'
Calculating PI using:
1000000000 slices
1 OpenMP threads
Obtained value of PI: 3.1415926536
Time taken: 0.87537 seconds
$ OMP_NUM_THREADS=18 ./run.sh
rm -f *.o pi
make -f Makefile.intel
make[1]: Entering directory `/lustre/home/cceamgi/repo/pi_examples/fortran_omp_pi_dir'
ifort -O2 -xHost -o pi -fopenmp pi.f90
make[1]: Leaving directory `/lustre/home/cceamgi/repo/pi_examples/fortran_omp_pi_dir'
Calculating PI using:
1000000000 slices
18 OpenMP threads
Obtained value of PI: 3.1415926536
Time taken: 0.05136 seconds
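To put the three sets of timings side by side, the speedup and parallel efficiency on 18 threads can be worked out with a few lines of Julia. The times are copied from the runs above; the helper names `speedup` and `efficiency` are only illustrative, and single runs like these are subject to noise.

```julia
# Wall-clock times in seconds, copied from the runs above (1 thread, 18 threads).
timings = [
    ("Julia Threads.@threads", 0.8806970119476318, 0.05102109909057617),
    ("Julia FLoops.jl",        0.8845429420471191, 0.05775904655456543),
    ("Fortran+OpenMP (ifort)", 0.87537,            0.05136),
]
nthreads_used = 18

# Speedup is the 1-thread time divided by the n-thread time;
# efficiency is the speedup divided by the number of threads.
speedup(t1, tn) = t1 / tn
efficiency(t1, tn, n) = speedup(t1, tn) / n

for (name, t1, tn) in timings
    s = speedup(t1, tn)
    e = efficiency(t1, tn, nthreads_used)
    println(rpad(name, 24), " speedup: ", round(s, digits = 2),
            "  efficiency: ", round(100 * e, digits = 1), " %")
end
```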