Enable OpenMP for Eigen

Luke-Pratley commented 8 years ago

Now that I have started to run tests on the new workstation, I noticed that it is not running much faster than on a laptop.

However, it has 24 cores with two threads each, so it could gain a lot from enabling OpenMP in Eigen. It does not look too complicated, but is it too early?

http://eigen.tuxfamily.org/dox/TopicMultiThreading.html

Also, would it have to be enabled in sopt separately?

mdavezac commented 8 years ago

If you've got working examples, it could be time to start benchmarking. I've written some benchmarking stuff in the benchmarking directory of sopt if you want one way to go about it rationally!

Luke-Pratley commented 8 years ago

Okay cool, I will take a look at that.

Luke-Pratley commented 8 years ago

I have taken a look at the benchmark folder in sopt, and it seems straightforward. I know that benchmarking is testing performance on a given computer. But what are we actually measuring to quantify the performance?

Luke-Pratley commented 8 years ago

While OpenMP has now been added to cmake in both Purify and Sopt, it looks like there is no performance boost after compiling with gcc. This is the case even when making sure that OpenMP is actually used at run time: OMP_NUM_THREADS=n ./my_program

adrianjhpc commented 8 years ago

If you can provide details on the setup you used I can have a look and see what's going on.

cheers

adrianj

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

mdavezac commented 8 years ago

It measures the time that it takes to run the code in the while loop, just as you have done manually before. If you want, you can also add the number of items or bytes that each iteration works with. E.g. you can specify the number of pixels, and get back the number of pixels computed per second... Depends on what the task at hand is, really.
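
A minimal sketch of the idea described above, using only the C++ standard library rather than the benchmarking framework in sopt: run a body a fixed number of times, time the loop, and report items processed per second. The names items_per_second and sum_pixels are hypothetical and are not part of sopt.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of sopt): runs `body` `iterations` times and
// returns the number of items processed per second of wall-clock time.
template <class Body>
double items_per_second(Body body, std::size_t items_per_iteration,
                        std::size_t iterations) {
  assert(iterations > 0);
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iterations; ++i) body();
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  if (elapsed.count() <= 0) return 0;  // clock too coarse for this workload
  return static_cast<double>(items_per_iteration * iterations) /
         elapsed.count();
}

// Hypothetical workload: sums the values of an image stored as a flat vector.
double sum_pixels(const std::vector<double> &pixels) {
  return std::accumulate(pixels.begin(), pixels.end(), 0.0);
}
```

Calling items_per_second with a lambda that invokes sum_pixels on a vector of N pixels, and passing N as items_per_iteration, gives pixels per second.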

Luke-Pratley commented 8 years ago

That is a good point, would the sopt benchmarks run differently if openmp is used for Eigen?

adrianjhpc commented 8 years ago

Which branch is this OpenMP stuff?

Luke-Pratley commented 8 years ago

The only code that should use OpenMP is the Eigen library. It has been added to cmake in https://github.com/astro-informatics/purify/tree/cpp-gridding-refactor, and https://github.com/astro-informatics/sopt/tree/development-c-and-cpp . Though, you have to turn the OpenMP option on in https://github.com/astro-informatics/sopt/blob/development-c-and-cpp/CMakeLists.txt .

So far, I can not see Eigen actually using OpenMP when you turn it on. I made a simple parallel loop, and OpenMP seemed to work for that.

The only part of non-Eigen code that I would want to use OpenMP for is to build a sparse matrix, but I have not coded that yet.
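
For reference, a simple parallel loop of the kind mentioned above might look like the sketch below. It is an illustration only, not the loop that was actually tested; the function names are hypothetical. The pragma is ignored when the compiler is run without OpenMP support (-fopenmp for gcc), so the code gives the same result either way.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Number of threads OpenMP would use for a parallel region,
// or 1 if the code was built without OpenMP support.
int max_openmp_threads() {
#ifdef _OPENMP
  const int n = omp_get_max_threads();
  assert(n >= 1);
  return n;
#else
  return 1;
#endif
}

// Sum of squares computed with an OpenMP reduction over the loop.
double sum_of_squares(const std::vector<double> &x) {
  double total = 0;
  const long n = static_cast<long>(x.size());
#pragma omp parallel for reduction(+ : total)
  for (long i = 0; i < n; ++i) total += x[i] * x[i];
  return total;
}
```

Printing max_openmp_threads() at start-up is a quick way to confirm that OMP_NUM_THREADS is being picked up.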

adrianjhpc commented 8 years ago

I'm being dense, but is Eigen an external library or is the source code in the git?

thanks

adrianj

Luke-Pratley commented 8 years ago

Eigen is an external library that deals with linear algebra, i.e. operations on matrices. Some of what Eigen does can be made parallel using OpenMP. This link, http://eigen.tuxfamily.org/dox/TopicMultiThreading.html, suggests that it should take minimal effort to turn it on.
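
One way to check whether Eigen has picked up OpenMP is to ask it how many threads it intends to use. The sketch below relies on Eigen::nbThreads(), which returns 1 when Eigen was compiled without OpenMP. The SKETCH_HAS_EIGEN guard and the function name are assumptions added here so that the snippet also builds on a machine where the Eigen headers are not on the include path. Note that the Eigen page linked above lists only a limited set of operations that can use multiple threads, general dense matrix-matrix products among them, so a thread count above 1 does not mean every expression will speed up.

```cpp
#include <cassert>
#if defined(__has_include)
#if __has_include(<Eigen/Core>)
#include <Eigen/Core>
#define SKETCH_HAS_EIGEN 1  // hypothetical macro, local to this sketch
#endif
#endif

// Number of threads Eigen will use for the operations it can parallelise.
// Returns 0 if the Eigen headers were not found at compile time.
int eigen_thread_count() {
#ifdef SKETCH_HAS_EIGEN
  const int n = Eigen::nbThreads();
  assert(n >= 1);
  return n;
#else
  return 0;
#endif
}
```

If this reports 1 on a build where OpenMP is meant to be on, the OpenMP flag is probably not reaching the translation units that include Eigen.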

adrianjhpc commented 8 years ago

Thanks.

I notice there is a lot of python in the cmake build, is the python used by sopt at runtime, or is it used to pre/post process data? I ask because I want to run this on a compute node on a production system, so if the python is required at run time I'll need to install the relevant modules on the compute node, otherwise I can just do it all on the login node.

cheers

adrianj

mdavezac commented 8 years ago

You can disable the python bindings by running in the build directory:

cmake -Dpython=OFF .

Eigen should be downloaded automatically if it cannot already be found on your system.

adrianjhpc commented 8 years ago

Thanks. I've built sopt. Are there any instructions for running a representative benchmark?

cheers

adrianj

mdavezac commented 8 years ago

For sopt, you can run the tests from the build directory with ctest or make test, or individually by calling the executables in build/cpp/tests/. There are a few benchmarks that can be run with make benchmarks or by running the executables in build/cpp/benchmarks. I'm not sure those are the ones that @Luke-Pratley is referring to.

Before running benchmarks, make sure that the code is compiled with cmake -DCMAKE_BUILD_TYPE=Release or cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo, or face a slowdown of at least an order of magnitude.

adrianjhpc commented 8 years ago

Hi,

What do the different numbers represent in the benchmark results:

matrix_cgsopt::t_complex/1/4 462 ns 461 ns 1422677 33.0899MB/s

I assume the second-to-last is the number of repetitions and the last is some measure of data processed. What are the two timing numbers?

cheers

adrianj

adrianjhpc commented 8 years ago

If I'm reading the results correctly I am getting some speed-up with OpenMP, especially at the larger data sizes:

adrianj@eslogin006:~/CompressiveSensing/build/cpp/benchmarks> export OMP_NUM_THREADS=1
adrianj@eslogin006:~/CompressiveSensing/build/cpp/benchmarks> ./conjugate_gradient
Run on (32 X 2600.12 MHz CPU s)
2016-05-18 14:42:07

Benchmark  Time  CPU  Iterations
matrix_cgsopt::t_complex/1/4  462 ns  461 ns  1422677  33.0899MB/s
matrix_cgsopt::t_complex/1/8  452 ns  452 ns  1548577  33.7542MB/s
matrix_cgsopt::t_complex/1/12  456 ns  457 ns  1548577  33.3728MB/s
matrix_cgsopt::t_complex/8/4  1997 ns  2000 ns  349979  61.0277MB/s
matrix_cgsopt::t_complex/8/8  2034 ns  2023 ns  349977  60.3378MB/s
matrix_cgsopt::t_complex/8/12  2004 ns  2000 ns  349979  61.0277MB/s
matrix_cgsopt::t_complex/64/4  233549 ns  233359 ns  3017  4.18481MB/s
matrix_cgsopt::t_complex/64/8  231904 ns  231839 ns  2916  4.21225MB/s
matrix_cgsopt::t_complex/64/12  237333 ns  236011 ns  3017  4.13779MB/s
matrix_cgsopt::t_complex/256/4  12617150 ns  12582600 ns  55  317.899kB/s
matrix_cgsopt::t_complex/256/8  12670344 ns  12655327 ns  55  316.072kB/s
matrix_cgsopt::t_complex/256/12  12738743 ns  12715071 ns  56  314.587kB/s
matrix_cgsopt::t_real/1/4  429 ns  429 ns  1650842  17.7883MB/s
matrix_cgsopt::t_real/1/8  428 ns  427 ns  1620269  17.8626MB/s
matrix_cgsopt::t_real/1/12  430 ns  430 ns  1635415  17.7222MB/s
matrix_cgsopt::t_real/8/4  822 ns  821 ns  833284  74.3515MB/s
matrix_cgsopt::t_real/8/8  879 ns  878 ns  874945  69.53MB/s
matrix_cgsopt::t_real/8/12  830 ns  828 ns  874945  73.7556MB/s
matrix_cgsopt::t_real/64/4  55236 ns  55274 ns  13461  8.83379MB/s
matrix_cgsopt::t_real/64/8  55695 ns  55688 ns  12499  8.76816MB/s
matrix_cgsopt::t_real/64/12  56005 ns  56008 ns  12499  8.71807MB/s
matrix_cgsopt::t_real/256/4  2819567 ns  2813183 ns  246  710.938kB/s
matrix_cgsopt::t_real/256/8  2859138 ns  2845707 ns  246  702.813kB/s
matrix_cgsopt::t_real/256/12  2830512 ns  2832176 ns  250  706.171kB/s
function_cgsopt::t_complex/1/4  540 ns  538 ns  1249933  28.3799MB/s
function_cgsopt::t_complex/1/8  543 ns  541 ns  1249911  28.2115MB/s
function_cgsopt::t_complex/1/12  539 ns  538 ns  1346076  28.3677MB/s
function_cgsopt::t_complex/8/4  2121 ns  2120 ns  330168  57.5731MB/s
function_cgsopt::t_complex/8/8  2114 ns  2116 ns  336517  57.6912MB/s
function_cgsopt::t_complex/8/12  2117 ns  2108 ns  330168  57.904MB/s
function_cgsopt::t_complex/64/4  230422 ns  230633 ns  3070  4.23427MB/s
function_cgsopt::t_complex/64/8  223806 ns  222752 ns  3017  4.38408MB/s
function_cgsopt::t_complex/64/12  222200 ns  222168 ns  3241  4.39561MB/s
function_cgsopt::t_complex/256/4  12105000 ns  12069724 ns  58  331.408kB/s
function_cgsopt::t_complex/256/8  12011668 ns  11931776 ns  58  335.239kB/s
function_cgsopt::t_complex/256/12  11969081 ns  11931776 ns  58  335.239kB/s
function_cgsopt::t_real/1/4  504 ns  502 ns  1346076  15.191MB/s
function_cgsopt::t_real/1/8  500 ns  500 ns  1000000  15.2578MB/s
function_cgsopt::t_real/1/12  504 ns  502 ns  1346076  15.191MB/s
function_cgsopt::t_real/8/4  907 ns  904 ns  760828  67.4918MB/s
function_cgsopt::t_real/8/8  898 ns  894 ns  760820  68.2851MB/s
function_cgsopt::t_real/8/12  898 ns  894 ns  760820  68.2851MB/s
function_cgsopt::t_real/64/4  55669 ns  55368 ns  12499  8.81886MB/s
function_cgsopt::t_real/64/8  57608 ns  57264 ns  11666  8.52686MB/s
function_cgsopt::t_real/64/12  55157 ns  55048 ns  12499  8.87012MB/s
function_cgsopt::t_real/256/4  2811393 ns  2813183 ns  246  710.938kB/s
function_cgsopt::t_real/256/8  2821580 ns  2816176 ns  250  710.183kB/s
function_cgsopt::t_real/256/12  2814477 ns  2813183 ns  246  710.938kB/s

adrianj@eslogin006:~/CompressiveSensing/build/cpp/benchmarks> export OMP_NUM_THREADS=4
adrianj@eslogin006:~/CompressiveSensing/build/cpp/benchmarks> ./conjugate_gradient
Run on (32 X 2600.12 MHz CPU s)
2016-05-18 14:44:22

Benchmark  Time  CPU  Iterations
matrix_cgsopt::t_complex/1/4  461 ns  460 ns  1548577  33.1853MB/s
matrix_cgsopt::t_complex/1/8  458 ns  458 ns  1346076  33.3413MB/s
matrix_cgsopt::t_complex/1/12  480 ns  480 ns  1534990  31.8215MB/s
matrix_cgsopt::t_complex/8/4  2040 ns  2029 ns  343115  60.1746MB/s
matrix_cgsopt::t_complex/8/8  2026 ns  2016 ns  357121  60.5433MB/s
matrix_cgsopt::t_complex/8/12  1989 ns  1982 ns  343115  61.5905MB/s
matrix_cgsopt::t_complex/64/4  140625 ns  280748 ns  2465  3.47843MB/s
matrix_cgsopt::t_complex/64/8  143167 ns  285616 ns  2465  3.41914MB/s
matrix_cgsopt::t_complex/64/12  141520 ns  282915 ns  2333  3.45179MB/s
matrix_cgsopt::t_complex/256/4  3951178 ns  15721930 ns  43  254.422kB/s
matrix_cgsopt::t_complex/256/8  3769963 ns  15044413 ns  46  265.879kB/s
matrix_cgsopt::t_complex/256/12  3890023 ns  15511163 ns  49  257.879kB/s
matrix_cgsopt::t_real/1/4  467 ns  466 ns  1534990  16.3552MB/s
matrix_cgsopt::t_real/1/8  457 ns  455 ns  1521647  16.7753MB/s
matrix_cgsopt::t_real/1/12  477 ns  475 ns  1508526  16.0732MB/s
matrix_cgsopt::t_real/8/4  877 ns  874 ns  833274  69.857MB/s
matrix_cgsopt::t_real/8/8  871 ns  870 ns  795409  70.1515MB/s
matrix_cgsopt::t_real/8/12  858 ns  855 ns  833284  71.4275MB/s
matrix_cgsopt::t_real/64/4  36078 ns  72006 ns  9722  6.7811MB/s
matrix_cgsopt::t_real/64/8  36078 ns  72006 ns  9722  6.7811MB/s
matrix_cgsopt::t_real/64/12  37001 ns  74063 ns  9722  6.59274MB/s
matrix_cgsopt::t_real/256/4  1024208 ns  4090650 ns  177  488.92kB/s
matrix_cgsopt::t_real/256/8  923726 ns  3681080 ns  188  543.319kB/s
matrix_cgsopt::t_real/256/12  951992 ns  3810763 ns  190  524.829kB/s
function_cgsopt::t_complex/1/4  626 ns  624 ns  1166589  24.45MB/s
function_cgsopt::t_complex/1/8  635 ns  633 ns  1093682  24.1145MB/s
function_cgsopt::t_complex/1/12  629 ns  631 ns  1166608  24.1847MB/s
function_cgsopt::t_complex/8/4  2167 ns  2157 ns  330168  56.6028MB/s
function_cgsopt::t_complex/8/8  2143 ns  2136 ns  324053  57.1601MB/s
function_cgsopt::t_complex/8/12  2266 ns  2251 ns  318162  54.2398MB/s
function_cgsopt::t_complex/64/4  142670 ns  284630 ns  2333  3.43099MB/s
function_cgsopt::t_complex/64/8  141829 ns  283705 ns  2397  3.44217MB/s
function_cgsopt::t_complex/64/12  141908 ns  283145 ns  2430  3.44898MB/s
function_cgsopt::t_complex/256/4  4250315 ns  16889933 ns  45  236.827kB/s
function_cgsopt::t_complex/256/8  3923766 ns  15556556 ns  45  257.126kB/s
function_cgsopt::t_complex/256/12  4169982 ns  16609717 ns  46  240.823kB/s
function_cgsopt::t_real/1/4  548 ns  548 ns  1000000  13.9214MB/s
function_cgsopt::t_real/1/8  544 ns  544 ns  1346076  14.0288MB/s
function_cgsopt::t_real/1/12  557 ns  556 ns  1346076  13.7287MB/s
function_cgsopt::t_real/8/4  946 ns  944 ns  729121  64.6791MB/s
function_cgsopt::t_real/8/8  970 ns  967 ns  760820  63.0895MB/s
function_cgsopt::t_real/8/12  957 ns  955 ns  729121  63.9357MB/s
function_cgsopt::t_real/64/4  36656 ns  73241 ns  9722  6.6668MB/s
function_cgsopt::t_real/64/8  36556 ns  72829 ns  9722  6.70448MB/s
function_cgsopt::t_real/64/12  36992 ns  74063 ns  9722  6.59274MB/s
function_cgsopt::t_real/256/4  991639 ns  3955045 ns  177  505.683kB/s
function_cgsopt::t_real/256/8  983011 ns  3929653 ns  170  508.951kB/s
function_cgsopt::t_real/256/12  936271 ns  3720661 ns  186  537.539kB/s

The data number (MB/s) looks lower in the 4-thread run, but I think that might be because you're dividing by CPU time rather than run time...

cheers

adrianj
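
The guess about CPU time can be checked against the figures posted above. The sketch below assumes that the matrix_cg t_real/256/4 case processes 2048 bytes per iteration (256 doubles of 8 bytes each) and that kB means 1024 bytes. Both are inferences that happen to reproduce the reported numbers; neither is confirmed from the benchmark source. The function name is hypothetical.

```cpp
#include <cassert>

// Throughput in kB/s (assuming 1 kB = 1024 bytes) from the bytes handled per
// iteration and the time per iteration in nanoseconds.
double throughput_kb_per_s(double bytes, double nanoseconds) {
  assert(nanoseconds > 0);
  return bytes / (nanoseconds * 1e-9) / 1024.0;
}
```

With the 1-thread CPU time of 2813183 ns this gives about 710.94 kB/s, and with the 4-thread CPU time of 4090650 ns about 488.92 kB/s, both matching the reported column. Using the 4-thread real time of 1024208 ns instead gives roughly 1953 kB/s, which is in line with the real-time speed-up of about 2.75 for that case (2819567 ns down to 1024208 ns).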

Luke-Pratley commented 8 years ago

I think the data sets I was trying might have been too small. I will check later and let you know.

adrianjhpc commented 8 years ago

I committed a couple of changes to the benchmarks which should make them more accurate.

mdavezac commented 8 years ago

cpu time vs real time: http://service.futurequest.net/index.php?/Knowledgebase/Article/View/407

adrianjhpc commented 8 years ago

Real time is what you want for these codes. If you're spending a lot of time in system time, it's still costing you run time and you want to know about it. And if you're using multiple threads, then CPU time does not give you the correct time (it generally gives you the summed time of all the threads).

Real time might not give you good benchmark data if you're running on a shared system, with other stuff running whilst you are benchmarking, but you shouldn't be benchmarking like that anyway, as it will not give you accurate results.
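
The difference between the two clocks can be seen by measuring both around the same piece of work. This is a sketch using only the C++ standard library; Timings and time_both are hypothetical names. It assumes a POSIX system, where std::clock reports processor time for the whole process, so the CPU figure is summed over all threads.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Elapsed real (wall-clock) time and process CPU time, both in seconds.
struct Timings {
  double wall_seconds;
  double cpu_seconds;
};

// Runs `work` once and measures it with both clocks.
template <class Work>
Timings time_both(Work work) {
  const std::clock_t cpu_start = std::clock();
  const auto wall_start = std::chrono::steady_clock::now();
  work();
  const std::chrono::duration<double> wall =
      std::chrono::steady_clock::now() - wall_start;
  const std::clock_t cpu_end = std::clock();
  assert(wall.count() >= 0);
  return Timings{wall.count(),
                 static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC};
}
```

For single-threaded work the two numbers should be close. For work spread over four busy threads, cpu_seconds can approach four times wall_seconds, which is the pattern in the 4-thread results posted earlier.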

mdavezac commented 8 years ago

That's fine. At this point, the benchmarks are more for demonstration purposes. I have not actually used them much or thought about it sufficiently. Certainly for threaded apps, we will have to use real time.

jasonmcewen commented 7 years ago

Can this issue be closed?

astro-informatics / purify

Enable OpenMP for Eigen #35
