damonge / CoLoRe

CoLoRe - Cosmological Lofty Realizations
GNU General Public License v3.0

Compile CoLoRe at NERSC with KNL architecture #54

Closed andreufont closed 4 years ago

andreufont commented 4 years ago

Starting today, the KNL nodes in Cori will be almost half the price of the Haswell nodes.

So far I have always run on Haswell nodes, but I wonder whether it would be worth it in the long term to compile and test the code on KNL nodes as well. The code may be slower there, but there is a support team that helps with profiling on the new architecture; see below.

--- email from NERSC on January 14th 2020 ---

Need Help Switching to Cori KNL Nodes? KNL Office Hours on Fridays All Month!

NERSC will hold virtual office hours over Zoom from 9:00 am to 3:00 pm Pacific Time for every Friday in January, including this Friday (January 17), to help users get their codes running efficiently on the Cori KNL nodes.

For many users, running efficiently on the KNL nodes is as simple as making sure that their job script is set to request the proper thread affinity on the node, and their executable is compiled correctly to exploit the KNL architecture. We have seen a performance gap shrink by a factor of 7 just with these two simple steps.

Other user codes may require some relatively straightforward code changes (for example, a loop reordering to exploit vectorization). Profiling the code is the first step towards finding these hot spots or bottlenecks.

During the KNL Office Hours, NERSC experts will be on hand to help you take these steps. Please (virtually) drop by for help with

Setting up your job script for proper thread affinity
Compiling your code with the best optimization flags
Getting started with profiling your code
Interpreting the results of profiling, and advice on how to proceed

A podcast from May provides additional information about the office hours.

Join online at https://lbnl.zoom.us/j/943079374 or view the event on the NERSC Public Events calendar for full connection information.

damonge commented 4 years ago

Never tried KNL myself. When I tried this type of architecture in the deep past, you had to know very well how to structure things in a vectorizable way (i.e. it's not just a matter of compiling the code, you need to rewrite parts of it). Still, it's probably the future, so this would be a good idea. @fjaviersanchez, have you worked with these?

fjaviersanchez commented 4 years ago

I managed to run CoLoRe on KNL nodes in the past, but it was really slow (partly because I used the same, or almost the same, compilation as for Haswell). To get good performance, some parts of the code probably need to be rewritten, as @damonge said. I can run some tests at some point this week if that would be useful.

andreufont commented 4 years ago

There is really no urgency here; we are not planning to run many simulations in the near term.

I just wanted to document this option for the future.

fjaviersanchez commented 4 years ago

I managed to compile on KNL and it's better than I remembered, but there is room for improvement. The bottom line is that KNL is ~4x slower. Using the same config file on 1 KNL node (272 threads) and 1 Haswell node (64 threads), I get the following:

Haswell

Creating Fourier-space density and Newtonian potential 
>    Relative time ellapsed 402.0 ms
Transforming density and Newtonian potential
>    Relative time ellapsed 370.4 ms
Normalizing density and Newtonian potential 
>    Relative time ellapsed 40.9 ms
 <d>=1.417E-11, <d^2>=6.533E-01

*** Creating physical matter density
Lognormalizin'
>    Relative time ellapsed 101.8 ms

*** Computing normalization of density field
z=0.000E+00, <d^2_0>=1.023E+00, 00000000000
z=3.760E-02, <d^2_0>=1.023E+00, 00000179080
z=8.054E-02, <d^2_0>=9.870E-01, 00001201000
z=1.286E-01, <d^2_0>=1.007E+00, 00003111032
z=1.779E-01, <d^2_0>=9.993E-01, 00005766488
z=2.276E-01, <d^2_0>=1.001E+00, 00009033384
z=2.775E-01, <d^2_0>=9.999E-01, 00012780408
z=3.275E-01, <d^2_0>=1.003E+00, 00016906456
z=3.775E-01, <d^2_0>=9.964E-01, 00021299152
z=4.017E-01, <d^2_0>=9.964E-01, 00021752168
>    Relative time ellapsed 80.4 ms

*** Getting point sources
 0-th galaxy population
   Poisson-sampling
   There will be 1406503 objects in total 
   Assigning coordinates
>    Relative time ellapsed 1221.2 ms

*** Re-distributing sources across nodes
>    Relative time ellapsed 27.5 ms

*** Getting LOS information
Communication 0, Node 0 is now Node 0
>    Relative time ellapsed 1666.3 ms

*** Writing kappa source maps
>    Relative time ellapsed 25.2 ms

*** Writing shear source maps
>    Relative time ellapsed 78145.8 ms

*** Writing source catalogs
 0-th population (FITS)
>    Relative time ellapsed 167.7 ms
|-------------------------------------------------|

>    Total time ellapsed 84106.6 ms

KNL

*** Creating Gaussian density field 
Creating Fourier-space density and Newtonian potential 
>    Relative time ellapsed 582.3 ms
Transforming density and Newtonian potential
>    Relative time ellapsed 16254.1 ms
Normalizing density and Newtonian potential 
>    Relative time ellapsed 6.6 ms
 <d>=3.365E-12, <d^2>=6.535E-01

*** Creating physical matter density
Lognormalizin'
>    Relative time ellapsed 86.7 ms

*** Computing normalization of density field
z=0.000E+00, <d^2_0>=9.665E-01, 00000000000
z=3.760E-02, <d^2_0>=9.666E-01, 00000179080
z=8.054E-02, <d^2_0>=9.910E-01, 00001201000
z=1.286E-01, <d^2_0>=1.007E+00, 00003111032
z=1.779E-01, <d^2_0>=1.000E+00, 00005766488
z=2.276E-01, <d^2_0>=1.001E+00, 00009033384
z=2.775E-01, <d^2_0>=1.001E+00, 00012780408
z=3.275E-01, <d^2_0>=1.001E+00, 00016906456
z=3.775E-01, <d^2_0>=9.967E-01, 00021299152
z=4.017E-01, <d^2_0>=9.967E-01, 00021752168
>    Relative time ellapsed 136.8 ms

*** Getting point sources
 0-th galaxy population
   Poisson-sampling
   There will be 1401256 objects in total 
   Assigning coordinates
>    Relative time ellapsed 695.9 ms

*** Re-distributing sources across nodes
>    Relative time ellapsed 198.4 ms

*** Getting LOS information
Communication 0, Node 0 is now Node 0
>    Relative time ellapsed 1424.2 ms

*** Writing kappa source maps
>    Relative time ellapsed 38.4 ms

*** Writing shear source maps
>    Relative time ellapsed 297547.3 ms

*** Writing source catalogs
 0-th population (FITS)
>    Relative time ellapsed 872.4 ms

|-------------------------------------------------|

>    Total time ellapsed 320184.3 ms
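Timings like those in the two logs above can be pulled out automatically rather than read by eye. The following is a minimal sketch, assuming the log keeps the `*** ` section headers and the `>    Relative time ellapsed ... ms` lines exactly as pasted here; the function name `section_times` and the sample text are illustrative and not part of CoLoRe.

```python
import re

# Matches lines such as ">    Relative time ellapsed 582.3 ms"
# (spelling as printed in the logs above). The final
# "Total time ellapsed" line is deliberately not matched.
TIMING = re.compile(r"^>\s*Relative time ellapsed\s+([0-9.]+)\s*ms")


def section_times(log_text):
    """Sum the relative timings found under each '*** ' section header."""
    totals = {}
    section = "(before first header)"
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("***"):
            # New section: drop the leading asterisks and spaces.
            section = line.lstrip("* ").strip()
            continue
        match = TIMING.match(line)
        if match:
            totals[section] = totals.get(section, 0.0) + float(match.group(1))
    return totals


# A short excerpt of the KNL log above, used only as sample input.
sample = """\
*** Creating Gaussian density field
Creating Fourier-space density and Newtonian potential
>    Relative time ellapsed 582.3 ms
Transforming density and Newtonian potential
>    Relative time ellapsed 16254.1 ms
Normalizing density and Newtonian potential
>    Relative time ellapsed 6.6 ms

*** Writing shear source maps
>    Relative time ellapsed 297547.3 ms

>    Total time ellapsed 320184.3 ms
"""

times = section_times(sample)
for name, ms in times.items():
    print(f"{name}: {ms:.1f} ms")
```

Timings that appear before any `*** ` header (as at the start of the Haswell log above) are collected under a placeholder key instead of being dropped.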

So basically a lot of the steps are similarly fast (or faster) on KNL thanks to the larger number of threads (64 on Haswell vs 272 on KNL), but the slowest steps are way slower on KNL...

Update: I made another run with a larger dr_shear (last time it was 2 Mpc, which is why the process was rather slow; this time I used 30, which is a more "realistic" case). The slowdown is still ~4x.
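For the first run (the two logs above), the per-step slowdown can be tabulated directly from the reported times. This is a rough sketch: the numbers are copied from the logs, while the short step labels are made up here for readability and do not come from CoLoRe.

```python
# Per-step times in ms, copied from the Haswell and KNL logs above.
# The dictionary keys are illustrative labels, not CoLoRe names.
haswell = {
    "fourier_density": 402.0,
    "fft_transform": 370.4,
    "normalize": 40.9,
    "lognormal": 101.8,
    "density_norm": 80.4,
    "point_sources": 1221.2,
    "redistribute": 27.5,
    "los_info": 1666.3,
    "kappa_maps": 25.2,
    "shear_maps": 78145.8,
    "catalogs": 167.7,
}
knl = {
    "fourier_density": 582.3,
    "fft_transform": 16254.1,
    "normalize": 6.6,
    "lognormal": 86.7,
    "density_norm": 136.8,
    "point_sources": 695.9,
    "redistribute": 198.4,
    "los_info": 1424.2,
    "kappa_maps": 38.4,
    "shear_maps": 297547.3,
    "catalogs": 872.4,
}
total_haswell = 84106.6  # "Total time ellapsed" on Haswell
total_knl = 320184.3     # "Total time ellapsed" on KNL

# Ratio > 1 means the step is slower on KNL.
ratios = {step: knl[step] / haswell[step] for step in haswell}

# Print steps from largest to smallest slowdown, with each step's
# share of the total KNL run time.
for step, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    share = knl[step] / total_knl
    print(f"{step:16s} {ratio:6.2f}x   {share:6.1%} of KNL total")

total_ratio = total_knl / total_haswell
print(f"{'total':16s} {total_ratio:6.2f}x")
```

With these numbers the overall ratio comes out at about 3.8x. The FFT step has the largest individual slowdown (roughly 44x), but writing the shear source maps, at about 3.8x slower, accounts for around 93% of the KNL run time and therefore sets the overall figure.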

damonge commented 4 years ago

OK, very useful. So, FFTs are definitely slower on KNL. There may be an easy fix, such as linking with the right libraries there, so maybe this is not that hard. Some of the other stuff has to do with I/O, so I'm not sure we can do much about that.

fjaviersanchez commented 4 years ago

OK, I think we can close this issue for now, since I compiled the code on KNL and it works, but feel free to reopen :)