aws / aws-graviton-getting-started

Helping developers to use AWS Graviton2, Graviton3, and Graviton4 processors which power the 6th, 7th, and 8th generation of Amazon EC2 instances (C6g[d], M6g[d], R6g[d], T4g, X2gd, C6gn, I4g, Im4gn, Is4gen, G5g, C7g[d][n], M7g[d], R7g[d], R8g).
https://aws.amazon.com/ec2/graviton/

Quant calculations are taking more time than expected #214

Closed: tohwsw closed this issue 2 years ago

tohwsw commented 2 years ago

Hi, we were benchmarking Graviton (c6g.2xlarge) against non-Graviton (c5.2xlarge), and the calculation appears to be slower on Graviton than on its non-Graviton counterpart. The setup:

Compiled C++ code from https://www.quantstart.com/articles/Asian-option-pricing-with-C-via-Monte-Carlo-Methods/
GCC 11.2.1
Amazon Linux 2022

Here are the flame graphs of the two runs, attached as perf-c6g and perf-c5.

From the graphs it seems the function calc_path_spot_prices is taking more time on Graviton. I had a look and realised the function uses exp in its calculations. Is the math library not optimized on ARM? How can we optimize the math routines?

Thanks for your help.

sebpop commented 2 years ago

Please add clear steps on how to reproduce your performance measurements.

The flame graphs do not carry useful information: very few functions appear with the same name in both profiles, so they cannot be compared directly. linux-perf needs to measure only the compute process, not the whole system; one of the flame graphs shows more than 90% of the time in do_idle.
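For example, a measurement that records only the compute process would look roughly like the commands below. This is a sketch: the asian.cpp and pricing names are placeholders for whatever the sources and makefile actually produce.

  # build with optimization (on Graviton, adding -mcpu=native is commonly recommended)
  g++ -O3 asian.cpp -o pricing
  # record call stacks for this process only (no -a / system-wide mode)
  perf record -g ./pricing
  perf report

By contrast, perf record -a profiles the whole system and also captures the idle time of the other cores, which is likely where the 90% in do_idle comes from.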

tohwsw commented 2 years ago

Hi Sebastian, I'm uploading the code here with a makefile included. The program is short, so it takes about 2-3 s to complete. quantasianoptionpricing.zip

sebpop commented 2 years ago

Most of the time spent on c6g is in the __random function.

  49.21%  pricing  libc-2.31.so       [.] __random
  19.15%  pricing  pricing            [.] gaussian_box_muller
  14.11%  pricing  libm-2.31.so       [.] exp@@GLIBC_2.29

gaussian_box_muller has a loop that iterates a random number of times based on the output of rand():

  do {
    x = 2.0 * rand() / static_cast<double>(RAND_MAX) - 1;
    y = 2.0 * rand() / static_cast<double>(RAND_MAX) - 1;
    euclid_sq = x*x + y*y;
  } while (euclid_sq >= 1.0);

The total amount of time reported by the program depends on the output of rand(). This is not something we should be looking at with linux-perf.
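As a sketch of how the sampling could be made both faster and deterministic, assuming the intent is simply to draw standard normals, one option is to seed a C++11 engine explicitly and use std::normal_distribution instead of the rand()-based rejection loop. This is an illustration with a placeholder workload, not the code from the attached zip:

  #include <cmath>
  #include <cstdio>
  #include <random>

  int main() {
    // Fixed seed: every run draws the same sequence of normals,
    // so the amount of work is identical across machines.
    std::mt19937_64 rng(42);
    std::normal_distribution<double> gauss(0.0, 1.0);

    // Placeholder workload standing in for the path generation in
    // calc_path_spot_prices: accumulate exp() of scaled Gaussian draws.
    double acc = 0.0;
    for (long i = 0; i < 10'000'000; ++i) {
      acc += std::exp(0.01 * gauss(rng));
    }
    std::printf("acc = %f\n", acc);
    return 0;
  }

With a fixed seed the amount of work per run is identical, so timings become comparable across instances, and the profile should shift out of __random and into the actual pricing math (exp and the path loop).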