I am interested to talk to you about this work you did. I really appreciate this note. It's a lot of hard work to get all this information collected and written up as simply and clearly as it is here. "It is hardly obvious from this output what one should do to improve performance of the code." This is the key statement, and there are rarely obvious answers. Whoever wrote this note, I assume Brian, I am curious to talk to you about these results and how / if it is possible to demystify the measurements of code execution. I have no funding for this at the moment, but can easily justify discussing the issue of tools and dissecting the performance of computer experiments on certain architectures / platforms. Regards, -k
Hi @kennyroche - thanks for the feedback, happy to chat more about this. My e-mail is bfriesen@lbl.gov if you want to continue this discussion there.
Thank you, @bcfriesen for this great document. It will be very helpful.
Given the 4 categories of performance data collection, how would you proceed: which techniques would you use, and how would you put them together into some sort of assessment?
I assume there could be a few steps:
1) Define goals/experiments (for example strong or weak scaling), do the runs, and show the performance from a few different angles, in a way that explains the performance beyond just the overall times and shows the bottlenecks.
2) Build in the performance assessment capability first, so that when we do the runs it is scripted somehow to generate the plots automatically.
3) Iterate, trying these things with different goals/tools, etc.
4) Once we figure out what is most useful and presentable, add it as a CI capability.
Do you have a performance assessment report or document anywhere? If not, I might be able to ask Kenny for one, as I know he has done these before.
BTW, I propose to use TDLG for the assessment, since it is a readily available problem and I don't think we have fully explored it yet.
One last thing -- I am worried that we can't run at scale on Cori, though, so your side would be limited to a single node or a few nodes and what we can show there.
Thanks again, Brian,
Christine
@kennyroche would you have an example performance assessment report I could look at to get an idea of what is typically done and scope, etc? Thanks.
Given the 4 categories of performance data collection, how would you proceed: which techniques would you use, and how would you put them together into some sort of assessment?
I assume there could be a few steps:
1) Define goals/experiments (for example strong or weak scaling), do the runs, and show the performance from a few different angles, in a way that explains the performance beyond just the overall times and shows the bottlenecks. 2) Build in the performance assessment capability first, so that when we do the runs it is scripted somehow to generate the plots automatically. 3) Iterate, trying these things with different goals/tools, etc. 4) Once we figure out what is most useful and presentable, add it as a CI capability.
I think this is a reasonable set of steps, and describes what many ECP code developers already do.
Do you have a performance assessment report or document anywhere? If not, I might be able to ask Kenny for one, as I know he has done these before.
I don't have anything 'formal' to share, but have done this exercise many times in the context of NESAP. There is a common list of things we evaluate:
Most of those topics translate naturally to ExaRL too; these tasks should be straightforward to do since the code runtime is dominated by hand-written CUDA kernels (i.e., not TensorFlow).
That sounds good. I would think we want to know those things at various levels of scale, though probably not at large scale on Cori. For example, what is the time spent in computation vs MPI as you scale? I would like to define the scaling experiments and run both weak and strong scaling, and do that for the CPU vs GPU TDLG as well.
I am also interested in what we can do with the timers and how that factors into the performance assessment. I'd like something that gives us insight into how the learning is working -- how busy is the learner (master) vs the actors (slaves) -- so we can see when we might be close to saturating the learner.
Then of course we want to see how fast the learning converges as we scale, which could be different for the CPU vs GPU. Malachi had a graph showing strong scaling and the convergence, but the learning converges pretty fast for TDLG.
What do you think could be the timeline for this? I assume once you have a methodology for doing the assessment on Cori, we can replicate on Summit. Thanks and have a good weekend.
What do you think could be the timeline for this? I assume once you have a methodology for doing the assessment on Cori, we can replicate on Summit. Thanks and have a good weekend.
It's hard to predict a timeline, as this effort requires some engineering I have done infrequently or not at all before.
Below is something like the 'program of work' that would be required, at least for Cori; some of these steps may not be available on Summit:
1) Identify the tools required to generate profiling data.
2) Configure a GitLab runner to pull the latest version of the code and run it once per day. I can do this on Cori GPU in around 1 day. This will require moving ExaRL outside of Shifter images, which is a minor change, and should have little impact on job startup time since all of the shared libraries on which ExaRL depends would continue to be inside the Shifter image.
3) Add some 'post-processing' scripts to the daily GitLab runner which extract relevant data from the various tools described in step 1, such that it can be presented in a simple way on, e.g., a website (see the sketch after this list). Both TAU and Nsight Systems have options for emitting performance data in semi-machine-readable formats that will make this easier to parse with a script, although I suspect it will still require some noodling around with regex to extract the most interesting information. I imagine this will take 2 weeks.
4) Configure a place to store the data collected in step 3. Ideally it would be a website which tracks the performance of the code as a function of its git commit history. A natural place for this on Cori is Spin, where one can configure a simple web server to display the data in a human-readable way. This is probably 2-3 weeks of work.
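For step 3, a rough sketch of the kind of post-processing script I have in mind is below; the regex and the sample report lines are made up, and the real pattern would have to be written against actual TAU or Nsight Systems output:
import re

# Illustrative pattern only: assumes report lines that look like "  123.4 ms  kernel_name"
line_pattern = re.compile(r"^\s*(?P<time_ms>[\d.]+)\s+ms\s+(?P<name>\S+)", re.MULTILINE)

def extract_timings(report_text):
    """Return a dict mapping routine/kernel name to time in milliseconds."""
    return {m.group("name"): float(m.group("time_ms"))
            for m in line_pattern.finditer(report_text)}

# Example with made-up report lines:
sample = "   123.4 ms  cudaMemcpyAsync\n     8.7 ms  LibTDLG_step\n"
print(extract_timings(sample))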
It seems like 3 would be a generally useful tool, particularly if it can be configured to read a list of regex terms to match (if I understand the intent), so I could then repurpose it for any workflow?
Yes I think so, although it would be tied somewhat to the tool used to collect the data (e.g., TAU's output looks different than HPCToolkit or Score-P, even though they may be measuring the same things).
Sure. I imagine the regex list itself would be the list of relevant kernels/subroutines for a particular code, and mostly agnostic to the tool.
Neat! I had not heard of PyPOP before. I'll check this out.
One thing I forgot to add in the 'program of work' described above is identifying a representative problem that can be used for, e.g., an automated nightly benchmark that accurately tracks the performance of the parts of the code that this group cares about. I will bring this up at the call tomorrow morning (Nov 12).
I now have a bot working which uses NERSC's GitLab CI infrastructure to run once a day. It checks out the latest version of the code, and runs the following problem using 1 MPI rank and 1 GPU:
driver/driver_example.py \
--output_dir ${output_dir} \
--env ExaLearnBlockCoPolymerTDLG-v3 \
--n_episodes 100 \
--n_steps 10 \
--learner_type async \
--agent DQN-v0 \
--model_type LSTM
I see warning messages like this in the output though:
2020-11-12 19:25:54,067 - WARNING - Training will not be done because this instance is not set to learn.
2020-11-12 19:25:54,067 - WARNING - Weights will not be updated because this instance is not set to learn.
Please let me know if there is a "better" problem config to run. This one takes quite a while, more than half an hour using 1 MPI rank and 1 GPU. That's not a problem for the GitLab runner, but it might cause some headaches when gathering profiling information since most profilers struggle to handle long-running codes.
The next step is adding TAU and Nsight Systems to the runner so that it can start collecting performance data.
Can you run with more than one MPI rank (at least 2)? This will trigger the use of the async learner.
Yep sure can - so the async learning requires > 1 MPI tasks, got it. Thanks!
While I make progress gathering real data, I generated some fake data in order to make this plot. The x-axis is the time that each git commit was made, in units of UNIX time (i.e., seconds since 1970 or whatever it is). There is probably a better way to plot dates in matplotlib, but I don't know how to do it, being a rather lousy Python programmer.
The x-axis data is real - those are the actual times of the git commits - while the y-axis data is randomly generated.
Does anyone have any feedback about this? Anything from "the plot is too busy" to "why are you using UNIX time format" is useful.
The idea is that there will be a website somewhere (hosted in Spin, which is NERSC's Kubernetes cluster for doing things like hosting science gateways, websites, etc.) which auto-updates this figure each time the GitLab runner runs with the newest version of the code. Then one could simply go to the site to find some performance data about the code as a function of time. One could imagine that each git commit of the code could be a link that one could click in order to go to a page which has more performance data about the code, perhaps the output from nsys profile --stats=true or something like that.
For the plot, I would think relative performance to a baseline rather than fraction of execution time is more useful, so initial commit is 1.0 for all components. Then I can see which components are improving (or regressing). Also, I would imagine being able to click on a component and link to a page where I can view that performance in various ways (speedup, absolute execution time, etc) would be nice too. Basically if I think of commit version, component, etc as dimensions then I can flatten any one of those as desired?
For the plot, I would think relative performance to a baseline rather than fraction of execution time is more useful, so initial commit is 1.0 for all components. Then I can see which components are improving (or regressing).
Agreed, this is a good idea. I will redo the plot shortly with this change.
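For reference, the normalization itself should be simple with pandas; a minimal sketch with placeholder data and a hypothetical 'runtime' column:
import pandas as pd

# Sketch: normalize each metric to its value at the first (baseline) commit,
# so the baseline commit plots at 1.0 and later commits show relative change.
df = pd.DataFrame({"runtime": [30.0, 28.5, 31.2]})      # placeholder measurements
df["runtime_relative"] = df["runtime"] / df["runtime"].iloc[0]
print(df)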
Also, I would imagine being able to click on a component and link to a page where I can view that performance in various ways (speedup, absolute execution time, etc) would be nice too. Basically if I think of commit version, component, etc as dimensions then I can flatten any one of those as desired?
Agreed, this will require a fair amount of web development which will be slow because I have very little experience doing web development.
I feel like we should be able to leverage existing dashboard apps rather than develop our own. IIRC Nvidia and others use the Chrome browser stuff to display their metrics; if we can just dump output to that format, would it work?
Yes that's a good point, I would like not to get bogged down in web development more than necessary. You're right that some tools can output to web-friendly formats automatically, like Nsight Systems.
Do you have a reference for the web-friendly formats that we can use?
Do you have a reference for the web-friendly formats that we can use?
I will generate one today and share it here.
I adjusted the above plot to show relative changes instead of absolute values, as you suggested. I still can't get the dates to convert correctly and it's become a slog, so I will leave it alone for now.
I will generate one today and share it here.
Never mind, this may not be possible to do as easily as expected - Nsight Systems can output profiles to a few different formats including JSON, but trying to convert the resulting JSON to XML (which a browser might be able to parse more easily) throws an error, at least when using the Python package json2xml. So we may have to parse the data another way to make it web-friendly.
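If the goal is just something web-friendly, one workaround (a sketch only; I haven't checked the actual schema nsys emits, so the field names and file layout below are guesses) would be to skip XML and flatten the JSON into a CSV with Python's standard library:
import csv
import json

# Sketch: flatten an Nsight Systems JSON export into a CSV that a web page can read.
# NOTE: the field names ("name", "durationNs") are placeholders, and the export may be
# either a single JSON document or newline-delimited JSON, so both cases are attempted.
with open("profile.json") as f:
    text = f.read()
try:
    data = json.loads(text)
    records = data if isinstance(data, list) else [data]
except json.JSONDecodeError:
    records = [json.loads(line) for line in text.splitlines() if line.strip()]

with open("profile.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["kernel", "duration_ns"])
    for rec in records:
        writer.writerow([rec.get("name"), rec.get("durationNs")])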
I adjusted the above plot to show relative changes instead of absolute values, as you suggested. I still can't get the dates to convert correctly and it's become a slog, so I will leave it alone for now.
If anybody wants to take a swing at converting the UNIX timestamps to "human" dates, you can reproduce the above plot as follows. First, go to the ExaRL git repository and run
git log --format="%h|%ct"
which will generate output like the following:
cori02:ExaRL.git> git log --format="%h|%ct"|head
a07c515|1605203937
6830ea1|1605034126
98d2106|1605033039
b2fa95f|1604598077
894e701|1604525991
31feef8|1604521905
a1525a6|1604521865
10481ac|1604521828
a8fbd29|1604521786
2889466|1604442974
The first column is the abbreviated git commit, the second is the timestamp of the commit in UNIX timestamp format. git does support a few other formats for dates (see git help log for details), so maybe we can generate the correct format directly from git rather than converting it afterwards using Python.
Anyway, if you pipe the output of that git log command to a file, then you can generate fake data using the following script, which assumes the contents of the above git log command are saved to a file called git-log-output.txt:
import numpy as np
import matplotlib.pyplot as plt
import datetime
from scipy.constants import golden
import pandas as pd
data = pd.read_csv('git-log-output.txt', delimiter="|", names=['hash', 'commit_date'])
# Fake (randomly generated) y-axis data; only the commit dates are real.
rng = np.random.default_rng()
percent_mpi = rng.standard_normal(len(data))*0.2
percent_memcpy = rng.standard_normal(len(data))*0.1
runtime = rng.standard_normal(len(data))*0.15

for date in data['commit_date']:
    commit_date = datetime.datetime.fromtimestamp(date)  # converted value is never used below, so the x-axis stays in UNIX time
fig, ax = plt.subplots(figsize=(11.0, 11.0/golden))
colors = plt.cm.viridis(np.linspace(0, 0.9, 3))
plt_mpi, = ax.plot(data['commit_date'], percent_mpi, linestyle=':', color=colors[0], label='% MPI')
plt_memcpy, = ax.plot(data['commit_date'], percent_memcpy, linestyle=':', color=colors[1], label='% D2H/H2D memcpy')
plt_runtime, = ax.plot(data['commit_date'], runtime, linestyle=':', color=colors[2], label='runtime (sec)')
ax.set_xlabel("commit date (UNIX time)")
ax.set_ylabel("relative change")
ax.grid(True)
plt.legend(handles=[plt_mpi, plt_memcpy, plt_runtime], loc='best')
fig.savefig('strawman-plot.png', dpi=300)
The problem is that the plots are using the raw data; create a new array with formatted dates:
new_date = []
for date in data['commit_date']:
    commit_date = datetime.datetime.fromtimestamp(date).strftime('%Y-%m-%d')
    new_date.append(commit_date)
Use that in the plots:
plt_mpi, = ax.plot(new_date, percent_mpi, linestyle=':', color=colors[0], label='% MPI')
plt_memcpy, = ax.plot(new_date, percent_memcpy, linestyle=':', color=colors[1], label='% D2H/H2D memcpy')
plt_runtime, = ax.plot(new_date, runtime, linestyle=':', color=colors[2], label='runtime (sec)')
Great! Thanks for your help @jmohdyusof, that works! Now the modified script is:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime
from scipy.constants import golden
import pandas as pd
data = pd.read_csv('fake-performance-data.txt', delimiter="|", names=['hash', 'commit_date'])
# Convert the UNIX timestamps to 'YYYY-MM-DD' strings for the x-axis
new_date = []
for date in data['commit_date']:
    commit_date = datetime.datetime.fromtimestamp(date).strftime('%Y-%m-%d')
    new_date.append(commit_date)
rng = np.random.default_rng()
percent_mpi = rng.standard_normal(len(data))*0.2
percent_memcpy = rng.standard_normal(len(data))*0.1
runtime = rng.standard_normal(len(data))*0.15
fig, ax = plt.subplots(figsize=(11.0, 11.0/golden))
colors = plt.cm.viridis(np.linspace(0, 0.9, 3))
months = mdates.MonthLocator()
days = mdates.DayLocator()
plt_mpi, = ax.plot(new_date, percent_mpi, linestyle=':', color=colors[0], label='% MPI')
plt_memcpy, = ax.plot(new_date, percent_memcpy, linestyle=':', color=colors[1], label='% D2H/H2D memcpy')
plt_runtime, = ax.plot(new_date, runtime, linestyle=':', color=colors[2], label='runtime (sec)')
ax.set_xlabel("commit date")
ax.set_ylabel("relative change")
fig.autofmt_xdate()
ax.xaxis.set_major_locator(months)
ax.xaxis.set_minor_locator(days)
ax.grid(True)
plt.legend(handles=[plt_mpi, plt_memcpy, plt_runtime], loc='best')
fig.savefig('strawman-plot.png', dpi=300)
and the resulting plot is:
So is this kind of plot something that would be useful to see? In the CI workflow described above, it would be updated every day after the runner checks out the latest version of the code.
Thanks, @bcfriesen. Yes, this would be very useful.
Background
A goal for ExaRL is to design a performance assessment framework that can be used to track the performance of the code over time and during its development. This can be a challenging task for ExaRL due to its reliance on TensorFlow, which is a complex framework that in turn relies on other complex GPU-accelerated libraries like cuDNN.
The standard approach of profiling the code as a 'black box' using a tool like NVIDIA Nsight Compute or Nsight Systems does not yield very useful results when the code relies heavily on TensorFlow, which launches millions or billions of GPU kernels in a typical run, and many of those kernels are highly tuned kernels from libraries like NVIDIA cuDNN or Eigen. An example of this problem, drawn from a TensorFlow example code for classifying images, is shown below:
The GPU kernel statistics as reported by Nsight Systems are shown below:
It is hardly obvious from this output what one should do to improve performance of the code.
So we must adjust this approach in order to be able to collect actionable performance data about ExaRL. Most likely, we will need to use a combination of 'black box' profiling tools and domain-specific tools which have some awareness of what kinds of calculations the code is doing.
Collecting performance data
We can use a combination of performance analysis tools to understand the overall performance characteristics of the code. A few components and proposals are described below.
Timers
These are simple to implement and can summarize strong and weak scaling behavior easily.
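For illustration, a minimal sketch of such a timer (this is not ExaRL's actual timer API) that could record per-phase times for scaling plots:
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timer(label):
    """Accumulate wall-clock time spent inside a labeled region."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - t0

# Usage sketch:
with timer("train"):
    time.sleep(0.1)   # stand-in for a training step
print(timings)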
MPI analysis
ExaRL uses MPI for inter-node communication. MPI performance is straightforward to measure, including characteristics like time spent at barriers, load imbalance, etc. Many tools can measure these quantities via sampling of each MPI task; this activity has low overhead, and can typically be used for even high-MPI-concurrency runs. Open source tools like TAU and HPCToolkit can be used for this, along with several other proprietary tools like Arm MAP. So far we have been using TAU on the Cori GPU cluster at NERSC, with reasonably good results. The jumpshot GUI shows results like the following, and the pprof analysis tool shows quantitative results:
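As a lightweight complement to those tools, the MPI fraction can also be hand-timed directly in the code; a minimal mpi4py sketch on a toy loop (not ExaRL code):
from mpi4py import MPI
import time
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

t_mpi = 0.0
t_total = -time.perf_counter()
local = np.random.rand(1_000_000)
for _ in range(10):
    local = np.sqrt(local) + 1.0                         # "compute" phase
    t0 = time.perf_counter()
    total = comm.allreduce(local.sum(), op=MPI.SUM)      # communication phase, timed separately
    t_mpi += time.perf_counter() - t0
t_total += time.perf_counter()

if rank == 0:
    print(f"fraction of wall time in MPI: {t_mpi / t_total:.1%}")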
GPU performance
Nsight Systems can be useful for understanding data movement between CPU <-> GPU during the run, and also for gaining insight into any GPU kernels which are not part of TensorFlow. It also uses sampling to collect data, and thus can be used as a 'black box' profiling tool with relatively low overhead; since a typical ExaRL calculation is quite long, it is best to disable CPU sampling in Nsight Systems by adding the -s none flag; otherwise, the resulting profile will be enormous and will take hours to process. An example profile using Nsight Systems is shown below. In this case, it looks like the majority of GPU activity is spent executing TDLG kernels, in which case Nsight Compute could possibly be used to improve performance.
Nsight Compute can be useful for tuning hand-written kernels, like LibTDLG, but it is much less useful when the runtime is dominated by TensorFlow activity.
TensorFlow performance
TensorFlow includes its own profiling framework which, unlike 'black box' profiling tools, has significant domain-specific awareness about what the calculation is doing. It is likely we will need to rely on this to supplement the above tools if the goal is to improve the performance of the TensorFlow portions of ExaRL.
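For example, in TensorFlow 2.x the profiler can be wrapped around a region of interest roughly like this (the log directory name is arbitrary):
import tensorflow as tf

# Sketch: wrap the region of interest with TensorFlow's built-in profiler.
# The resulting trace can then be inspected in TensorBoard's Profile tab.
tf.profiler.experimental.start("logs/exarl_profile")
# ... run the TensorFlow portion of the training step here ...
tf.profiler.experimental.stop()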
Representing and tracking performance data
It will be useful to track the performance characteristics of ExaRL as the code base develops. One way to do this is to integrate ExaRL into the ECP GitLab continuous integration infrastructure which is already available at NERSC. We could configure a GitLab runner to launch once per day, check out the latest version of the code, run it, and store the performance characteristics of the code in a file or database that is then visualized on a website. The website can be hosted in Spin.
Some characteristics like execution time, or fraction of time spent in MPI, can be represented simply with timings, and plotted on a graph as a function of git commit or date, like here. Other characteristics which may have more complex information, like the output from Nsight Systems or Nsight Compute or TensorFlow Profiler, may require a different approach for visualizing the data as a function of time.
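For the simple scalar quantities, the storage could be as basic as appending one row per nightly run to a CSV file that the website reads; a sketch in which the file name, metric names, and values are all placeholders:
import csv
import datetime
import pathlib

# Sketch: append one row of summary metrics per nightly run to a CSV file that a
# simple web page (e.g., hosted in Spin) could read and plot over time.
row = {
    "commit": "a07c515",                     # placeholder git hash
    "date": datetime.date.today().isoformat(),
    "runtime_sec": 1830.0,                   # placeholder measurements
    "pct_mpi": 12.5,
    "pct_memcpy": 4.2,
}

path = pathlib.Path("performance-history.csv")
write_header = not path.exists()
with path.open("a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=row.keys())
    if write_header:
        writer.writeheader()
    writer.writerow(row)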