I am interested to talk to you about this work you did. I really appreciate this note. It's a lot of hard work to get all this information collected and written up as simply and clearly as it is here. "It is hardly obvious from this output what one should do to improve performance of the code." This is the key statement, and there are rarely obvious answers. Whoever wrote this note, I assume Brian, I am curious to talk to you about these results and how / if it is possible to demystify the measurements of code execution. I have no funding for this at the moment, but can easily justify discussing the issue of tools and dissecting the performance of computer experiments on certain architectures / platforms. Regards, -k
Hi @kennyroche - thanks for the feedback, happy to chat more about this. My e-mail is bfriesen@lbl.gov if you want to continue this discussion there.
Thank you, @bcfriesen for this great document. It will be very helpful.
Given the 4 categories of performance data collection, how would you proceed: which techniques would you use, and how would you put them together into some sort of assessment?
I assume there could be a few steps:
1) Define goals/experiments (for example strong or weak scaling), do the runs, and show the performance from a few different angles, in a way that explains the performance beyond just the overall times and shows the bottlenecks.
2) Build in the performance assessment capability first, so that when we do the runs it is scripted somehow to generate the plots automatically.
3) Iterate, trying these things with different goals/tools, etc.
4) Once we figure out what is most useful and presentable, add it as a CI capability.
Do you have a performance assessment report or document anywhere? If not, I might be able to ask Kenny for one, as I know he has done these before.
BTW, I propose to use TDLG for the assessment, since it is a readily available problem and I don't think we have fully explored it yet.
One last thing -- I am worried that we can't run at scale on Cori, though, so your side would be limited to a single node or a few nodes and what we can show there.
Thanks again, Brian,
Christine
@kennyroche would you have an example performance assessment report I could look at to get an idea of what is typically done and scope, etc? Thanks.
Given the 4 categories of performance data collection, how would you proceed: which techniques would you use, and how would you put them together into some sort of assessment?
I assume there could be a few steps:
1) Define goals/experiments (for example strong or weak scaling), do the runs, and show the performance from a few different angles, in a way that explains the performance beyond just the overall times and shows the bottlenecks. 2) Build in the performance assessment capability first, so that when we do the runs it is scripted somehow to generate the plots automatically. 3) Iterate, trying these things with different goals/tools, etc. 4) Once we figure out what is most useful and presentable, add it as a CI capability.
I think this is a reasonable set of steps, and describes what many ECP code developers already do.
Do you have a performance assessment report or document anywhere? If not, I might be able to ask Kenny for one, as I know he has done these before.
I don't have anything 'formal' to share, but have done this exercise many times in the context of NESAP. There is a common list of things we evaluate:
Most of those topics translate naturally to ExaRL too; these tasks should be straightforward to do since the code runtime is dominated by hand-written CUDA kernels (i.e., not TensorFlow).
That sounds good. I would think we want to know those things at various levels of scale, though probably not at large scale on Cori. For example, what is the time spent in computation vs MPI as you scale? I would like to define the scaling experiments and run both weak and strong scaling, and do that for the CPU vs GPU TDLG as well.
I am also interested in what we can do with the timers and how that factors into the performance assessment. I'd like something that gives us insight into how the learning is working -- how busy is the learner (master) vs the actors (slaves) -- so we can see when we might be close to saturating the learner.
Then of course we want to see how fast the learning converges as we scale, which could be different for the CPU vs GPU. Malachi had a graph showing strong scaling and the convergence, but the learning converges pretty fast for TDLG.
What do you think could be the timeline for this? I assume once you have a methodology for doing the assessment on Cori, we can replicate on Summit. Thanks and have a good weekend.
What do you think could be the timeline for this? I assume once you have a methodology for doing the assessment on Cori, we can replicate on Summit. Thanks and have a good weekend.
It's hard to predict a timeline, as this effort requires some engineering I have done infrequently or not at all before.
Below is something like the 'program of work' that would be required, at least for Cori; some of these steps may not be available on Summit:
1) Identify the tools required to generate profiling data.
2) Configure a GitLab runner to pull the latest version of the code and run it once per day. I can do this on Cori GPU in around 1 day. This will require moving ExaRL outside of Shifter images, which is a minor change, and should have little impact on job startup time since all of the shared libraries on which ExaRL depends would continue to be inside the Shifter image.
3) Add some 'post-processing' scripts to the daily GitLab runner which extract relevant data from the various tools described in step 1, such that it can be presented in a simple way on, e.g., a website (see the sketch after this list). Both TAU and Nsight Systems have options for emitting performance data in semi-machine-readable formats that will make this easier to parse with a script, although I suspect it will still require some noodling around with regex to extract the most interesting information. I imagine this will take 2 weeks.
4) Configure a place to store the data collected in step 3. Ideally it would be a website which tracks the performance of the code as a function of its git commit history. A natural place for this on Cori is Spin, where one can configure a simple web server to display the data in a human-readable way. This is probably 2-3 weeks of work.
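For step 3, a rough sketch of the kind of post-processing script I have in mind is below; the regex and the sample report lines are made up, and the real pattern would have to be written against actual TAU or Nsight Systems output:
import re

# Illustrative pattern only: assumes report lines that look like "  123.4 ms  kernel_name"
line_pattern = re.compile(r"^\s*(?P<time_ms>[\d.]+)\s+ms\s+(?P<name>\S+)", re.MULTILINE)

def extract_timings(report_text):
    """Return a dict mapping routine/kernel name to time in milliseconds."""
    return {m.group("name"): float(m.group("time_ms"))
            for m in line_pattern.finditer(report_text)}

# Example with made-up report lines:
sample = "   123.4 ms  cudaMemcpyAsync\n     8.7 ms  LibTDLG_step\n"
print(extract_timings(sample))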
It seems like 3 would be a generally useful tool, particularly if it can be configured to read a list of regex terms to match (if I understand the intent), so I could then repurpose it for any workflow?
Yes I think so, although it would be tied somewhat to the tool used to collect the data (e.g., TAU's output looks different than HPCToolkit or Score-P, even though they may be measuring the same things).
Sure. I imagine the regex list itself would be the list of relevant kernels/subroutines for a particular code, and mostly agnostic to the tool.
Neat! I had not heard of PyPOP before. I'll check this out.
One thing I forgot to add in the 'program of work' described above is identifying a representative problem that can be used for, e.g., an automated nightly benchmark that accurately tracks the performance of the parts of the code that this group cares about. I will bring this up at the call tomorrow morning (Nov 12).
I now have a bot working which uses NERSC's GitLab CI infrastructure to run once a day. It checks out the latest version of the code, and runs the following problem using 1 MPI rank and 1 GPU:
driver/driver_example.py \
--output_dir ${output_dir} \
--env ExaLearnBlockCoPolymerTDLG-v3 \
--n_episodes 100 \
--n_steps 10 \
--learner_type async \
--agent DQN-v0 \
--model_type LSTM
I see warning messages like this in the output though:
2020-11-12 19:25:54,067 - WARNING - Training will not be done because this instance is not set to learn.
2020-11-12 19:25:54,067 - WARNING - Weights will not be updated because this instance is not set to learn.
Please let me know if there is a "better" problem config to run. This one takes quite a while, more than half an hour using 1 MPI rank and 1 GPU. That's not a problem for the GitLab runner, but it might cause some headaches when gathering profiling information since most profilers struggle to handle long-running codes.
The next step is adding TAU and Nsight Systems to the runner so that it can start collecting performance data.
Can you run with more than one MPI rank (at least 2)? This will trigger the use of the async learner.
Yep sure can - so the async learning requires > 1 MPI tasks, got it. Thanks!
While I make progress gathering real data, I generated some fake data in order to make this plot. The x-axis is the time that each git commit was made, in units of UNIX time (i.e., seconds since 1970 or whatever it is). There is probably a better way to plot dates in matplotlib, but I don't know how to do it, being a rather lousy Python programmer.
The x-axis data is real - those are the actual times of the git commits - while the y-axis data is randomly generated.
Does anyone have any feedback about this? Anything from "the plot is too busy" to "why are you using UNIX time format" is useful.
The idea is that there will be a website somewhere (hosted in Spin, which is NERSC's Kubernetes cluster for doing things like hosting science gateways, websites, etc.) which auto-updates this figure each time the GitLab runner runs with the newest version of the code. Then one could simply go to the site to find some performance data about the code as a function of time. One could imagine that each git commit of the code could be a link that one could click in order to go to a page which has more performance data about the code, perhaps the output from nsys profile --stats=true or something like that.
For the plot, I would think relative performance to a baseline rather than fraction of execution time is more useful, so initial commit is 1.0 for all components. Then I can see which components are improving (or regressing). Also, I would imagine being able to click on a component and link to a page where I can view that performance in various ways (speedup, absolute execution time, etc) would be nice too. Basically if I think of commit version, component, etc as dimensions then I can flatten any one of those as desired?
For the plot, I would think relative performance to a baseline rather than fraction of execution time is more useful, so initial commit is 1.0 for all components. Then I can see which components are improving (or regressing).
Agreed, this is a good idea. I will redo the plot shortly with this change.
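For reference, the normalization itself should be simple with pandas; a minimal sketch with placeholder data and a hypothetical 'runtime' column:
import pandas as pd

# Sketch: normalize each metric to its value at the first (baseline) commit,
# so the baseline commit plots at 1.0 and later commits show relative change.
df = pd.DataFrame({"runtime": [30.0, 28.5, 31.2]})      # placeholder measurements
df["runtime_relative"] = df["runtime"] / df["runtime"].iloc[0]
print(df)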
Also, I would imagine being able to click on a component and link to a page where I can view that performance in various ways (speedup, absolute execution time, etc) would be nice too. Basically if I think of commit version, component, etc as dimensions then I can flatten any one of those as desired?
Agreed, this will require a fair amount of web development which will be slow because I have very little experience doing web development.
I feel like we should be able to leverage existing dashboard apps rather than develop our own. IIRC Nvidia and others use the Chrome browser stuff to display their metrics; if we can just dump output to that format, would it work?
Yes that's a good point, I would like not to get bogged down in web development more than necessary. You're right that some tools can output to web-friendly formats automatically, like Nsight Systems.
Do you have a reference for the web-friendly formats that we can use?
Do you have a reference for the web-friendly formats that we can use?
I will generate one today and share it here.
I adjusted the above plot to show relative changes instead of absolute values, as you suggested. I still can't get the dates to convert correctly and it's become a slog, so I will leave it alone for now.
I will generate one today and share it here.
Never mind, this may not be possible to do as easily as expected - Nsight Systems can output profiles to a few different formats including JSON, but trying to convert the resulting JSON to XML (which a browser might be able to parse more easily) throws an error, at least when using the Python package json2xml. So we may have to parse the data another way to make it web-friendly.
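If the goal is just something web-friendly, one workaround (a sketch only; I haven't checked the actual schema nsys emits, so the field names and file layout below are guesses) would be to skip XML and flatten the JSON into a CSV with Python's standard library:
import csv
import json

# Sketch: flatten an Nsight Systems JSON export into a CSV that a web page can read.
# NOTE: the field names ("name", "durationNs") are placeholders, and the export may be
# either a single JSON document or newline-delimited JSON, so both cases are attempted.
with open("profile.json") as f:
    text = f.read()
try:
    data = json.loads(text)
    records = data if isinstance(data, list) else [data]
except json.JSONDecodeError:
    records = [json.loads(line) for line in text.splitlines() if line.strip()]

with open("profile.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["kernel", "duration_ns"])
    for rec in records:
        writer.writerow([rec.get("name"), rec.get("durationNs")])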
I adjusted the above plot to show relative changes instead of absolute values, as you suggested. I still can't get the dates to convert correctly and it's become a slog, so I will leave it alone for now.
If anybody wants to take a swing at converting the UNIX timestamps to "human" dates, you can reproduce the above plot as follows. First, go to the ExaRL git repository and run
git log --format="%h|%ct"
which will generate output like the following:
cori02:ExaRL.git> git log --format="%h|%ct"|head
a07c515|1605203937
6830ea1|1605034126
98d2106|1605033039
b2fa95f|1604598077
894e701|1604525991
31feef8|1604521905
a1525a6|1604521865
10481ac|1604521828
a8fbd29|1604521786
2889466|1604442974
The first column is the abbreviated git commit, the second is the timestamp of the commit in UNIX timestamp format. git does support a few other formats for dates (see git help log for details), so maybe we can generate the correct format directly from git rather than converting it afterwards using Python.
Anyway, if you pipe the output of that git log command to a file, then you can generate fake data using the following script, which assumes the contents of the above git log command are saved to a file called git-log-output.txt:
import numpy as np
import matplotlib.pyplot as plt
import datetime
from scipy.constants import golden
import pandas as pd
data = pd.read_csv('git-log-output.txt', delimiter="|", names=['hash', 'commit_date'])
# Fake (randomly generated) y-axis data; only the commit dates are real.
rng = np.random.default_rng()
percent_mpi = rng.standard_normal(len(data))*0.2
percent_memcpy = rng.standard_normal(len(data))*0.1
runtime = rng.standard_normal(len(data))*0.15

for date in data['commit_date']:
    commit_date = datetime.datetime.fromtimestamp(date)  # converted value is never used below, so the x-axis stays in UNIX time
fig, ax = plt.subplots(figsize=(11.0, 11.0/golden))
colors = plt.cm.viridis(np.linspace(0, 0.9, 3))
plt_mpi, = ax.plot(data['commit_date'], percent_mpi, linestyle=':', color=colors[0], label='% MPI')
plt_memcpy, = ax.plot(data['commit_date'], percent_memcpy, linestyle=':', color=colors[1], label='% D2H/H2D memcpy')
plt_runtime, = ax.plot(data['commit_date'], runtime, linestyle=':', color=colors[2], label='runtime (sec)')
ax.set_xlabel("commit date (UNIX time)")
ax.set_ylabel("relative change")
ax.grid(True)
plt.legend(handles=[plt_mpi, plt_memcpy, plt_runtime], loc='best')
fig.savefig('strawman-plot.png', dpi=300)
The problem is that the plots are using the raw data; create a new array with formatted dates:
new_date = []
for date in data['commit_date']:
    commit_date = datetime.datetime.fromtimestamp(date).strftime('%Y-%m-%d')
    new_date.append(commit_date)
Use that in the plots:
plt_mpi, = ax.plot(new_date, percent_mpi, linestyle=':', color=colors[0], label='% MPI')
plt_memcpy, = ax.plot(new_date, percent_memcpy, linestyle=':', color=colors[1], label='% D2H/H2D memcpy')
plt_runtime, = ax.plot(new_date, runtime, linestyle=':', color=colors[2], label='runtime (sec)')
Great! Thanks for your help @jmohdyusof, that works! Now the modified script is:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime
from scipy.constants import golden
import pandas as pd
data = pd.read_csv('fake-performance-data.txt', delimiter="|", names=['hash', 'commit_date'])
# Convert the UNIX timestamps to 'YYYY-MM-DD' strings for the x-axis
new_date = []
for date in data['commit_date']:
    commit_date = datetime.datetime.fromtimestamp(date).strftime('%Y-%m-%d')
    new_date.append(commit_date)
rng = np.random.default_rng()
percent_mpi = rng.standard_normal(len(data))*0.2
percent_memcpy = rng.standard_normal(len(data))*0.1
runtime = rng.standard_normal(len(data))*0.15
fig, ax = plt.subplots(figsize=(11.0, 11.0/golden))
colors = plt.cm.viridis(np.linspace(0, 0.9, 3))
months = mdates.MonthLocator()
days = mdates.DayLocator()
plt_mpi, = ax.plot(new_date, percent_mpi, linestyle=':', color=colors[0], label='% MPI')
plt_memcpy, = ax.plot(new_date, percent_memcpy, linestyle=':', color=colors[1], label='% D2H/H2D memcpy')
plt_runtime, = ax.plot(new_date, runtime, linestyle=':', color=colors[2], label='runtime (sec)')
ax.set_xlabel("commit date")
ax.set_ylabel("relative change")
fig.autofmt_xdate()
ax.xaxis.set_major_locator(months)
ax.xaxis.set_minor_locator(days)
ax.grid(True)
plt.legend(handles=[plt_mpi, plt_memcpy, plt_runtime], loc='best')
fig.savefig('strawman-plot.png', dpi=300)
and the resulting plot is:
So is this kind of plot something that would be useful to see? In the CI workflow described above, it would be updated every day after the runner checks out the latest version of the code.
Thanks, @bcfriesen. Yes, this would be very useful.
Background
A goal for ExaRL is to design a performance assessment framework that can be used to track the performance of the code over time and during its development. This can be a challenging task for ExaRL due to its reliance on TensorFlow, which is a complex framework that in turn relies on other complex GPU-accelerated libraries like cuDNN.
The standard approach of profiling the code as a 'black box' using a tool like NVIDIA Nsight Compute or Nsight Systems does not yield very useful results when the code relies heavily on TensorFlow, which launches millions or billions of GPU kernels in a typical run, and many of those kernels are highly tuned kernels from libraries like NVIDIA cuDNN or Eigen. An example of this problem, drawn from a TensorFlow example code for classifying images, is shown below:
The GPU kernel statistics as reported by Nsight Systems are shown below:
It is hardly obvious from this output what one should do to improve performance of the code.
So we must adjust this approach in order to be able to collect actionable performance data about ExaRL. Most likely, we will need to use a combination of 'black box' profiling tools and domain-specific tools which have some awareness of what kinds of calculations the code is doing.
Collecting performance data
We can use a combination of performance analysis tools to understand the overall performance characteristics of the code. A few components and proposals are described below.
Timers
These are simple to implement and can summarize strong and weak scaling behavior easily.
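For illustration, a minimal sketch of such a timer (this is not ExaRL's actual timer API) that could record per-phase times for scaling plots:
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timer(label):
    """Accumulate wall-clock time spent inside a labeled region."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - t0

# Usage sketch:
with timer("train"):
    time.sleep(0.1)   # stand-in for a training step
print(timings)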
MPI analysis
ExaRL uses MPI for inter-node communication. MPI performance is straightforward to measure, including characteristics like time spent at barriers, load imbalance, etc. Many tools can measure these quantities via sampling of each MPI task; this activity has low overhead, and can typically be used for even high-MPI-concurrency runs. Open source tools like TAU and HPCToolkit can be used for this, along with several other proprietary tools like Arm MAP. So far we have been using TAU on the Cori GPU cluster at NERSC, with reasonably good results. The jumpshot GUI shows results like the following, and the pprof analysis tool shows quantitative results:
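As a lightweight complement to those tools, the MPI fraction can also be hand-timed directly in the code; a minimal mpi4py sketch on a toy loop (not ExaRL code):
from mpi4py import MPI
import time
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

t_mpi = 0.0
t_total = -time.perf_counter()
local = np.random.rand(1_000_000)
for _ in range(10):
    local = np.sqrt(local) + 1.0                         # "compute" phase
    t0 = time.perf_counter()
    total = comm.allreduce(local.sum(), op=MPI.SUM)      # communication phase, timed separately
    t_mpi += time.perf_counter() - t0
t_total += time.perf_counter()

if rank == 0:
    print(f"fraction of wall time in MPI: {t_mpi / t_total:.1%}")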
GPU performance
Nsight Systems can be useful for understanding data movement between CPU <-> GPU during the run, and also for gaining insight into any GPU kernels which are not part of TensorFlow. It also uses sampling to collect data, and thus can be used as a 'black box' profiling tool with relatively low overhead; since a typical ExaRL calculation is quite long, it is best to disable CPU sampling in Nsight Systems by adding the -s none flag; otherwise, the resulting profile will be enormous and will take hours to process. An example profile using Nsight Systems is shown below. In this case, it looks like the majority of GPU activity is spent executing TDLG kernels, in which case Nsight Compute could possibly be used to improve performance.
Nsight Compute can be useful for tuning hand-written kernels, like LibTDLG, but it is much less useful when the runtime is dominated by TensorFlow activity.
TensorFlow performance
TensorFlow includes its own profiling framework which, unlike 'black box' profiling tools, has significant domain-specific awareness about what the calculation is doing. It is likely we will need to rely on this to supplement the above tools if the goal is to improve the performance of the TensorFlow portions of ExaRL.
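For example, in TensorFlow 2.x the profiler can be wrapped around a region of interest roughly like this (the log directory name is arbitrary):
import tensorflow as tf

# Sketch: wrap the region of interest with TensorFlow's built-in profiler.
# The resulting trace can then be inspected in TensorBoard's Profile tab.
tf.profiler.experimental.start("logs/exarl_profile")
# ... run the TensorFlow portion of the training step here ...
tf.profiler.experimental.stop()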
Representing and tracking performance data
It will be useful to track the performance characteristics of ExaRL as the code base develops. One way to do this is to integrate ExaRL into the ECP GitLab continuous integration infrastructure which is already available at NERSC. We could configure a GitLab runner to launch once per day, check out the latest version of the code, run it, and store the performance characteristics of the code in a file or database that is then visualized on a website. The website can be hosted in Spin.
Some characteristics like execution time, or fraction of time spent in MPI, can be represented simply with timings, and plotted on a graph as a function of git commit or date, like here. Other characteristics which may have more complex information, like the output from Nsight Systems or Nsight Compute or TensorFlow Profiler, may require a different approach for visualizing the data as a function of time.
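For the simple scalar quantities, the storage could be as basic as appending one row per nightly run to a CSV file that the website reads; a sketch in which the file name, metric names, and values are all placeholders:
import csv
import datetime
import pathlib

# Sketch: append one row of summary metrics per nightly run to a CSV file that a
# simple web page (e.g., hosted in Spin) could read and plot over time.
row = {
    "commit": "a07c515",                     # placeholder git hash
    "date": datetime.date.today().isoformat(),
    "runtime_sec": 1830.0,                   # placeholder measurements
    "pct_mpi": 12.5,
    "pct_memcpy": 4.2,
}

path = pathlib.Path("performance-history.csv")
write_header = not path.exists()
with path.open("a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=row.keys())
    if write_header:
        writer.writeheader()
    writer.writerow(row)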