Closed Baharis closed 3 weeks ago
Notes:
@UsageMonitor()
some slow function of your choice (diffbragg? cctbx.xfel itself?) to check if you like it.
This branch was originally built on top of branch memory_policy, but I want to merge it into master instead, since none of the changes made here strictly relate to the memory policy. To use the ResourceMonitor on top of the changes made to memory_policy, use branch memory_policy_monitor_backup, which stops at commit 37ee7e34850638c02c43cc63894eed67319efab0, before cctbx master was merged into it.
Note: while testing some hanging code with ResourceMonitor, I realized it could be easily adapted to automatically detect and MPI.Abort() hanging processes. Here is one example where the process hangs after the 15-minute mark.
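A minimal sketch of how such hang detection might look, assuming a hang can be recognized from the sampled CPU usage staying in a "stalled" state for a sustained period. HangDetector, its is_stalled predicate, and the on_hang callback are hypothetical and not part of ResourceMonitor. Which usage pattern actually signals a hang is application-specific (an idle CPU, or a CPU pinned at 100% by a busy-waiting call), so the predicate is left to the caller.

```python
from typing import Callable, Optional


class HangDetector:
    """Report a hang once samples stay 'stalled' for too long (hypothetical helper)."""

    def __init__(self,
                 is_stalled: Callable[[float], bool],
                 max_stall_seconds: float = 900.0,
                 on_hang: Optional[Callable[[], None]] = None) -> None:
        self.is_stalled = is_stalled                # decides if one CPU % sample looks stalled
        self.max_stall_seconds = max_stall_seconds  # tolerated stall time, e.g. 15 minutes
        self.on_hang = on_hang                      # under MPI this could call comm.Abort()
        self.stalled_since: Optional[float] = None  # time of the first stalled sample in a row
        self.triggered = False                      # ensures on_hang fires only once per stall

    def update(self, timestamp: float, cpu_percent: float) -> bool:
        """Feed one sample; return True while the process is considered hanging."""
        if not self.is_stalled(cpu_percent):
            # Activity seen: reset the stall clock
            self.stalled_since = None
            self.triggered = False
            return False
        if self.stalled_since is None:
            self.stalled_since = timestamp
        hanging = timestamp - self.stalled_since >= self.max_stall_seconds
        if hanging and not self.triggered:
            self.triggered = True
            if self.on_hang is not None:
                self.on_hang()
        return hanging
```

The detector would be fed once per monitoring period from the sampling loop; keeping the abort action in a callback leaves the detection logic free of any MPI dependency.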
Also, I like this feature, but I'm not sure if anyone else does. It just makes it easier for another application to put all the logs in one place.
diff --git a/libtbx/resource_monitor.py b/libtbx/resource_monitor.py
index 588c6b0cbc..f624be883f 100644
--- a/libtbx/resource_monitor.py
+++ b/libtbx/resource_monitor.py
@@ -268,6 +268,13 @@ class RankInfo:
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ MONITORING ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
+def _make_dir_if_missing(prefix):
+  if comm.rank==0:
+    dirname = os.path.dirname(prefix)
+    if dirname:
+      os.makedirs(dirname, exist_ok=True)
+  comm.barrier()
+  return prefix
class ResourceMonitor(ContextDecorator):
"""
@@ -323,10 +330,10 @@ class ResourceMonitor(ContextDecorator):
self.detail: 'ResourceMonitor.Detail' = self.Detail(detail)
self.period: float = period # <5 sec. de-prioritizes sub-procs & they stop
self.plot: bool = plot
- self.prefix: str = prefix if prefix else 'monitor'
+ self.prefix: str = _make_dir_if_missing(prefix) if prefix else 'monitor'
self.write: bool = write
self.rank_info: RankInfo = RankInfo()
- logger_name = 'libtbx.resource_monitor.' + self.prefix
+ logger_name = 'libtbx.resource_monitor.' + os.path.basename(self.prefix)
self.log_manager = ResourceLogManager(logger_name)
self.log: logging.Logger = self.get_logger()
self.log.info(f'Collecting CPU stats with {self.rank_info.cpu_probe.kind=}')
Beyond that, I think the figure could use a larger font by default, and be compressed to display fully on a smaller monitor screen - but whatever. Just my two cents :)
Me running ResourceMonitor on diffBragg.stills_process (formerly known as diffBragg.hopper_process; it's essentially dials.stills_process, but with a per-shot diffBragg refinement step on GPUs):
Above is for 4 nodes, 4 GPUs per node, 4 ranks per GPU (16 ranks per node)
By eye, it seems my CPU memory usage has a slight slope to it. I wonder if it's worth displaying a slope of sorts on these plots to indicate a potential memory leak.
@dermen Thanks for these great suggestions and the fix for Python <3.9; Alcc-recipes now installs Python 3.9, so I didn't catch it. The _make_dir_if_missing also seems worth adding just in case someone doesn't want their work directory buried under thousands of files. Also, now that you mention it, the font is indeed tiny; a slight upgrade here won't hurt.
As for the slope, do you think auto-fitting a line there offers more information than an informed view? Outliers are bound to happen (such as the 0% at the start) and I think there might be better points to focus on than adding a Huber loss function or something to auto-cover these cases – maybe an orthogonal grid would suffice.
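For reference, the kind of automatic slope estimate under discussion could look like the sketch below: an ordinary least-squares fit of memory usage against time, with no outlier handling, so a stray 0% sample at the start would bias it exactly as described above. The function memory_slope and the sample values are hypothetical and not part of ResourceMonitor.

```python
from typing import Sequence


def memory_slope(times: Sequence[float], usage: Sequence[float]) -> float:
    """Ordinary least-squares slope of usage vs. time (usage units per time unit)."""
    if len(times) != len(usage) or len(times) < 2:
        raise ValueError('need at least two (time, usage) pairs of equal length')
    n = len(times)
    mean_t = sum(times) / n
    mean_u = sum(usage) / n
    # slope = covariance(t, u) / variance(t)
    cov = sum((t - mean_t) * (u - mean_u) for t, u in zip(times, usage))
    var = sum((t - mean_t) ** 2 for t in times)
    if var == 0:
        raise ValueError('all time stamps are identical')
    return cov / var


# Made-up samples: memory % logged every 5 s, growing by 0.5 % per sample
times = [0.0, 5.0, 10.0, 15.0, 20.0]
usage = [40.0, 40.5, 41.0, 41.5, 42.0]
slope_per_minute = 60.0 * memory_slope(times, usage)
print(f'memory usage changes by {slope_per_minute:.2f} % per minute')
```

A plain least-squares fit is the simplest choice; a robust alternative such as a Huber loss would down-weight outliers at the cost of extra complexity.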
Also, after manually generating plots for a few of my failed jobs, I noticed that I strongly need to expose the manual plot_logs function as a bash command; opening Python to repeatedly type two lines gets annoying quickly.
So thanks once again and changes coming tomorrow!
@Baharis, I think adding the gridlines will suffice!
@dermen Updated. I increased the font size only a bit (10->12); any more and it quickly became cluttered. Also, either I have forgotten how to properly LIBTBX_SET_DISPATCHER_NAME or libtbx has some special behavior here – whatever the case, the name libtbx.resource_monitor_plot doesn't want to register properly 🤷.
Looks all good to me!
Regarding the command line script, I'm not the expert, but maybe try putting a thin wrapper around your plot_logs function in libtbx/command_line/run_resource_monitor_plot.py. Add the dispatch flag in that script, and then add the imports / run your application (oh, and then do a libtbx.refresh).
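A rough sketch of what such a thin wrapper might look like. The LIBTBX_SET_DISPATCHER_NAME comment is the dispatch flag mentioned above; everything else is an assumption. In particular, plot_logs is replaced by a local stand-in so the example runs on its own, and its real signature in libtbx.resource_monitor may differ. The single optional prefix argument is hypothetical as well.

```python
# LIBTBX_SET_DISPATCHER_NAME libtbx.resource_monitor_plot
import argparse
import sys
from typing import List, Optional


def plot_logs(prefix: str = 'monitor') -> None:
    """Stand-in so the sketch is self-contained. The real script would import
    plot_logs from libtbx.resource_monitor; its actual signature is assumed."""
    print(f'would plot logs with prefix {prefix!r}')


def run(args: Optional[List[str]] = None) -> str:
    """Parse the (hypothetical) command-line arguments and call plot_logs."""
    parser = argparse.ArgumentParser(
        description='Plot resource usage from existing monitor log files')
    parser.add_argument('prefix', nargs='?', default='monitor',
                        help='filename prefix of the log files')
    namespace = parser.parse_args(args)
    plot_logs(prefix=namespace.prefix)
    return namespace.prefix


if __name__ == '__main__':
    run(sys.argv[1:])
```

After placing a script like this in the command_line/ directory, libtbx.refresh regenerates the dispatchers so that the new name becomes available.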
@dermen It completely slipped my attention that the wrapper should be inside the command_line/ directory! Thank you so much!!! Now if the monitor fails for any reason, the produced logs can still be used to create the plot by simply calling libtbx.resource_monitor_plot (no additional args necessary unless you change the monitor's base name).
Bonus image of the most recent version of the plot with updated grid and font style; in this particular case, the ranks were failing one by one, which manifests here as CPU usage rising to 100%.
I will squash-merge this into master as soon as all tests succeed.
To profile SPREAD refinement, I wrote myself a relatively simple resource monitor that gathers CPU/GPU usage/memory %, logs, and then plots it. Since it gained traction in the "Xfels are great!" chatroom, I decided to generalize it and present it in a pull request. It still offers rather low time resolution, but since the last discussion, I made it more robust and extendable to different CPU/GPU architectures.
The main class to interface with here can be accessed via
from libtbx.resource_monitor import ResourceMonitor
(exact file placement to be discussed). The resource monitor works with MPI by default. It collects information about individual ranks with no communication – communicating ranks offered some cool benefits, but also caused significant problems and slow-downs. The ResourceMonitor can be used in the following ways:
As a decorator – to monitor every call of a given function, decorate the function definition itself. The monitor will start every time the function is called and stop every time the function terminates.
As a context manager – to monitor a part of the code, place it within a context manager using a with statement. The monitor will start at the start of the with block and terminate at the end of the with block.
As a standalone instance – for more advanced or customizable applications, the monitor can be manually instantiated, started, stopped, etc. This is roughly equivalent to the context manager approach.
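The three usage modes can be sketched with a small stand-in class. DemoMonitor is hypothetical and only records when it starts and stops; with the real class one would write @ResourceMonitor() or with ResourceMonitor(): instead. The start() and stop() method names used in the standalone case are an assumption, not the confirmed ResourceMonitor interface.

```python
import time
from contextlib import ContextDecorator


class DemoMonitor(ContextDecorator):
    """Stand-in for ResourceMonitor that only records start/stop events."""

    def __init__(self, prefix: str = 'monitor') -> None:
        self.prefix = prefix
        self.events = []  # list of (kind, timestamp) tuples

    def start(self) -> None:
        self.events.append(('start', time.perf_counter()))

    def stop(self) -> None:
        self.events.append(('stop', time.perf_counter()))

    def __enter__(self) -> 'DemoMonitor':
        self.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        self.stop()
        return False  # do not swallow exceptions raised in the monitored code


def work() -> int:
    return sum(range(1000))


# 1. Decorator: the monitor starts and stops around every call
decorator_monitor = DemoMonitor()

@decorator_monitor
def decorated_work() -> int:
    return work()

decorated_work()

# 2. Context manager: the monitor covers the body of the with block
with DemoMonitor() as context_monitor:
    work()

# 3. Standalone instance: explicit start/stop, roughly what the with block does
standalone_monitor = DemoMonitor()
standalone_monitor.start()
try:
    work()
finally:
    standalone_monitor.stop()
```

Because the class derives from contextlib.ContextDecorator, the decorator form comes for free once __enter__ and __exit__ are defined, which is why the first two modes share one implementation.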
In particular, using the last approach I wrote a simple cctbx.xfel.merge MonitorWorker accessible via the dispatch keyword monitor. The first (and every odd) monitor in dispatch.step_list starts a global instance of ResourceMonitor. The second (and every even) monitor stops the same global instance. Since the monitor worker ignores additional_info, I suggest appending meaningful suffixes to monitor, for example:
dispatch.step_list = input balance monitor_start annulus monitor_stop trumpet
to time the annulus worker only.
Both ResourceMonitor and MonitorWorker accept the following arguments/monitor-scope phil parameters:
detail: choice = *rank node rank0 none – Detail of data to be collected: from every rank, from the first rank on every node, from rank 0 only, or none.
period: float = 5.0 – Interval between subsequent resource statistics checks, in seconds. Short periods might lead to inconsistent logging.
plot: bool = True – Plot the resource usage history after the monitor is stopped.
prefix: str = monitor – Filename prefix for log files and summary plot.
write: bool = True – Write collected resource information to log files.
Log files are written in the working directory and feature a human- and computer-readable structure that can be used to recreate plots if necessary.
Simple line plots are produced whenever ResourceMonitor stops, i.e. whenever decorated code terminates successfully or raises an Exception (this image is slightly outdated, but better showcases the worker than the ones obtained in later debugging).