Monitor resource usage - Githubissues

Baharis commented 1 month ago

To profile SPREAD refinement, I wrote myself a relatively simple resource monitor that gathers CPU/GPU usage/memory %, logs, and then plots it. Since it gained traction in the "Xfels are great!" chatroom, I decided to generalize it and present it in a pull request. It still offers rather low time resolution, but since the last discussion, I made it more robust and extendable to different CPU/GPU architectures.

The main class to interface with can be imported via from libtbx.resource_monitor import ResourceMonitor (exact file placement to be discussed). By default, the resource monitor works with MPI. It collects information about individual ranks with no communication between them; communicating ranks offered some cool benefits, but also caused significant problems and slow-downs. The ResourceMonitor can be used in the following ways:

As a decorator – to monitor every call of a given function, decorate the function definition itself. The monitor will start every time the function is called and stop every time the function returns or raises:
```
@ResourceMonitor(*args)
def function_to_be_monitored():
    stuff_to_be_timed()
```
As a context manager – to monitor a part of the code, place it within a context manager using a with statement. The monitor will start at the start of the with block and stop at the end of the with block:
```
with ResourceMonitor(*args):
    stuff_to_be_timed()
```
As a standalone instance – for more advanced or customizable applications, the monitor can be manually instantiated, started, stopped, etc. The following would be roughly equivalent to the context manager approach:
```
rm = ResourceMonitor(*args)
try:
    rm.start()
    stuff_to_be_timed()
finally:
    rm.stop()
```
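As a minimal sketch of how one class can serve all three usage styles: the diff further down shows ResourceMonitor deriving from ContextDecorator, and the standard-library contextlib.ContextDecorator provides exactly this behavior. ToyMonitor below is a hypothetical stand-in that only records timestamps; it is not the actual implementation.

```python
import time
from contextlib import ContextDecorator


class ToyMonitor(ContextDecorator):
    """Hypothetical stand-in for ResourceMonitor; it only records events."""

    def __init__(self):
        self.events = []

    def start(self):
        self.events.append(('start', time.monotonic()))

    def stop(self):
        self.events.append(('stop', time.monotonic()))

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.stop()    # runs whether the block succeeded or raised
        return False   # never swallow the exception


monitor = ToyMonitor()


@monitor                      # decorator style
def monitored_function():
    return 42


result = monitored_function()

with monitor:                 # context-manager style
    pass

event_names = [name for name, _ in monitor.events]
```

Because ContextDecorator wraps the call in a with block, the decorated function and the explicit with statement go through the same __enter__/__exit__ pair.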

In particular, using the last approach I wrote a simple cctbx.xfel.merge MonitorWorker accessible via the dispatch keyword monitor. The first (and every odd) monitor in dispatch.step_list starts a global instance of ResourceMonitor. The second (and every even) monitor stops the same global instance. Since the monitor worker ignores additional_info, I suggest appending meaningful suffixes to monitor, for example: dispatch.step_list = input balance monitor_start annulus monitor_stop trumpet to time the annulus worker only.
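The odd/even toggling described above can be sketched roughly as follows. This is an illustration under assumptions, not the MonitorWorker code: ToyMonitor, run_monitor_step, and matching steps by their "monitor" prefix are all hypothetical.

```python
class ToyMonitor:
    """Hypothetical stand-in for the global ResourceMonitor instance."""

    def __init__(self):
        self.running = False
        self.history = []

    def start(self):
        self.running = True
        self.history.append('start')

    def stop(self):
        self.running = False
        self.history.append('stop')


GLOBAL_MONITOR = ToyMonitor()


def run_monitor_step(monitor=GLOBAL_MONITOR):
    """Toggle the monitor: odd calls start it, even calls stop it."""
    if monitor.running:
        monitor.stop()
    else:
        monitor.start()


step_list = 'input balance monitor_start annulus monitor_stop trumpet'.split()
for step in step_list:
    if step.startswith('monitor'):   # the suffix is informational only
        run_monitor_step()
    else:
        GLOBAL_MONITOR.history.append(step)  # stands in for running the worker
```

Since only the call order matters, the _start and _stop suffixes serve purely as labels for the human reader.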

Both ResourceMonitor and MonitorWorker accept the following arguments/monitor-scope phil parameters:

detail: choice = *rank node rank0 none – Detail of data to be collected: from every rank, from the first rank on every node, from rank 0 only, or none.
period: float = 5.0 – Interval between subsequent resource statistics checks in seconds. Short periods might lead to inconsistent logging.
plot: bool = True – Plot the resource usage history after the monitor is stopped.
prefix: str = monitor – Filename prefix for log files and summary plot.
write: bool = True – Write collected resource information to log files.
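To make the detail choices concrete, here is a small sketch of how a rank might decide whether to collect statistics. The Detail enum mirrors the four phil choices; should_collect and its local_rank argument (the rank's index within its node) are hypothetical names used for illustration only.

```python
from enum import Enum


class Detail(Enum):
    RANK = 'rank'     # every rank collects
    NODE = 'node'     # first rank on every node collects
    RANK0 = 'rank0'   # only global rank 0 collects
    NONE = 'none'     # nothing is collected


def should_collect(detail, rank, local_rank):
    """Return True if this rank should gather statistics.

    rank is the global MPI rank; local_rank is the rank's index
    within its node (both are assumed to be known to the caller).
    """
    detail = Detail(detail)
    if detail is Detail.RANK:
        return True
    if detail is Detail.NODE:
        return local_rank == 0
    if detail is Detail.RANK0:
        return rank == 0
    return False


# Example: 16 ranks per node, so global rank 16 is the first rank on node 1.
collectors = [r for r in range(32) if should_collect('node', r, r % 16)]
```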

Log files are written in the working directory and feature human- and computer-readable structure that can be used to recreate plots if necessary:

2024-05-28 17:26:07,205 - Collecting CPU stats with self.rank_info.cpu_probe.kind=psutil
2024-05-28 17:26:07,205 - Collecting GPU stats with self.rank_info.cpu_probe.kind=Nvidia
2024-05-28 17:26:07,257 - UsageStats(cpu_usage=0.0, cpu_memory=0.28741941078671696, gpu_usage=0.0, gpu_memory=0.0418212890625)
2024-05-28 17:26:12,275 - UsageStats(cpu_usage=98.2, cpu_memory=0.2874998541016293, gpu_usage=0.0, gpu_memory=0.0418212890625)
2024-05-28 17:26:17,280 - UsageStats(cpu_usage=99.5, cpu_memory=0.2874998541016293, gpu_usage=0.0, gpu_memory=0.0418212890625)
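Since the log lines follow a regular pattern, they can be read back with a few lines of standard-library Python. The sketch below assumes the format shown in the excerpt above (timestamp, " - ", then UsageStats(key=value, ...)); parse_usage_line is a hypothetical helper, separate from the plot_logs function discussed later in this thread.

```python
import re
from datetime import datetime

LINE_RE = re.compile(
    r'^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})'
    r' - UsageStats\((?P<fields>.*)\)$'
)


def parse_usage_line(line):
    """Return (timestamp, stats dict) or None for non-UsageStats lines."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match['stamp'], '%Y-%m-%d %H:%M:%S,%f')
    stats = {}
    for pair in match['fields'].split(', '):
        key, value = pair.split('=')
        stats[key] = float(value)
    return stamp, stats


log_text = """\
2024-05-28 17:26:07,205 - Collecting CPU stats with self.rank_info.cpu_probe.kind=psutil
2024-05-28 17:26:07,257 - UsageStats(cpu_usage=0.0, cpu_memory=0.28741941078671696, gpu_usage=0.0, gpu_memory=0.0418212890625)
2024-05-28 17:26:12,275 - UsageStats(cpu_usage=98.2, cpu_memory=0.2874998541016293, gpu_usage=0.0, gpu_memory=0.0418212890625)
"""

parsed = [p for p in map(parse_usage_line, log_text.splitlines()) if p]
elapsed = (parsed[1][0] - parsed[0][0]).total_seconds()
```

Here elapsed is the time between the two samples, which comes out slightly above the 5-second period because of scheduling overhead.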

Simple line plots are produced whenever ResourceMonitor stops i.e. whenever decorated code terminates successfully or raises an Exception:

(this image is slightly outdated, but better showcases the worker than the ones obtained in later debugging)

Baharis commented 1 month ago

Notes:

Generalizing this took slightly longer than expected.
Since I am medically unable to rebase as well as apparently addicted to watching the commit count tick up, this will have to be squash-merged if approved.
@nksauter This is the tool I used to profile SPREAD step 11, scenario S1 refinement.
@phyy-nx, @dermen, @dwpaley, I took the liberty to mark you as reviewers in case you have the time and inclination to decorate some slow function of your choice (diffbragg? cctbx.xfel itself?) with @ResourceMonitor() to check if you like it.

Baharis commented 1 month ago

This branch was originally built on top of branch memory_policy, but I want to merge it into master instead, since none of the changes made here strictly relate to the memory policy. To use the ResourceMonitor on top of the changes made to memory_policy, use branch memory_policy_monitor_backup, which stops at commit 37ee7e34850638c02c43cc63894eed67319efab0, before cctbx master was merged into it.

Baharis commented 1 month ago

Note: while testing some hanging code with ResourceMonitor I realized it could be easily adapted to automatically detect and MPI.Abort() hanging processes. Here is one example where the process hangs after the 15-minute mark.

dermen commented 3 weeks ago

Also, I like this feature, but not sure if anyone else does. Just makes it easier for another application to put all the logs in one place..

diff --git a/libtbx/resource_monitor.py b/libtbx/resource_monitor.py
index 588c6b0cbc..f624be883f 100644
--- a/libtbx/resource_monitor.py
+++ b/libtbx/resource_monitor.py
@@ -268,6 +268,13 @@ class RankInfo:

 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ MONITORING ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #

+def _make_dir_if_missing(prefix):
+  if comm.rank==0:
+    dirname = os.path.dirname(prefix)
+    if dirname:
+      os.makedirs(dirname, exist_ok=True)
+  comm.barrier()
+  return prefix

 class ResourceMonitor(ContextDecorator):
   """
@@ -323,10 +330,10 @@ class ResourceMonitor(ContextDecorator):
     self.detail: 'ResourceMonitor.Detail' = self.Detail(detail)
     self.period: float = period  # <5 sec. de-prioritizes sub-procs & they stop
     self.plot: bool = plot
-    self.prefix: str = prefix if prefix else 'monitor'
+    self.prefix: str = _make_dir_if_missing(prefix) if prefix else 'monitor'
     self.write: bool = write
     self.rank_info: RankInfo = RankInfo()
-    logger_name = 'libtbx.resource_monitor.' + self.prefix
+    logger_name = 'libtbx.resource_monitor.' + os.path.basename(self.prefix)
     self.log_manager = ResourceLogManager(logger_name)
     self.log: logging.Logger = self.get_logger()
     self.log.info(f'Collecting CPU stats with {self.rank_info.cpu_probe.kind=}')

dermen commented 3 weeks ago

Beyond that, I think the figure could use larger font by default, and be compressed to display fully on a smaller monitor screen - but whatever.. Just my two cents :)

dermen commented 3 weeks ago

Me running ResourceMonitor on diffBragg.stills_process (formerly known as diffBragg.hopper_process; it's essentially dials.stills_process, but with a per-shot diffBragg refinement step on GPUs):

Screenshot 2024-06-10 at 10 57 30 PM

Above is for 4 nodes, 4 GPUs per node, 4 ranks per GPU (16 ranks per node)

dermen commented 3 weeks ago

By eye, it seems my CPU memory usage has a slight slope to it. I wonder if it's worth displaying a slope of sorts on these plots to indicate a potential memory leak.

Baharis commented 3 weeks ago

@dermen Thanks for these great suggestions and the fix for python <3.9; Alcc-recipes now installs Python3.9, so I didn't catch it. The _make_dir_if_missing also seems worth adding just in case someone doesn't want their work directory buried under thousands of files. Also, now that you mentioned it, the font is indeed tiny, slight upgrade here won't hurt.

As for the slope, do you think auto-fitting a line there offers more information than an informed view? Outliers are bound to happen (such as the 0% at the start) and I think there might be better points to focus on than adding a Huber loss function or something to auto-cover these cases – maybe an orthogonal grid would suffice.
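To illustrate the outlier concern, here is a sketch comparing an ordinary least-squares slope with a robust alternative. Note that it swaps in the Theil-Sen estimator (median of pairwise slopes) rather than the Huber loss mentioned above, simply because it fits in a few lines of standard-library Python. The data are synthetic: memory grows by 0.25 % per second from a 10 % baseline, with a spurious 0 % first sample like the one seen at the start of the plots.

```python
from itertools import combinations
from statistics import median


def least_squares_slope(times, values):
    """Ordinary least-squares slope; sensitive to outliers."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    numerator = sum((t - mean_t) * (v - mean_v)
                    for t, v in zip(times, values))
    denominator = sum((t - mean_t) ** 2 for t in times)
    return numerator / denominator


def theil_sen_slope(times, values):
    """Median of all pairwise slopes; robust to a minority of outliers."""
    points = list(zip(times, values))
    slopes = [(v2 - v1) / (t2 - t1)
              for (t1, v1), (t2, v2) in combinations(points, 2)
              if t2 != t1]
    return median(slopes)


# Synthetic samples every 5 s; the first reading is an artificial 0 % outlier.
times = [0, 5, 10, 15, 20, 25]
memory = [0.0, 11.25, 12.5, 13.75, 15.0, 16.25]

ols = least_squares_slope(times, memory)
robust = theil_sen_slope(times, memory)
```

On this toy data the single outlier roughly doubles the least-squares slope, while the median-based estimate recovers the underlying 0.25 % per second; whether that extra machinery beats simply adding gridlines is a separate question.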

Also, after manually generating plots for a few of my failed jobs, I noticed that I really need to expose the manual plot_logs function as a bash command; opening python to repeatedly type two lines gets annoying quickly.

So thanks once again and changes coming tomorrow!

dermen commented 3 weeks ago

@Baharis , I think adding the gridlines will suffice!

Baharis commented 3 weeks ago

@dermen Updated. I increased the font size only a bit (10->12), more and it quickly became cluttered. Also, either I have forgotten how to properly LIBTBX_SET_DISPATCHER_NAME or libtbx has some special behavior here – whatever the case, the name libtbx.resource_monitor_plot doesn't want to register properly 🤷.

dermen commented 3 weeks ago

Looks all good to me!

dermen commented 3 weeks ago

Regarding the command line script, I'm not the expert, but maybe try putting a thin wrapper around your plot_logs function in

libtbx/command_line/run_resource_monitor_plot.py


```
from __future__ import division
# LIBTBX_SET_DISPATCHER_NAME libtbx.awesome_plot
import argparse
import inspect
from libtbx.resource_monitor import plot_logs

# Reuse the docstring of plot_logs as the command-line description
parser = argparse.ArgumentParser(description=str(inspect.getdoc(plot_logs)))
# nargs='?' makes the positional optional, so that its default applies
parser.add_argument('prefix', type=str, nargs='?', default='monitor*.log',
                    help='Glob matching all log files to be plotted')
parser.add_argument('-o', '--output', type=str, default='monitor.png',
                    help='Filepath to save the final plot under')
args = parser.parse_args()
plot_logs(log_glob=args.prefix, save_path=args.output)
```

Add the dispatch flag in that script, and then add the imports / run your application (oh, and then do a libtbx.refresh).

Baharis commented 3 weeks ago

@dermen It completely slipped my attention that the wrapper should be inside the command_line/ directory! Thank you so much!!! Now if the monitor fails for any reason, produced logs can be still used to create the plot by simply calling libtbx.resource_monitor_plot (no additional args necessary unless you change monitor's base name).

Baharis commented 3 weeks ago

Bonus image of the most recent version of the plot with updated grid and font style; in this particular case, the ranks were failing one by one, which manifests here as CPU usage rising to 100%.

Baharis commented 3 weeks ago

I will squash-merge this into master as soon as all tests succeed.

cctbx / cctbx_project

Monitor resource usage #994