MountaintopLotus / braintrust

A Dockerized platform for running Stable Diffusion, on AWS (for now)
Apache License 2.0

Monitoring system resource usages #113

Open JohnTigue opened 1 year ago

JohnTigue commented 1 year ago

BrainTrust containers should have easy-to-use ways of monitoring usage of the system resources such as storage, network, and compute, especially the GPU. We want to have monitors implemented for both GUI and TUI.

Of course, Jupyter has terminals, so TUI monitor solutions are a cheap, not-too-clunky way of getting monitoring dashboards into Jupyter. That is, a TUI solution can do double duty as a GUI solution.

A promising legit real GUI solution might be a Jupyter widget. Jupyter widgets can run outside of Jupyter notebooks, so such a GUI monitoring widget could also be used in a cluster-wide dashboard, not just for individual servers.

See also #98 as we would like the TUI solutions to work well with tmux. We want the TUI dashboards to be implemented as tmux sessions. (Hopefully, the Jupyter terminal will also work well with tmux…)

JohnTigue commented 1 year ago

For GPU monitoring in a TUI context there are multiple options:

JohnTigue commented 1 year ago

One TUI way that might well work with the Jupyter terminal would simply be to run clear and then nvidia-smi -l 5, i.e. just re-run nvidia-smi every 5 seconds.
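
If a more compact readout than the full nvidia-smi table is wanted, the same polling idea could be sketched in Python. This is only a rough sketch, not something tested in BrainTrust: the function names (parse_gpu_csv, query_gpus, poll) and the choice of fields are made up for illustration, and it assumes nvidia-smi is on the PATH and that GPU names contain no commas.

```python
import subprocess
import time

# Fields requested through nvidia-smi's --query-gpu option.
QUERY_FIELDS = [
    "index", "name", "utilization.gpu",
    "memory.used", "memory.total", "temperature.gpu",
]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip anything that is not a complete row
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (needs an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


def poll(interval_s=5, rounds=None):
    """Print one line per GPU every interval_s seconds (forever if rounds is None)."""
    n = 0
    while rounds is None or n < rounds:
        for gpu in query_gpus():
            print(
                f"GPU {gpu['index']} {gpu['name']}: {gpu['utilization.gpu']}% util, "
                f"{gpu['memory.used']}/{gpu['memory.total']} MiB, "
                f"{gpu['temperature.gpu']} C"
            )
        n += 1
        time.sleep(interval_s)
```

Calling poll(interval_s=5) from a Jupyter terminal would roughly mirror nvidia-smi -l 5, with one line per GPU instead of the full table.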

JohnTigue commented 1 year ago

We could also start the tmux session with a one-time status such as nvidia-smi --list-gpus.

JohnTigue commented 1 year ago

Here's [1] a super simple, not bad at all way of doing TUI in GUI (i.e. running inside a Jupyter CLI terminal):

watch -d -n 0.5 nvidia-smi

man watch tells us the -d flag highlights differences between successive outputs, so it helps show which metrics are changing over time.

I just tested that now and it works well, including continually highlighting the delta (it inverts text/bg => white/black), which in the screenshot below is the time and temperature changing.

[Screenshot 2023-09-24 at 6 12 47 PM]

JohnTigue commented 1 year ago

It also sounds like tmux runs inside of Jupyter's terminal. That's great. Sounds like it was actually running better in Classic than the newer JupyterLab. The issue is still open: Unable to override tmux mouse mode in jupyterlab terminal #13005.

So, we should definitely see if we can set up a nice tmux dashboard that runs within Jupyter's terminal.
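
As a starting point, such a dashboard might be scripted along these lines. This is a minimal sketch under assumptions: the session name "monitor", the two-pane layout, and the use of top in the second pane are placeholders chosen for illustration, not decisions made in this issue. It assumes tmux, watch, and nvidia-smi are installed in the container.

```python
import subprocess


def dashboard_commands(session="monitor"):
    """Build the tmux commands for a two-pane monitoring session.

    Pane 1 runs the watch/nvidia-smi combination from this thread; pane 2
    runs top as a stand-in for a CPU/memory monitor (an assumption).
    """
    return [
        # Create a detached session whose first pane shows GPU status.
        ["tmux", "new-session", "-d", "-s", session,
         "watch -d -n 0.5 nvidia-smi"],
        # Split the window side by side and run a process monitor.
        ["tmux", "split-window", "-h", "-t", session, "top"],
    ]


def start_dashboard(session="monitor"):
    """Create the session, then attach to it (must be run from a terminal)."""
    for cmd in dashboard_commands(session):
        subprocess.run(cmd, check=True)
    subprocess.run(["tmux", "attach-session", "-t", session], check=True)
```

Running start_dashboard() from a Jupyter terminal would be one way to check whether tmux behaves acceptably there, given the open mouse-mode issue mentioned above.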

JohnTigue commented 1 year ago

Network traffic is another useful thing to monitor. For example, if many gigabytes of data need to be downloaded during a set-up which gives no UI feedback while each individual file downloads, then peeking at the inbound network traffic is a way to see that something is happening.
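
One low-tech way to get that peek, sketched here under the assumption that the container is Linux and exposes /proc/net/dev, is to sample the per-interface byte counters and print the inbound rate. The function names below are hypothetical.

```python
import time


def parse_proc_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 9:
            continue
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def read_counters(path="/proc/net/dev"):
    with open(path) as f:
        return parse_proc_net_dev(f.read())


def rx_rates(before, after, seconds):
    """Inbound bytes per second for each interface present in both samples."""
    return {
        name: (after[name][0] - before[name][0]) / seconds
        for name in after
        if name in before
    }


def watch_network(interval_s=1.0, rounds=None):
    """Print inbound MB/s per interface every interval_s seconds."""
    prev = read_counters()
    n = 0
    while rounds is None or n < rounds:
        time.sleep(interval_s)
        cur = read_counters()
        for name, rate in sorted(rx_rates(prev, cur, interval_s).items()):
            print(f"{name}: {rate / 1e6:.2f} MB/s in")
        prev = cur
        n += 1
```

During a long download, watch_network() in a spare terminal pane should show a clearly non-zero inbound rate on the active interface.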

JohnTigue commented 1 year ago

Kaggle has a system monitor in notebooks. It has a UI for each of the two GPUs.

I wonder if that is a Jupyter or a Kaggle thing (widget?).

[Screenshot 2023-09-27 at 3 14 01 PM]

JohnTigue commented 1 year ago

2020, Identify and monitor NVIDIA GPU usage in Kaggle notebooks:

(EDIT — Though initially, I was not able to use nvidia-smi inside a Kaggle kernel, later on, I found an alternative path from where it could be used and updated my kaggle notebook to show that. This post, however, is still relevant for anyone interested to know about a python package using which we can programmatically identify and monitor GPU usage.)

Seemingly the other solution involves pynvml as per Kaggle: pynvml module to identify and monitor GPU usage.
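
For reference, a programmatic check with pynvml could look roughly like the following. This is a hedged sketch rather than the code from the linked Kaggle post: format_gpu_line and gpu_status_lines are names invented here, and the sketch assumes the pynvml package is installed and an NVIDIA driver is present.

```python
def format_gpu_line(index, name, util_pct, mem_used_bytes, mem_total_bytes, temp_c):
    """Render one GPU's readings as a single status line."""
    mib = 1024 * 1024
    return (
        f"GPU {index} {name}: {util_pct}% util, "
        f"{mem_used_bytes // mib}/{mem_total_bytes // mib} MiB, {temp_c} C"
    )


def gpu_status_lines():
    """Query every visible GPU through NVML and return one line per device."""
    # Third-party package; imported lazily so the formatter above can be
    # used (and tested) on machines without a GPU.
    import pynvml

    pynvml.nvmlInit()
    try:
        lines = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older pynvml releases return bytes
                name = name.decode()
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            temp = pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU
            )
            lines.append(
                format_gpu_line(i, name, util.gpu, mem.used, mem.total, temp)
            )
        return lines
    finally:
        pynvml.nvmlShutdown()
```

The same readings could presumably feed a Jupyter widget later, since they arrive as plain Python values rather than terminal output.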

JohnTigue commented 1 year ago

This looks really pretty: nvitop. (10 GPUs!)

[Screenshot: https://user-images.githubusercontent.com/16078332/171005261-1aad126e-dc27-4ed3-a89b-7f9c1c998bf7.png]

JohnTigue commented 1 year ago

SageMaker has "SageMaker Debugger", which has a nice UI: Monitor the system resource utilization using SageMaker Studio.