XuehaiPan / nvitop

An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
https://nvitop.readthedocs.io
Apache License 2.0

[Feature Request] Collect metrics in a fixed interval for the lifespan of a training job #47

Closed hosseinsarshar closed 1 year ago

hosseinsarshar commented 1 year ago

Hi @XuehaiPan,

In your examples that collect metrics with ResourceMetricCollector inside a training loop, collector.collect() takes a snapshot at each epoch/batch iteration, which misses the entire period between the previous and current iteration. If an iteration takes 5 minutes, we only get metrics at 5-minute intervals.

I wonder if there is a way to run a process in background to collect the metrics at a certain interval let's say 5 seconds, during the lifespan of a training job?

For example, if the entire job took 1 hour, a 5-second interval would yield 720 snapshots.

Thanks

XuehaiPan commented 1 year ago

@classicboyir Hi, thanks for the feedback.

I wonder if there is a way to run a process in background to collect the metrics at a certain interval let's say 5 seconds, during the lifespan of a training job?

I think this would be a good use case and I would like to add it to nvitop. It can be achieved by running the collector in a separate thread with a callback function, like:

import time
import threading

from nvitop import ResourceMetricCollector

def collect_in_background(
    on_collect,
    collector=None,
    interval=None,
    *,
    on_start=None,
    on_stop=None,
    tag='metrics-daemon',
    start=True,
):
    if collector is None:
        collector = ResourceMetricCollector()
    if interval is None:
        interval = collector.interval
    interval = min(interval, collector.interval)

    def target():
        if on_start is not None:
            on_start(collector)
        try:
            with collector(tag):
                try:
                    while on_collect(collector.collect()):
                        time.sleep(interval)
                except KeyboardInterrupt:
                    pass
        finally:
            if on_stop is not None:
                on_stop(collector)

    daemon = threading.Thread(target=target, daemon=True)
    if start:
        daemon.start()
    return daemon


def main():
    logger = ...

    def on_collect(metrics):
        if logger.is_closed():  # closed manually by user
            return False
        logger.log(metrics)
        return True

    def on_stop(collector):
        if not logger.is_closed():
            logger.close()  # cleanup

    background_collector = ResourceMetricCollector()
    collect_in_background(on_collect, background_collector, interval=5.0, on_stop=on_stop)

    # Use a separate collector for foreground
    # otherwise it will mess with the 'metrics-daemon' tag
    foreground_collector = ResourceMetricCollector()

    for epoch in range(100):
        with foreground_collector('epoch'):
            # Do something
            for batch in range(100):
                with foreground_collector('batch'):
                    # Do something
                    pass

You can define your own on_collect callback, for example one that logs the results with a logger or simply appends them to a list:

lst = []

def on_collect(metrics):
    lst.append(metrics)
    return True

hosseinsarshar commented 1 year ago

Love it, thanks for the quick response and look forward to seeing it being natively supported.

XuehaiPan commented 1 year ago

@classicboyir Hi, I created PR #48 to resolve this. Could you try:

pip3 install git+https://github.com/XuehaiPan/nvitop.git@collector-daemon

and share your experience with it? Then we can get it merged and released. Thanks!

hosseinsarshar commented 1 year ago

Thanks for the update, @XuehaiPan. I gave this a try; I love it, and it works as expected. I do have a suggestion on the design of the method.

I think it'd be better to define collect_in_background as a member of the ResourceMetricCollector class (with a name like begin_collecting_in_background) and call it like this:

collector = ResourceMetricCollector(interval=5.0)
daemon = collector.begin_collecting_in_background(on_collect, on_stop=on_stop)

Instead of passing a ResourceMetricCollector object, it uses self as the collector and might just need these parameters in the begin_collecting_in_background function:

def begin_collecting_in_background(
        self,
        on_collect,
        on_start=None,
        on_stop=None,
        tag='') -> threading.Thread:
    ...

And you don't need the start parameter, since calling begin_collecting_in_background already signals the intention to start the background thread. Similarly, interval could be eliminated, as the method can use the interval of the ResourceMetricCollector instance. Finally, it would return the daemon thread object so that the client can manage the thread and stop the job.

XuehaiPan commented 1 year ago

@classicboyir Thanks for the advice. I added a new shortcut method, daemonize, to the ResourceMetricCollector class:

from nvitop import ResourceMetricCollector

collector = ResourceMetricCollector(...)
collector.daemonize(on_collect_fn, interval=interval, on_start=on_start, on_stop=on_stop)

It is equivalent to:

from nvitop import ResourceMetricCollector, collect_in_background

collector = ResourceMetricCollector(...)
collect_in_background(on_collect_fn, collector, interval=interval, on_start=on_start, on_stop=on_stop)

but has fewer imports.


And you don't need the start parameter as when you call the begin_collecting_in_background function the intention is to start the background thread. Similarly, interval could be eliminated as it grabs the interval parameter of the ResourceMetricCollector class.

As for the on_start parameter, I think users may want to look up collector.devices or some other attributes at start-up. This hook not only initializes the collector but can also do other necessary work on start.

For the interval argument, if you omit it or pass interval=None, it will use collector.interval.
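
The interval fallback can be sketched as below, mirroring the logic of the collect_in_background snippet posted earlier in this thread. The name resolve_interval is hypothetical and used only for illustration; the released implementation may differ in details such as argument validation.

```python
def resolve_interval(interval, collector_interval):
    """Pick the sleep interval for the background thread.

    Hypothetical helper mirroring the snippet from this thread:
    None falls back to the collector's own interval, and the result
    is capped at the collector's interval.
    """
    if interval is None:
        interval = collector_interval
    return min(interval, collector_interval)
```

Note that under this logic a requested interval longer than the collector's own interval is capped: resolve_interval(5.0, 1.0) gives 1.0. To sample every 5 seconds, the collector itself would presumably need to be created with interval=5.0 as well.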

XuehaiPan commented 1 year ago

This feature is included in nvitop 0.10.2.

hosseinsarshar commented 1 year ago

Thanks @XuehaiPan for adding this feature promptly. Would you also expose a function to stop the background thread when needed?

XuehaiPan commented 1 year ago

Would you also expose a function to stop the background thread when needed?

@classicboyir You can let the on_collect function return False to stop the thread. Also, the thread is a daemon thread, so you can kill it anyway without breaking the main thread.
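
One way to wire up this stop mechanism, as a minimal sketch: the callback checks a threading.Event and returns False once the event is set. The helper name make_stoppable_callback and the snapshots list are hypothetical and not part of nvitop; only the contract that on_collect returns False to stop the thread comes from this discussion.

```python
import threading


def make_stoppable_callback(sink):
    """Build an on_collect callback that stops once `stop_event` is set.

    `make_stoppable_callback` is a hypothetical helper, not an nvitop API.
    `sink` is any callable that consumes one metrics snapshot.
    """
    stop_event = threading.Event()

    def on_collect(metrics):
        if stop_event.is_set():
            return False  # tells the background loop to exit
        sink(metrics)
        return True  # keep collecting

    return on_collect, stop_event


# Store every snapshot in a plain list; any other sink would work too.
snapshots = []
on_collect, stop_event = make_stoppable_callback(snapshots.append)
```

With nvitop installed, the callback could then be passed as collector.daemonize(on_collect, interval=5.0), and calling stop_event.set() from the main thread would end the collection. Because the loop only consults the callback after each sleep, the stop may lag by up to one interval.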