Closed hosseinsarshar closed 1 year ago
@classicboyir Hi, thanks for the feedback.
I wonder if there is a way to run a process in background to collect the metrics at a certain interval let's say 5 seconds, during the lifespan of a training job?
I think this is a good use case, and I would like to add it to nvitop. It can be achieved by running the collector in a separate thread with a callback function, like:
import time
import threading

from nvitop import ResourceMetricCollector


def collect_in_background(
    on_collect,
    collector=None,
    interval=None,
    *,
    on_start=None,
    on_stop=None,
    tag='metrics-daemon',
    start=True,
):
    if collector is None:
        collector = ResourceMetricCollector()
    if interval is None:
        interval = collector.interval
    interval = min(interval, collector.interval)

    def target():
        if on_start is not None:
            on_start(collector)
        try:
            with collector(tag):
                try:
                    while on_collect(collector.collect()):
                        time.sleep(interval)
                except KeyboardInterrupt:
                    pass
        finally:
            if on_stop is not None:
                on_stop(collector)

    daemon = threading.Thread(target=target, daemon=True)
    if start:
        daemon.start()
    return daemon


def main():
    logger = ...

    def on_collect(metrics):
        if logger.is_closed():  # closed manually by user
            return False
        logger.log(metrics)
        return True

    def on_stop(collector):
        if not logger.is_closed():
            logger.close()  # cleanup

    background_collector = ResourceMetricCollector()
    collect_in_background(on_collect, background_collector, interval=5.0, on_stop=on_stop)

    # Use a separate collector for foreground
    # otherwise it will mess with the 'metrics-daemon' tag
    foreground_collector = ResourceMetricCollector()
    for epoch in range(100):
        with foreground_collector('epoch'):
            # Do something
            for batch in range(100):
                with foreground_collector('batch'):
                    # Do something
                    pass
You can define on_collect to do whatever you need, such as logging the result to a logger or just appending it to a list:
lst = []

def on_collect(metrics):
    lst.append(metrics)
    return True
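For anyone who wants to try the callback contract without a GPU, here is a minimal sketch. It is not nvitop's implementation: `fake_collect` and `poll_in_background` are hypothetical stand-ins that mimic the loop in collect_in_background above, and the callback stops the thread by returning False after three samples.

```python
import threading
import time


def fake_collect():
    # Hypothetical stand-in for collector.collect(); returns a dict of metrics.
    return {'timestamp': time.monotonic()}


def poll_in_background(on_collect, collect_fn, interval):
    # Mimics the loop from the thread: keep collecting while on_collect returns True.
    def target():
        while on_collect(collect_fn()):
            time.sleep(interval)

    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    return thread


samples = []
samples_lock = threading.Lock()


def on_collect(metrics):
    # The callback runs in the background thread, so guard shared state.
    with samples_lock:
        samples.append(metrics)
        return len(samples) < 3  # returning False ends the loop


thread = poll_in_background(on_collect, fake_collect, interval=0.01)
thread.join(timeout=5.0)
```

The lock only matters if the main thread reads `samples` while collection is still running; for a list that is inspected after the job ends, a plain append as in the snippet above is enough.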
Love it, thanks for the quick response and look forward to seeing it being natively supported.
@classicboyir Hi, I created PR #48 to resolve this. Could you try:
pip3 install git+https://github.com/XuehaiPan/nvitop.git@collector-daemon
and share your experience with it? Then we can get it merged and released. Thanks!
Thanks for the update, @XuehaiPan. I gave this a try; I love it and it works as expected. I do have a suggestion on the design of the method.
I think it'd be better to define collect_in_background as a method of the ResourceMetricCollector class (using something like begin_collecting_in_background as the method name) and call it like this:
collector = ResourceMetricCollector(interval=5.0)
daemon = collector.begin_collecting_in_background(on_collect, on_stop=on_stop)
Instead of taking a ResourceMetricCollector object as an argument, it uses self as the collector, so begin_collecting_in_background might only need these parameters:
def begin_collecting_in_background(
    self,
    on_collect,
    on_start=None,
    on_stop=None,
    tag='',
) -> threading.Thread:
    ...
And you don't need the start parameter, because calling begin_collecting_in_background already signals the intention to start the background thread. Similarly, interval could be eliminated, since the method can use the interval of the ResourceMetricCollector instance. Finally, it would return the daemon thread object so that the client can manage the thread.
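To make the proposal concrete, here is a rough sketch of the method-based design. `SketchCollector` is a hypothetical stand-in class, not nvitop's ResourceMetricCollector, and its `collect()` just returns a counter so the example runs anywhere.

```python
import threading
import time


class SketchCollector:
    """Hypothetical stand-in for illustration; not nvitop's ResourceMetricCollector."""

    def __init__(self, interval=5.0):
        self.interval = interval
        self.count = 0

    def collect(self):
        # Fake metrics: a running sample counter instead of real device stats.
        self.count += 1
        return {'sample': self.count}

    def begin_collecting_in_background(self, on_collect, on_start=None, on_stop=None, tag=''):
        def target():
            if on_start is not None:
                on_start(self)
            try:
                # self.interval replaces the separate interval argument.
                while on_collect(self.collect()):
                    time.sleep(self.interval)
            finally:
                if on_stop is not None:
                    on_stop(self)

        daemon = threading.Thread(target=target, name=tag or None, daemon=True)
        daemon.start()  # always starts, so no `start` parameter is needed
        return daemon


samples = []
log = []


def on_collect(metrics):
    samples.append(metrics)
    return len(samples) < 2  # stop after two samples


collector = SketchCollector(interval=0.01)
daemon = collector.begin_collecting_in_background(
    on_collect,
    on_start=lambda c: log.append('start'),
    on_stop=lambda c: log.append('stop'),
)
daemon.join(timeout=5.0)
```

Returning the thread lets the caller join it or check `is_alive()`, which is the "client manages the thread" part of the suggestion.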
@classicboyir Thanks for the advice. I added a new shortcut method daemonize to the ResourceMetricCollector class:
from nvitop import ResourceMetricCollector

collector = ResourceMetricCollector(...)
collector.daemonize(on_collect_fn, interval=interval, on_start=on_start, on_stop=on_stop)

This is equivalent to:

from nvitop import ResourceMetricCollector, collect_in_background

collector = ResourceMetricCollector(...)
collect_in_background(on_collect_fn, collector, interval=interval, on_start=on_start, on_stop=on_stop)

but requires fewer imports.
And you don't need the start parameter as when you call the begin_collecting_in_background function the intention is to start the background thread. Similarly, interval could be eliminated as it grabs the interval parameter of the ResourceMetricCollector class.
As for the on_start parameter, I think users may want to look up collector.devices or some other attributes at start-up. This method not only initializes the collector but also does some necessary jobs on start.
For the interval argument, if you omit it or pass interval=None, it will use collector.interval.
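Going by the collect_in_background helper posted earlier in this thread, the interval handling can be restated as a small standalone function. The name `resolve_interval` is made up for illustration, and the released version may differ from that snippet.

```python
def resolve_interval(interval, collector_interval):
    # Fall back to the collector's own interval when none is given.
    if interval is None:
        interval = collector_interval
    # As in the snippet above, never sleep longer than the collector's interval.
    return min(interval, collector_interval)


print(resolve_interval(None, 5.0))  # 5.0
print(resolve_interval(1.0, 5.0))   # 1.0
print(resolve_interval(10.0, 5.0))  # 5.0
```

Note the last case: under that logic, a requested interval larger than the collector's interval is clamped down to the collector's interval.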
This feature is included in nvitop 0.10.2.
Thanks @XuehaiPan for adding this feature promptly. Would you also expose a function to stop the background thread when needed?
Would you also expose a function to stop the background thread when needed?
@classicboyir You can let the on_collect function return False to stop the thread. Also, since the thread is a daemon thread, it does not keep the main program from exiting; it is terminated automatically when the main thread finishes.
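One way to stop on demand is to have on_collect check a flag. The sketch below uses a `threading.Event` for that; the `target` loop is a hypothetical stand-in for the collection loop so the example runs without nvitop. With nvitop you would pass the same kind of on_collect to daemonize instead.

```python
import threading
import time

stop_requested = threading.Event()
collected = []


def on_collect(metrics):
    collected.append(metrics)
    # Returning False once the flag is set ends the collection loop.
    return not stop_requested.is_set()


def target():
    # Stand-in for the background collection loop (assumption, not nvitop code).
    while on_collect({'sample': len(collected)}):
        time.sleep(0.01)


daemon = threading.Thread(target=target, daemon=True)
daemon.start()

time.sleep(0.05)           # ... the training job would run here ...
stop_requested.set()       # ask the background thread to stop
daemon.join(timeout=5.0)   # wait for the last callback to return
```

The thread stops at its next callback, so the worst-case delay is roughly one collection interval.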
Hi @XuehaiPan,
In your examples that collect metrics using ResourceMetricCollector inside a training loop, collector.collect() takes a snapshot at each epoch/batch iteration, which misses the entire period between the previous and current iteration. If an iteration takes 5 minutes, we only get metrics at 5-minute intervals. I wonder if there is a way to run a process in background to collect the metrics at a certain interval let's say 5 seconds, during the lifespan of a training job?
Therefore, if the entire job takes 1 hour, a 5-second interval would give us 720 snapshots.
Thanks
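For reference, the snapshot count in the question is just the job duration divided by the sampling interval. A quick check (the function name is only illustrative):

```python
def expected_snapshots(duration_seconds, interval_seconds):
    # Number of whole sampling intervals that fit in the job duration.
    return int(duration_seconds // interval_seconds)


print(expected_snapshots(3600, 5))  # 720
```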