Closed yarikoptic closed 2 months ago
Hi @yarikoptic! Thanks for the shout-out. :)
It's been a while since I've worked on brainlife, so things might have changed, but every job executed via brainlife starts a service called smon ("service monitor" or "session monitor"). This script runs the ps command periodically to gather runtime information about all the processes running under the same session ID (ps -s <session id>). Please see https://github.com/brainlife/abcd-spec/blob/master/hooks/smon#L144. It writes the information to a local file, and brainlife's amaretti later picks it up and uses it for visualization. Some PBS/slurm/condor clusters have a monitoring plugin installed, but the setups were non-homogeneous, and it was challenging to normalize the information across different systems, if it was available at all. smon is simple, yet it can run on pretty much any system. A job can spawn a bunch of different child processes, so capturing the entire process session is important for monitoring the entire job execution.
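For readers who want to try the session-based idea outside of brainlife, here is a minimal Python sketch of the same approach. It is not smon's actual code: the function names, the chosen ps columns, and the sampling loop are all assumptions made for illustration, and it assumes a Linux (procps) ps that supports selecting by session with -s.

```python
import os
import subprocess
import time

# Hypothetical column choice; each "-o name=" requests a column with an
# empty header, so ps prints no header line at all.
PS_COLUMNS = ["pid", "ppid", "pcpu", "rss", "comm"]


def parse_ps_output(text):
    """Turn header-less ps output into a list of per-process dicts."""
    records = []
    for line in text.splitlines():
        fields = line.split(None, 4)  # keep the command name in one piece
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        pid, ppid, pcpu, rss, comm = fields
        records.append({
            "pid": int(pid),
            "ppid": int(ppid),
            "pcpu": float(pcpu),
            "rss_kb": int(rss),
            "comm": comm.strip(),
        })
    return records


def sample_session(sid=None):
    """Run `ps -s <sid>` once and return the parsed records."""
    if sid is None:
        sid = os.getsid(0)  # session ID of the calling process
    cmd = ["ps", "-s", str(sid)]
    for column in PS_COLUMNS:
        cmd += ["-o", column + "="]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return parse_ps_output(result.stdout)


def monitor(sid, interval=1.0, samples=5):
    """Yield (timestamp, total RSS in KiB, total %CPU) at a fixed interval."""
    for _ in range(samples):
        procs = sample_session(sid)
        yield (time.time(),
               sum(p["rss_kb"] for p in procs),
               sum(p["pcpu"] for p in procs))
        time.sleep(interval)
```

Calling list(monitor(os.getsid(0), interval=0.5, samples=3)) would give three aggregate samples for the current session. Because selection is by session ID rather than by parent PID, children that get reparented should still be counted as long as they stay in the same session.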
For visualization, the code that displays the smon output is built into brainlife's (warehouse) web UI, so I am not sure whether you would be interested in using it. All it does is display the smon output in a simple timeseries chart.
Cheers!
FWIW, here is our attempt to play with smon: we start a bash process under a new SID and use it to run both smon and the target computation:
ps -p $$ -o sid=; rm -f _smon.out; setsid bash -c 'ps -p $$ -o sid=; /home/yoh/proj/misc/abcd-spec/hooks/smon & jobs ; ./abandoning_parent.sh 5 ./consume_mem.py 1020 2 ; kill %1'
We might need to kill it more gently, or make it actually react to the kill signal (kill sends SIGTERM by default) and dump its stats.
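The same experiment can be sketched from Python, which may be easier to fold into a test. This is only an illustration under stated assumptions: run_in_new_session and stop_gently are hypothetical helper names, not part of smon or duct. The sketch relies on two facts: Popen's start_new_session=True calls setsid() in the child, making it a session leader whose SID equals its PID, and SIGTERM can be handled by the target (so a monitor could flush its stats) whereas SIGKILL cannot.

```python
import subprocess


def run_in_new_session(cmd, **popen_kwargs):
    """Start cmd as the leader of a fresh session; return (process, sid).

    The child calls setsid() before exec, so its session ID equals its PID.
    """
    proc = subprocess.Popen(cmd, start_new_session=True, **popen_kwargs)
    return proc, proc.pid


def stop_gently(proc, grace=5.0):
    """Send SIGTERM, wait up to `grace` seconds, then fall back to SIGKILL.

    Returns the exit status of the process (negative signal number if it
    was terminated by a signal).
    """
    proc.terminate()  # SIGTERM: the target may catch it and clean up
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: cannot be caught, so no cleanup happens
        return proc.wait()
```

The returned sid is what one would pass to ps -s (or to a session sampler) to watch the target and everything it spawns within that session.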
@satra @djarecka -- could you point me to how you monitor the underlying process runs in nipype (and possibly in pydra, if you do)?
Unfortunately, we don't have it yet in pydra. In nipype you can check the profiler, but I haven't played with it recently.
Also in pydra, using the audit mechanism: https://github.com/nipype/pydra/blob/master/pydra/utils/profiler.py
Yes, I forgot that some of the info could also be saved at runtime in pydra...
Looks quite good and close to what we are pursuing here! That said, I am not sure (worth checking) whether it would properly track processes started in a new shell, as done in https://github.com/con/duct/blob/main/abandoning_parent.sh#L15, which is where we thought that brainlife's smon approach of tracking by session is a good one!
I think we can close this one, but it is always good to hear about other solutions, so feel free to reopen if you've got one to discuss :)