con/duct

A helper to run a command, capturing stdout/stderr and details about the run

research other possible existing solutions #2

Closed: yarikoptic closed this issue 2 months ago

yarikoptic commented 2 months ago
soichih commented 2 months ago

Hi @yarikoptic! Thanks for the shout-out! :)

It's been a while since I've worked on brainlife, so things might have changed, but for all the jobs executed via brainlife, it starts a service called smon ("service monitor" or "session monitor"). This script runs the ps command periodically to gather runtime information about all the processes running under the same session ID (`ps -s <session id>`). Please see https://github.com/brainlife/abcd-spec/blob/master/hooks/smon#L144. It writes the information out to a local file, and brainlife's amaretti later picks it up and uses it for visualization purposes.

Some PBS/slurm/condor clusters have a monitoring plugin installed, but the information was non-homogeneous and challenging to normalize across different systems, when it was available at all. smon is simple, yet it can run on pretty much any system. A job can spawn a bunch of different child processes, so capturing the entire process session is important for monitoring the entire job execution.
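
In outline, the session-polling approach looks something like this (a minimal sketch with a made-up interval and output format, not the real smon script linked above):

```bash
#!/usr/bin/env bash
# Sketch of session-based polling: snapshot every process that shares
# this script's session ID. Interval and output format are arbitrary.
sid=$(ps -p $$ -o sid=)
out=_smon_sketch.out
while true; do
    date +%s >> "$out"
    ps -s "$sid" -o pid,pcpu,pmem,rss,etime,comm >> "$out"
    sleep 5
done
```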

For visualization, the code to display the smon output is built into brainlife's (warehouse) web UI, so I am not sure if you are interested in using that. All it does is display the smon output in a simple time-series chart.

Cheers!

asmacdo commented 2 months ago

time

Possibly the simplest option, `time` can collect most of the metrics we might need. Here's a list of metrics I thought might be interesting, from `man time`:

Time:
- Elapsed real time (in [hours:]minutes:seconds).
- Total number of CPU-seconds that the process spent in kernel mode.
- Total number of CPU-seconds that the process spent in user mode.

Memory:
- Maximum resident set size of the process during its lifetime, in Kbytes.
- Average total (data+stack+text) memory use of the process, in Kbytes.
- Number of times the process was swapped out of main memory.

I/O:
- Number of filesystem inputs by the process.
- Number of filesystem outputs by the process.
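
Assuming GNU time (the external `/usr/bin/time`, not the shell builtin, which only reports real/user/sys), each of those metrics has a format specifier, e.g.:

```bash
# GNU time: %e/%S/%U cover the time metrics, %M/%K/%W the memory
# metrics, and %I/%O the filesystem I/O metrics listed above.
/usr/bin/time -f 'real=%es kernel=%Ss user=%Us maxrss=%MkB avgmem=%KkB swaps=%W fs_in=%I fs_out=%O' \
    sleep 1
```
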
yarikoptic commented 2 months ago

FWIW, here is our attempt to play with smon by starting a new session (a bash process under a new SID) to run smon and the target computation:

```bash
ps -p $$ -o sid=; rm -f _smon.out; setsid bash -c 'ps -p $$ -o sid=; /home/yoh/proj/misc/abcd-spec/hooks/smon & jobs ; ./abandoning_parent.sh 5 ./consume_mem.py 1020 2 ; kill %1'
```

We might need to kill it more gently, or make it actually react to the signal (SIGTERM rather than SIGKILL, which cannot be caught) and dump its stats before exiting.
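
A graceful shutdown could look something like this sketch (hypothetical; not what smon currently does):

```bash
#!/usr/bin/env bash
# Sketch: dump a final snapshot when the monitor receives TERM/INT.
# SIGKILL can never be trapped, so the kill has to be gentle.
out=_smon.out

dump_and_exit() {
    echo "=== final snapshot $(date +%s) ===" >> "$out"
    ps -s "$(ps -p $$ -o sid=)" -o pid,pcpu,pmem,rss,etime,comm >> "$out"
    exit 0
}
trap dump_and_exit TERM INT

while true; do
    sleep 5 &
    wait $!    # 'wait' is interruptible, so the trap fires promptly
done
```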

yarikoptic commented 2 months ago

@satra @djarecka -- could you point me to how you monitor the underlying process runs in nipype (and possibly pydra, if you do)?

djarecka commented 2 months ago

Unfortunately we don't have it yet in pydra. In nipype you can check the profiler, but I haven't played with it recently.

satra commented 2 months ago

Also in pydra, using the audit mechanism: https://github.com/nipype/pydra/blob/master/pydra/utils/profiler.py

djarecka commented 2 months ago

Yes, I forgot that some of that info can also be saved during runtime in pydra...

yarikoptic commented 2 months ago

Looks quite good and close to what we are pursuing here! Besides that, I am not sure (worth checking) whether it would properly track processes started in a new shell, as done in https://github.com/con/duct/blob/main/abandoning_parent.sh#L15 -- that is why we thought brainlife's smon approach of tracking by session is a good one!
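
To illustrate the concern (a hypothetical sketch, similar in spirit to abandoning_parent.sh but not its actual contents):

```bash
# The inner bash exits immediately, so the sleep is reparented to
# PID 1: walking the parent-child tree from the job's top PID misses
# it. It keeps the original session ID, though, so ps -s still sees it.
bash -c 'sleep 60 &'
ps -s "$(ps -p $$ -o sid=)" -o pid,ppid,sid,comm
```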

asmacdo commented 2 months ago

I think we can close this one, but it is always good to hear about other solutions, so feel free to reopen if you've got one to discuss :)