StormSurgeLive / asgs-mon

Vigilant watchdog of ASGS.

asgs-mon next steps #1

Open wwlwpd opened 3 months ago

wwlwpd commented 3 months ago

Integration with asgs_main.sh

# quote the bracket pattern so the shell cannot glob-expand it before grep sees it
ps aux | grep '[a]sgs_main.sh' | grep "$ASGS_CONFIG" | awk '{ print $2 }'
# snippet of my proposed $ASGS_CONF, with some context variables to demonstrate what I am copying
OPENDAPPOST=opendap_post2.sh
POSTPROCESS=( includeWind10m.sh createOPeNDAPFileList.sh $OPENDAPPOST )
ASGSMON=( 000-asgs_main-pid-check 001-instance-status-check 002-hook-status-check 003-syslog-progress 004-instance-status-progress 005-hook-status-progress 006-rundir-du 007-failed-dir )
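
For illustration, here is a minimal sketch of how a runner could walk the ASGSMON list in order. It assumes each entry names an executable in a plugin directory and that a nonzero exit status means the check failed. Both are assumptions, and `run_checks` and `PLUGIN_DIR` are hypothetical names, not the actual asgs-mon interface.

```shell
# Hypothetical runner: execute each named check from a plugin directory,
# in the order given. Returns 0 only if every check exists and exits 0.
# Usage: run_checks PLUGIN_DIR CHECK_NAME...
run_checks() {
  local dir="$1" name failed=0
  shift
  for name in "$@"; do
    if [ ! -x "$dir/$name" ]; then
      echo "check not found: $name" >&2
      failed=1
    elif ! "$dir/$name"; then
      echo "check failed: $name" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Assumed call site, once the config above has been sourced:
# run_checks "$PLUGIN_DIR" "${ASGSMON[@]}"
```

The numeric prefixes (000-, 001-, ...) keep the order readable in the config, and the runner simply follows the array order.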

Triage the following issues as they relate:

Additional plugins to create:

Future check ideas:

Additional asgs_main.sh integration could be:

  1. start asgs-mon with run command, pass directly --pid $$
  2. fork and detach from parent process so that asgs-mon doesn't go away
  3. output a log that can be tail'd for output
  4. ASGSH command that can tail -f this output log so it can be observed at will
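
Items 1-3 in this list could be sketched roughly as follows. This assumes asgs-mon is on the PATH and accepts `--pid` as proposed here. `start_monitor` and the log location are hypothetical.

```shell
# Hypothetical helper: start a command detached from the calling shell and
# append all of its output to a log file. Prints the PID of the new process.
# Usage: start_monitor LOGFILE COMMAND [ARGS...]
start_monitor() {
  local log="$1"
  shift
  # nohup makes the child ignore SIGHUP so it survives the parent going away;
  # stdin comes from /dev/null, and stdout and stderr are appended to the log.
  nohup "$@" >>"$log" 2>&1 </dev/null &
  echo $!
}

# Assumed use from inside asgs_main.sh (log path is a guess):
# MON_PID=$(start_monitor "$HOME/asgs-mon.log" asgs-mon --pid $$)
#
# Item 4 is then just:
# tail -f "$HOME/asgs-mon.log"
```

Appending to the log rather than truncating it means a restarted monitor does not wipe out earlier output.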

wwlwpd commented 3 months ago

Just adding a note here from my experience using the monitor.

  1. each asgs-mon instance should be tied to one asgs_main.sh process
  2. it can guess the PID (the usual case) or can be given a --pid to watch
  3. if the ASGS_PID goes away (asgs-mon should actually watch for this), then it should pause until one is found (it can check periodically)
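
A rough sketch of the pause-until-found behavior in item 3, assuming the monitor runs as the same user as asgs_main.sh (otherwise `kill -0` can fail on a live process). `pid_alive`, `guess_asgs_pid`, and `wait_for_asgs` are hypothetical names.

```shell
# Succeeds while the given PID exists. kill -0 delivers no signal, it only
# checks that the process is there and can be signaled by this user.
pid_alive() {
  [ -n "$1" ] && kill -0 "$1" 2>/dev/null
}

# Guess the asgs_main.sh PID for a config file, using the ps pipeline from
# the first comment. Prints the first match, or nothing if none is running.
guess_asgs_pid() {
  ps aux | grep '[a]sgs_main.sh' | grep -F -- "$1" | awk '{ print $2 }' | head -n 1
}

# Block until an asgs_main.sh for CONFIG is running, polling every INTERVAL
# seconds (default 60), then print its PID.
# Usage: wait_for_asgs CONFIG [INTERVAL]
wait_for_asgs() {
  local config="$1" interval="${2:-60}" pid
  until pid=$(guess_asgs_pid "$config") && pid_alive "$pid"; do
    sleep "$interval"
  done
  echo "$pid"
}

# Assumed shape of the main loop:
# ASGS_PID=$(wait_for_asgs "$ASGS_CONFIG")
# while pid_alive "$ASGS_PID"; do ...run checks...; sleep 60; done
# ...then go back to wait_for_asgs instead of continuing blindly.
```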

The issue I am finding is that asgs-mon happily keeps running after the ASGS_PID goes away, which is pretty useless.

wwlwpd commented 3 months ago

Another thought, after experiencing this for a few days now and monitoring 4 different systems: it'd be nice to get one summary email per notification window (1-3 hours) with a subject line like:

$HPCENVSHORT $PROFILE: M Critical, N Warnings, P Notifications, Q Unknowns

... enumerated summary of warnings

Example,

subject: qbd HSOFS_nam: 1 Critical, 2 Warnings, 1 Notification

Body:

Summary,
...
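
A rough sketch of how the subject line could be assembled from the four counts. It assumes zero counts are left out (as the example leaves out Unknowns) and that nothing should be produced when every count is zero. `summary_subject` is a hypothetical name.

```shell
# Build "<HPCENVSHORT> <PROFILE>: <counts>" from four severity counts.
# Prints nothing and returns 1 when all counts are zero, so the caller can
# skip sending an email.
# Usage: summary_subject HPCENVSHORT PROFILE CRITICAL WARNING NOTIFICATION UNKNOWN
summary_subject() {
  local env="$1" profile="$2" parts="" label n
  shift 2
  for label in Critical Warning Notification Unknown; do
    n=${1:-0}
    [ "$#" -gt 0 ] && shift
    [ "$n" -gt 0 ] || continue
    # "Critical" is never pluralized in the template; the other labels take
    # an "s" unless the count is exactly 1.
    if [ "$label" != Critical ] && [ "$n" -ne 1 ]; then
      label="${label}s"
    fi
    parts="${parts:+$parts, }$n $label"
  done
  [ -n "$parts" ] || return 1
  echo "$env $profile: $parts"
}

# summary_subject qbd HSOFS_nam 1 2 1 0
# prints: qbd HSOFS_nam: 1 Critical, 2 Warnings, 1 Notification
```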

Also, decide up front that only actionable things get emails; e.g.,

  1. failed directories detected
  2. asgs_main.sh PID went away

I am also not finding the "still alive" heartbeat emails super useful.

wwlwpd commented 3 months ago

Another idea: make a plugin for asgs-lint.

wwlwpd commented 3 months ago

I am going to peel the monitor off into its own repo.