Closed reidsunderland closed 4 months ago
Using the sr configuration language is easy for other tools to adopt, because sr3 dump dumps the entire config as JSON. So someone building a Columbo-like tool could either read in the output of dump, or they might be able to use the sr_GlobalState class from sr.py... haven't looked into how to do that, so it might mean some mods. But in principle, building things that understand the config of an entire node should not be hard.
OK, got a first patch that writes the metrics, they look like this:
fractal% more sarra_download_f20_01.json
{"flow": {"stop_requested": false, "last_housekeeping": 1702684097.2077858, "transferConnected": false, "transferConnectStart": 0, "transferConnectTime": 0, "transferRxBytes": 256603, "transferTxBytes": 0, "transf
erRxFiles": 87, "transferTxFiles": 0, "next_housekeeping": 1702684157.2077858, "transferRxLast": "20231215T234817.66847229", "msgRate": 8.631587419459613, "current_sleep": 0.1}, "post.message": {"connected": false
, "disconnectLast": 1702684098.7549658, "disconnectTime": 0.0032203197479248047, "disconnectCount": 2, "rxByteCount": 0, "rxGoodCount": 0, "rxBadCount": 0, "txByteCount": 50985, "txGoodCount": 91, "txBadCount": 1}
, "gather.message": {"connected": false, "disconnectLast": 1702684097.7512178, "disconnectTime": 1.0048370361328125, "disconnectCount": 1, "rxByteCount": 98249, "rxGoodCount": 175, "rxBadCount": 0, "txByteCount":
0, "txGoodCount": 0, "txBadCount": 0, "brokerQueuedMessageCount": 287}, "retry": {"msgs_in_download_retry": 0, "msgs_in_post_retry": 9}, "log": {"lagMax": 87.95920205116272, "lagTotal": 7636.8115911483765, "lagMes
sageCount": 87, "rejectCount": 88}, "sarracenia.transfer": {"byteRateInstant": 1159295.2049689442}}
The new metric here is: transferRxLast
fractal% more sender_tsource2send_f50_01.json
{"flow": {"stop_requested": false, "last_housekeeping": 1702684012.0278788, "transferConnected": true, "transferConnectStart": 1702684203.829592, "transferConnectTime": 189.70056891441345, "transferRxBytes": 0, "t
ransferTxBytes": 1009148, "transferRxFiles": 0, "transferTxFiles": 958, "transferTxLast": "20231215T234959.566806793", "msgRate": 5.057996349006494, "current_sleep": 3.2}, "post.message": {"connected": true, "disc
onnectLast": 1702684181.2908318, "disconnectTime": 0.02028059959411621, "disconnectCount": 4, "rxByteCount": 0, "rxGoodCount": 0, "rxBadCount": 0, "txByteCount": 834602, "txGoodCount": 820, "txBadCount": 2}, "gath
er.message": {"connected": true, "disconnectLast": 1702684149.1651616, "disconnectTime": 97.30984234809875, "disconnectCount": 2, "rxByteCount": 909723, "rxGoodCount": 958, "rxBadCount": 0, "txByteCount": 0, "txGo
odCount": 0, "txBadCount": 0, "brokerQueuedMessageCount": 38}, "retry": {"msgs_in_download_retry": 0, "msgs_in_post_retry": 138}, "log": {"lagMax": 97.46203708648682, "lagTotal": 9851.86624956131, "lagMessageCount
": 958, "rejectCount": 0}, "sarracenia.transfer": {"byteRateInstant": 1588481.089361702}}
fractal%
in there is:
"transferTxLast": "20231215T234959.566806793"
There is now a "Last" column indicating the last time something was transferred (as a difference in time from the present, in seconds.)
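That Last column can be computed directly from the timestamps in the metrics files. A minimal sketch (assuming the timestamps are UTC, which matches the epoch values in the same dumps; note the fractional seconds can be longer than the six digits that strptime's %f accepts, so they are parsed separately):

```python
import time
from datetime import datetime, timezone

def sr3_time_to_epoch(ts: str) -> float:
    """Convert an sr3 timestamp like '20231215T234817.66847229' to POSIX
    epoch seconds. The fraction is handled separately because it may
    exceed the six digits allowed by strptime's %f directive."""
    whole, _, frac = ts.partition(".")
    dt = datetime.strptime(whole, "%Y%m%dT%H%M%S").replace(tzinfo=timezone.utc)
    return dt.timestamp() + (float("0." + frac) if frac else 0.0)

def seconds_since(ts: str, now: float = None) -> float:
    """Age of the last transfer: the value shown in the Last column."""
    return (now if now is not None else time.time()) - sr3_time_to_epoch(ts)
```

For the sarra_download_f20 dump above, sr3_time_to_epoch("20231215T234817.66847229") lands within a second of that file's last_housekeeping epoch value, which is what suggests the timestamps are UTC.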
fractal% sr3 status
status:
Component/Config Processes Connection Lag Rates
State Run Retry msg data Queued LagMax LagAvg Last %rej pubsub messages RxData TxData
----- --- ----- --- ---- ------ ------ ------ ---- ---- -------- ---- ------ ------
cpost/veille_f34 run 1/1 0 100% 0% 0 0.00s 0.00s 0.0s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
cpump/pelle_dd1_f04 stop 0/0 0 100% 0% 0 0.00s 0.00s 0.0s 79.2% 0 Bytes/s 8 msgs/s 0 Bytes/s 0 Bytes/s
cpump/pelle_dd2_f05 stop 0/0 0 100% 0% 0 0.00s 0.00s 0.0s 79.2% 0 Bytes/s 8 msgs/s 0 Bytes/s 0 Bytes/s
cpump/xvan_f14 hung 1/1 0 100% 0% 0 0.00s 0.00s 0.0s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
cpump/xvan_f15 hung 1/1 0 100% 0% 0 0.00s 0.00s 0.0s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
poll/f62 run 1/1 0 100% 0% 0 0.42s 0.15s 0.0s 0.0% 1.4 KiB/s 0 msgs/s 0 Bytes/s 0 Bytes/s
post/shim_f63 stop 0/0 - - - - - - -
post/test2_f61 stop 0/0 0 100% 0% 0 0.07s 0.06s 0.0s 0.0% 15.8 KiB/s 0 msgs/s 0 Bytes/s 0 Bytes/s
sarra/download_f20 run 3/3 0 100% 0% 0 11.25s 3.54s 0.4s 0.0% 2.0 KiB/s 1 msgs/s 561.8 KiB/s 0 Bytes/s
sender/tsource2send_f50 run 10/10 0 100% 7% 0 1.71s 0.68s 2.0s 0.0% 7.4 KiB/s 6 msgs/s 0 Bytes/s 81.2 KiB/s
shovel/pclean_f90 run 3/3 610 100% 0% 0 81.27s 75.74s 0.0s 6.0% 2.5 KiB/s 1 msgs/s 0 Bytes/s 0 Bytes/s
shovel/pclean_f92 run 3/3 0 100% 0% 0 137.20s 82.64s 0.0s 0.0% 5.5 KiB/s 3 msgs/s 0 Bytes/s 0 Bytes/s
shovel/rabbitmqtt_f22 hung 3/3 0 100% 0% 0 2.21s 0.78s 0.0s 0.0% 7.6 KiB/s 6 msgs/s 0 Bytes/s 0 Bytes/s
shovel/t_dd1_f00 run 3/3 0 100% 0% 2 11.94s 2.61s 0.0s 80.1% 2.7 KiB/s 10 msgs/s 0 Bytes/s 0 Bytes/s
shovel/t_dd2_f00 run 3/3 0 100% 0% 1 11.68s 2.62s 0.0s 80.1% 2.7 KiB/s 10 msgs/s 0 Bytes/s 0 Bytes/s
subscribe/amqp_f30 run 3/3 0 100% 0% 0 15.91s 5.01s 0.3s 0.0% 1.5 KiB/s 1 msgs/s 151.8 KiB/s 0 Bytes/s
subscribe/cclean_f91 run 3/3 265 100% 0% 0 0.00s 0.00s 0.0s 100.0% 33 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
subscribe/cdnld_f21 run 3/3 0 100% 0% 0 37.46s 7.35s 55.6s 0.3% 964 Bytes/s 1 msgs/s 13.1 KiB/s 0 Bytes/s
subscribe/cfile_f44 run 3/3 0 100% 0% 0 1.90s 0.21s 1.6s 0.0% 1.1 KiB/s 3 msgs/s 1004 Bytes/s 0 Bytes/s
subscribe/cp_f61 run 3/3 0 100% 0% 0 13.90s 2.71s 1.1s 0.0% 3.8 KiB/s 6 msgs/s 83.5 KiB/s 0 Bytes/s
subscribe/ftp_f70 run 3/3 12 100% 42% 0 283.14s 1.84s 1.0s 0.0% 1.6 KiB/s 3 msgs/s 9.0 KiB/s 0 Bytes/s
subscribe/q_f71 run 3/3 16 100% 35% 0 292.54s 4.73s 16.0s 0.1% 1.3 KiB/s 3 msgs/s 41.3 KiB/s 0 Bytes/s
subscribe/rabbitmqtt_f31 run 3/3 0 100% 0% 24 3.76s 1.05s 2.2s 0.0% 3.7 KiB/s 6 msgs/s 159.7 KiB/s 0 Bytes/s
subscribe/u_sftp_f60 run 3/3 0 100% 0% 0 11.92s 1.78s 1.7s 0.0% 3.7 KiB/s 6 msgs/s 83.3 KiB/s 0 Bytes/s
watch/f40 run 1/1 0 100% 0% 0 0.07s 0.02s 0.0s 0.0% 2.7 KiB/s 0 msgs/s 0 Bytes/s 0 Bytes/s
winnow/t00_f10 run 1/1 0 100% 0% 0 7.28s 2.83s 0.0s 51.1% 1.5 KiB/s 1 msgs/s 0 Bytes/s 0 Bytes/s
winnow/t01_f10 run 1/1 0 100% 0% 0 4.29s 1.54s 0.0s 55.6% 1.6 KiB/s 1 msgs/s 0 Bytes/s 0 Bytes/s
Total Running Configs: 23 ( Processes: 62 missing: 0 stray: 0 )
The above is a sample use, in sr3 status, of the metrics produced earlier.
The last column is great, but right now there's nothing to interpret that number.
If sr3 showed a status of "late" for configs which have last > a threshold, then it would be extremely easy for us to monitor with Nagios right away. Eventually we could get fancier monitoring.
I reopened the issue to discuss it. Maybe we want to exclusively have the interpretation done outside of sr3, that's okay and we can close this again.
@andreleblanc11 @junhu3
Also, for components that don't transfer files, it would be nice to have the 'last' metric calculated from the last post (and if a vip is set in the config, only the instance with the VIP matters for the last post time; wVip instances could continue to show - in sr3 status).
I chose "late" because it's 4 letters and fits with the other States reported by sr3 status.
But @MagikEh is worried about the word:
"Caution to terminology, 'late' in the case of 'no obs' can lead an analyst to looking for data that doesn't exist. I'd prefer something along the lines of 'Unexpected Lapse', 'Missed Milestone', 'Checkpoint', or something to that nature"
ah... so report things that we think are a problem... by a different status... interesting... If the latency (as reported by LagAvg) is over a threshold, should that trigger "late" as well?
Also, I think a high rejection rate should trigger a "bad" status of some kind. We could call that "poor", for poorly tuned?
sr3 --diag status
would only show us stuff with non-running or stopped states (the bad ones): partial, late, poor... ?
or is calling them all "late" and then just grep for late good enough?
We might also need settings to indicate what a normal "lastTransfer" is...
so have a system-wide default, and be able to tune for certain flows.
slow is a four letter word.
thinking:
running -- no problems identified.
lag -- latency threshold exceeded... lagThreshold (system-wide default, + per-flow override)
idle -- last transfer too long ago... idleThreshold
slow -- transfer rate too low... slowThreshold
so you do an sr3 status | grep -v run
and you should get all the stopped, part, missing, idle, slow ones.
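The proposed states could be computed per instance from the metrics along these lines. This is only a sketch: the threshold defaults are made up, and the field names come from the sample dumps earlier in this thread:

```python
from datetime import datetime, timezone

def sr3_time_to_epoch(ts):
    # 'YYYYMMDDTHHMMSS.frac' (UTC); fraction parsed separately since it
    # can be longer than strptime's %f allows.
    whole, _, frac = ts.partition(".")
    dt = datetime.strptime(whole, "%Y%m%dT%H%M%S").replace(tzinfo=timezone.utc)
    return dt.timestamp() + (float("0." + frac) if frac else 0.0)

def classify(metrics: dict, now: float,
             lagThreshold=30.0, idleThreshold=900.0, slowThreshold=1024.0):
    """Map one instance's metrics to a proposed state. The thresholds
    would be system-wide defaults with per-flow overrides, as above."""
    log = metrics.get("log", {})
    lag_avg = log.get("lagTotal", 0.0) / max(log.get("lagMessageCount", 0), 1)
    if lag_avg > lagThreshold:
        return "lag"
    flow = metrics.get("flow", {})
    last = flow.get("transferTxLast") or flow.get("transferRxLast")
    if last and now - sr3_time_to_epoch(last) > idleThreshold:
        return "idle"
    rate = metrics.get("sarracenia.transfer", {}).get("byteRateInstant")
    if rate is not None and rate < slowThreshold:
        return "slow"
    return "running"
```

With this ordering, a flow that is both lagging and idle shows up as "lag" first; whatever precedence sr3 picks, the point is that `grep -v run` then catches all of them.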
OK, there is issue824_morestates... which implements this... sort of... but it doesn't pass any tests... how did that happen? I think a regression crept into a previous merge... have to examine.
I was wrong... somehow changing the display broke the flow tests... this will take some time.
fwiw... somehow this change broke "stop" and tickled all kinds of latent vulnerabilities with sanity, cleanup and stop... working through them... stop is still occasionally broken (doesn't break every time, but often enough...)
what I have would likely pass the flow tests, but fail in practice for operations like sr3 stop... This work is on a second branch:
issue824_morestates_2ndTry
OK, so the problem was not identifying running states (idle and retry were missing), and that was causing all manner of havoc. Now the static and flaky tests pass. Dynamic is... interesting...
I've noticed that sr3 sanity starts even stopped processes... which is not correct. hmm...
One of the useful features in Columbo is that it shows when each config either last received a file or last transmitted a file. Adding this information to the metrics could make it easier to develop a monitoring tool for sr3.
Columbo uses its own config file to set limits on how long is too long since last sending/receiving a file. I think it would be better to put these limits directly in the sr3 config for at least two reasons: 1) it's more intuitive (to me) to set these limits directly in the config file. It's easy to forget about Columbo's maxSettings.conf when setting up or modifying a Sundew config. 2) sr3 can know when something is wrong and report it.
For example:
metrics_maxDelayRecv: when more time than this option's value has elapsed since the last accepted message, this config is "late".

metrics_maxDelaySend: when more time than this option's value has elapsed since the last message was posted or file was sent, this config is "late".

sr3 status already takes up a lot of space on the screen, so I'm not sure if it's worth trying to squeeze in detailed information about the last send/receive time, but maybe it could show "late" if a file hasn't been sent or received within the defined time limits.
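As a sketch of what that could look like in an sr3 config file (these option names are proposals from this comment, not options sr3 has today, and the values are arbitrary):

```
# hypothetical per-flow settings using the proposed options
metrics_maxDelayRecv 600    # "late" if no message accepted for 10 minutes
metrics_maxDelaySend 900    # "late" if nothing posted/sent for 15 minutes
```

A system-wide default in default.conf with per-flow overrides like this would match how other sr3 options already work.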
I think the last reception time is always the time of the last accepted message, but the last transmitted time is a little less obvious. For configs that post, using the last post time should work but for senders that don't post, we would have to use the last successful file send time.