MetPX / sarracenia

https://MetPX.github.io/sarracenia
GNU General Public License v2.0

Metrics and settings for last sent/received files #824

Closed reidsunderland closed 4 months ago

reidsunderland commented 11 months ago

One of the useful features in Columbo is that it shows when each config either last received a file or last transmitted a file. Adding this information to the metrics could make it easier to develop a monitoring tool for sr3.

Columbo uses its own config file to set limits on how long is too long since a file was last sent or received. I think it would be better to put these limits directly in the sr3 config, for at least two reasons:

1. It's more intuitive (to me) to set these limits directly in the config file. It's easy to forget about Columbo's maxSettings.conf when setting up or modifying a Sundew config.
2. sr3 can know when something is wrong and report it.

For example:
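
A hypothetical sketch of what such limits could look like if they lived directly in an sr3 config; the threshold option names below are purely illustrative, not existing sr3 options:

# sender/tsource2send.conf (illustrative)
broker amqp://tsource@localhost/
# hypothetical options: flag the flow if nothing has been received/sent for this long
lastReceiveMax 600     # seconds since the last accepted message
lastSendMax    900     # seconds since the last successful send or post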

sr3 status already takes up a lot of space on the screen, so I'm not sure it's worth trying to squeeze in detailed information about the last send/receive time, but maybe it could show "late" if a file hasn't been sent or received within the defined time limits.

I think the last reception time is always the time of the last accepted message, but the last transmitted time is a little less obvious. For configs that post, using the last post time should work, but for senders that don't post, we would have to use the time of the last successful file send.

petersilva commented 11 months ago

Using the sr configuration language is easy for other tools to adopt, because when we do sr3 dump, it dumps the entire config as JSON. So someone building a Columbo-like tool could either read in the output of dump, or they might be able to use the sr_GlobalState class from sr.py... I haven't looked into how to do that, so it might mean some mods. But in principle, building things that understand the config of an entire node should not be hard.
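
A rough sketch of that first approach, assuming sr3 dump emits the JSON on stdout; the structure of the dump isn't shown in this thread, so the key names used below are placeholders to be adjusted against the real output:

import json
import subprocess

# Assumption: "sr3 dump" prints the whole node's configuration as JSON on stdout.
result = subprocess.run(["sr3", "dump"], capture_output=True, text=True, check=True)
node = json.loads(result.stdout)

# Placeholder traversal: inspect the real dump to find the actual key names.
for name, cfg in node.get("configs", {}).items():
    print(name, cfg)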

petersilva commented 10 months ago

OK, got a first patch that writes the metrics; they look like this:


fractal% more sarra_download_f20_01.json
{
  "flow": {
    "stop_requested": false, "last_housekeeping": 1702684097.2077858,
    "transferConnected": false, "transferConnectStart": 0, "transferConnectTime": 0,
    "transferRxBytes": 256603, "transferTxBytes": 0,
    "transferRxFiles": 87, "transferTxFiles": 0,
    "next_housekeeping": 1702684157.2077858,
    "transferRxLast": "20231215T234817.66847229",
    "msgRate": 8.631587419459613, "current_sleep": 0.1
  },
  "post.message": {
    "connected": false, "disconnectLast": 1702684098.7549658,
    "disconnectTime": 0.0032203197479248047, "disconnectCount": 2,
    "rxByteCount": 0, "rxGoodCount": 0, "rxBadCount": 0,
    "txByteCount": 50985, "txGoodCount": 91, "txBadCount": 1
  },
  "gather.message": {
    "connected": false, "disconnectLast": 1702684097.7512178,
    "disconnectTime": 1.0048370361328125, "disconnectCount": 1,
    "rxByteCount": 98249, "rxGoodCount": 175, "rxBadCount": 0,
    "txByteCount": 0, "txGoodCount": 0, "txBadCount": 0,
    "brokerQueuedMessageCount": 287
  },
  "retry": {"msgs_in_download_retry": 0, "msgs_in_post_retry": 9},
  "log": {"lagMax": 87.95920205116272, "lagTotal": 7636.8115911483765, "lagMessageCount": 87, "rejectCount": 88},
  "sarracenia.transfer": {"byteRateInstant": 1159295.2049689442}
}

The metric is: transferRxLast


fractal% more sender_tsource2send_f50_01.json
{
  "flow": {
    "stop_requested": false, "last_housekeeping": 1702684012.0278788,
    "transferConnected": true, "transferConnectStart": 1702684203.829592, "transferConnectTime": 189.70056891441345,
    "transferRxBytes": 0, "transferTxBytes": 1009148,
    "transferRxFiles": 0, "transferTxFiles": 958,
    "transferTxLast": "20231215T234959.566806793",
    "msgRate": 5.057996349006494, "current_sleep": 3.2
  },
  "post.message": {
    "connected": true, "disconnectLast": 1702684181.2908318,
    "disconnectTime": 0.02028059959411621, "disconnectCount": 4,
    "rxByteCount": 0, "rxGoodCount": 0, "rxBadCount": 0,
    "txByteCount": 834602, "txGoodCount": 820, "txBadCount": 2
  },
  "gather.message": {
    "connected": true, "disconnectLast": 1702684149.1651616,
    "disconnectTime": 97.30984234809875, "disconnectCount": 2,
    "rxByteCount": 909723, "rxGoodCount": 958, "rxBadCount": 0,
    "txByteCount": 0, "txGoodCount": 0, "txBadCount": 0,
    "brokerQueuedMessageCount": 38
  },
  "retry": {"msgs_in_download_retry": 0, "msgs_in_post_retry": 138},
  "log": {"lagMax": 97.46203708648682, "lagTotal": 9851.86624956131, "lagMessageCount": 958, "rejectCount": 0},
  "sarracenia.transfer": {"byteRateInstant": 1588481.089361702}
}
fractal%

In there is:

 "transferTxLast": "20231215T234959.566806793"
petersilva commented 10 months ago

There is now a "Last" column indicating the last time something was transferred (as a difference in time from the present, in seconds).


fractal% sr3 status
status: 
Component/Config                         Processes   Connection        Lag                              Rates                                        
                                         State   Run Retry  msg data   Queued   LagMax LagAvg  Last  %rej     pubsub   messages     RxData     TxData 
                                         -----   --- -----  --- ----   ------   ------ ------  ----  ----   --------       ----     ------     ------ 
cpost/veille_f34                         run     1/1     0 100%   0%      0    0.00s    0.00s  0.0s  0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
cpump/pelle_dd1_f04                      stop    0/0     0 100%   0%      0    0.00s    0.00s  0.0s 79.2%  0 Bytes/s   8 msgs/s  0 Bytes/s  0 Bytes/s
cpump/pelle_dd2_f05                      stop    0/0     0 100%   0%      0    0.00s    0.00s  0.0s 79.2%  0 Bytes/s   8 msgs/s  0 Bytes/s  0 Bytes/s
cpump/xvan_f14                           hung    1/1     0 100%   0%      0    0.00s    0.00s  0.0s  0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
cpump/xvan_f15                           hung    1/1     0 100%   0%      0    0.00s    0.00s  0.0s  0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
poll/f62                                 run     1/1     0 100%   0%      0    0.42s    0.15s  0.0s  0.0%  1.4 KiB/s   0 msgs/s  0 Bytes/s  0 Bytes/s
post/shim_f63                            stop    0/0          -          -         -     -     -          -        -
post/test2_f61                           stop    0/0     0 100%   0%      0    0.07s    0.06s  0.0s  0.0% 15.8 KiB/s   0 msgs/s  0 Bytes/s  0 Bytes/s
sarra/download_f20                       run     3/3     0 100%   0%      0   11.25s    3.54s  0.4s  0.0%  2.0 KiB/s   1 msgs/s 561.8 KiB/s  0 Bytes/s
sender/tsource2send_f50                  run   10/10     0 100%   7%      0    1.71s    0.68s  2.0s  0.0%  7.4 KiB/s   6 msgs/s  0 Bytes/s 81.2 KiB/s
shovel/pclean_f90                        run     3/3   610 100%   0%      0   81.27s   75.74s  0.0s  6.0%  2.5 KiB/s   1 msgs/s  0 Bytes/s  0 Bytes/s
shovel/pclean_f92                        run     3/3     0 100%   0%      0  137.20s   82.64s  0.0s  0.0%  5.5 KiB/s   3 msgs/s  0 Bytes/s  0 Bytes/s
shovel/rabbitmqtt_f22                    hung    3/3     0 100%   0%      0    2.21s    0.78s  0.0s  0.0%  7.6 KiB/s   6 msgs/s  0 Bytes/s  0 Bytes/s
shovel/t_dd1_f00                         run     3/3     0 100%   0%      2   11.94s    2.61s  0.0s 80.1%  2.7 KiB/s  10 msgs/s  0 Bytes/s  0 Bytes/s
shovel/t_dd2_f00                         run     3/3     0 100%   0%      1   11.68s    2.62s  0.0s 80.1%  2.7 KiB/s  10 msgs/s  0 Bytes/s  0 Bytes/s
subscribe/amqp_f30                       run     3/3     0 100%   0%      0   15.91s    5.01s  0.3s  0.0%  1.5 KiB/s   1 msgs/s 151.8 KiB/s  0 Bytes/s
subscribe/cclean_f91                     run     3/3   265 100%   0%      0    0.00s    0.00s  0.0s 100.0% 33 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
subscribe/cdnld_f21                      run     3/3     0 100%   0%      0   37.46s    7.35s 55.6s  0.3% 964 Bytes/s   1 msgs/s 13.1 KiB/s  0 Bytes/s
subscribe/cfile_f44                      run     3/3     0 100%   0%      0    1.90s    0.21s  1.6s  0.0%  1.1 KiB/s   3 msgs/s 1004 Bytes/s  0 Bytes/s
subscribe/cp_f61                         run     3/3     0 100%   0%      0   13.90s    2.71s  1.1s  0.0%  3.8 KiB/s   6 msgs/s 83.5 KiB/s  0 Bytes/s
subscribe/ftp_f70                        run     3/3    12 100%  42%      0  283.14s    1.84s  1.0s  0.0%  1.6 KiB/s   3 msgs/s  9.0 KiB/s  0 Bytes/s
subscribe/q_f71                          run     3/3    16 100%  35%      0  292.54s    4.73s 16.0s  0.1%  1.3 KiB/s   3 msgs/s 41.3 KiB/s  0 Bytes/s
subscribe/rabbitmqtt_f31                 run     3/3     0 100%   0%     24    3.76s    1.05s  2.2s  0.0%  3.7 KiB/s   6 msgs/s 159.7 KiB/s  0 Bytes/s
subscribe/u_sftp_f60                     run     3/3     0 100%   0%      0   11.92s    1.78s  1.7s  0.0%  3.7 KiB/s   6 msgs/s 83.3 KiB/s  0 Bytes/s
watch/f40                                run     1/1     0 100%   0%      0    0.07s    0.02s  0.0s  0.0%  2.7 KiB/s   0 msgs/s  0 Bytes/s  0 Bytes/s
winnow/t00_f10                           run     1/1     0 100%   0%      0    7.28s    2.83s  0.0s 51.1%  1.5 KiB/s   1 msgs/s  0 Bytes/s  0 Bytes/s
winnow/t01_f10                           run     1/1     0 100%   0%      0    4.29s    1.54s  0.0s 55.6%  1.6 KiB/s   1 msgs/s  0 Bytes/s  0 Bytes/s
      Total Running Configs:  23 ( Processes: 62 missing: 0 stray: 0 )

A sample of how sr3 status uses the metrics produced above.
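
That Last value is presumably the age of the transferRxLast / transferTxLast stamp shown earlier. A small sketch of the computation, assuming those stamps are UTC and formatted as in the samples above (the fractional part can be longer than the six digits strptime accepts, so it is parsed separately):

from datetime import datetime, timezone

def age_seconds(stamp: str) -> float:
    # Parse a stamp like "20231215T234959.566806793" (assumed UTC) and return its age in seconds.
    whole, _, frac = stamp.partition(".")
    t = datetime.strptime(whole, "%Y%m%dT%H%M%S").replace(tzinfo=timezone.utc)
    epoch = t.timestamp() + (float("0." + frac) if frac else 0.0)
    return datetime.now(timezone.utc).timestamp() - epoch

print(age_seconds("20231215T234959.566806793"))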

reidsunderland commented 5 months ago

The last column is great, but right now there's nothing to interpret that number.

If sr3 showed a status of "late" for configs which have last > a threshold, then it would be extremely easy for us to monitor with Nagios right away. Eventually we could get fancier monitoring.
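
Even before sr3 interprets the number itself, a thin Nagios-style check could do the interpretation from a per-instance metrics file. A sketch only: the metrics path and threshold are taken as arguments since no check interface is defined in this thread, and the timestamp parsing repeats the helper from the sketch above:

#!/usr/bin/env python3
# Hypothetical check: CRITICAL if this instance hasn't transferred anything recently.
import json
import sys
from datetime import datetime, timezone

def age_seconds(stamp: str) -> float:
    # Same stamp format as the metrics samples above (assumed UTC).
    whole, _, frac = stamp.partition(".")
    t = datetime.strptime(whole, "%Y%m%dT%H%M%S").replace(tzinfo=timezone.utc)
    return datetime.now(timezone.utc).timestamp() - (t.timestamp() + (float("0." + frac) if frac else 0.0))

metrics_file, threshold = sys.argv[1], float(sys.argv[2])
with open(metrics_file) as f:
    flow = json.load(f)["flow"]
last = flow.get("transferTxLast") or flow.get("transferRxLast")

if not last:
    print("UNKNOWN: no last-transfer metric yet")
    sys.exit(3)
age = age_seconds(last)
if age > threshold:
    print(f"CRITICAL: last transfer {age:.0f}s ago (threshold {threshold:.0f}s)")
    sys.exit(2)
print(f"OK: last transfer {age:.0f}s ago")
sys.exit(0)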

I reopened the issue to discuss it. Maybe we want the interpretation to be done exclusively outside of sr3; that's okay, and we can close this again.

@andreleblanc11 @junhu3

reidsunderland commented 5 months ago

Also, for components that don't transfer files, it would be nice to have the 'last' metric calculated from the last post.

(And if a vip is set in the config, only the instance holding the VIP matters for the last post time; instances without the VIP could continue to show "-" in sr3 status.)

reidsunderland commented 5 months ago

I chose "late" because it's 4 letters and fits with the other States reported by sr3 status.

But @MagikEh is worried about the word:

"Caution to terminology, 'late' in the case of 'no obs' can lead an analyst to looking for data that doesn't exist. I'd prefer something along the lines of 'Unexpected Lapse', 'Missed Milestone', 'Checkpoint', or something to that nature"

petersilva commented 5 months ago

Ah... so report things that we think are a problem... by different statuses... interesting... If the latency (as reported by LagAvg) is over a threshold, should that trigger "late" as well?

Also, I think a high rejection rate should trigger a "bad" status of some kind. We could call that "poor", for poorly tuned?

sr3 --diag status

would only show us stuff with non-running or stopped states (the bad ones): partial, late, poor ... ?

petersilva commented 5 months ago

Or is calling them all "late" and then just grepping for "late" good enough?

petersilva commented 5 months ago

We might also need settings to indicate what a normal "lastTransfer" is...

so have a system-wide default, and be able to tune for certain flows.

petersilva commented 5 months ago

slow is a four letter word.

petersilva commented 5 months ago

thinking:

running -- no problems identified.
lag -- latency threshold exceeded: lagThreshold (system-wide default, + per-flow override)
idle -- last transfer too long ago... idleThreshold
slow -- transfer rate too low: slowThreshold

so you do an sr3 status | grep -v run

and you should get all the stopped, part, missing, idle, slow ones.
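
A sketch of how that classification could look, using the threshold names from the list above; the default values are made up, standing in for the proposed system-wide settings plus per-flow overrides:

def flow_state(lag_avg: float, idle_secs: float, byte_rate: float,
               lagThreshold: float = 30.0,
               idleThreshold: float = 900.0,
               slowThreshold: float = 1024.0) -> str:
    """Pick one display state for sr3 status (threshold defaults are illustrative only)."""
    if lag_avg > lagThreshold:
        return "lag"      # latency threshold exceeded
    if idle_secs > idleThreshold:
        return "idle"     # last transfer too long ago
    if byte_rate < slowThreshold:
        return "slow"     # transfer rate too low
    return "run"          # no problems identified

# e.g. flow_state(lag_avg=0.68, idle_secs=2.0, byte_rate=81.2 * 1024) returns "run"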

petersilva commented 5 months ago

OK, there is a branch, issue824_morestates, that implements this... sort of... but it doesn't pass any tests. How did that happen? I think a regression crept into a previous merge... I'll have to examine it.

petersilva commented 5 months ago

I was wrong... somehow changing the display broke the flow tests... this will take some time.

petersilva commented 5 months ago

FWIW... somehow this change broke "stop" and tickled all kinds of latent vulnerabilities in sanity, cleanup, and stop... I'm working through them. stop is still occasionally broken (it doesn't break every time, but often enough...)

What I have would likely pass the flow tests, but fail in practice for operations like sr3 stop. This work is on a second branch:

issue824_morestates_2ndTry

petersilva commented 5 months ago

OK, so the problem was not identifying running states (idle and retry were missing), and that was causing all manner of havoc. Now the static and flakey tests pass. The dynamic ones are... interesting...

I've noticed that sr3 sanity starts even stopped processes... which is not correct. hmm...