This is super interesting... what I see in the log, general trend:
Periodic housekeeping is supposed to happen every 'housekeeping' interval, which is 5 minutes by default. The actual intervals are all over the map... a sample from the beginning of the file: 09, 23, 37, 50, 03, 07, 21, 34, 48... It only checks whether to run housekeeping between transfer batches, so it looks like individual batches are taking 20 minutes...
The processing of messages/files is done in batches: you see the after_accept messages happen @ 0h09, then it is quiet for 15 minutes (until 0h23) while it downloads all those files. It is not that it was sitting there doing nothing; it's that the success log only happens after the entire batch is complete.
It is the transfers themselves from the datamart that are pathologically slow at those times... so the built-in transfers are slowing down... guesses:
Hmm... I'm going to look into some periodic plugins to report memory usage. In the meantime, you can restart every couple of days with a cron job, and that might prevent a recurrence. Or, if you are willing to tolerate a recurrence, I will provide a memory reporter. We already have a bug outstanding where periodic processing is supposed to restart when memory leaks are detected ( #395 )
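For the cron-job workaround, something along these lines would do it. This is only a sketch: it assumes the sr3 command line and the subscribe/rolf configuration name that appears later in the logs; with v2 the equivalent would be sr_subscribe restart rolf.

```
# hypothetical crontab entry: restart the subscriber every second day at 04:00
0 4 */2 * * sr3 restart subscribe/rolf
```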
conclusion:
subscribe_rolf_01.log.2022-01-08.gz
Here's yesterday's log file that shows the slowdown start in the evening.
Commits: b430327dac6bf12ef42cc3db0a92b2a3a6fd4953, 337e2632835aa8ea40c10f8836b7b1c0b103dfa1
The average-latency messages produced at housekeeping intervals should be more helpful now... the metrics are per housekeeping interval, rather than cumulative from the start.
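As an illustration of the difference (this is not the actual sarracenia code), a callback that resets its accumulators at every housekeeping pass reports per-interval averages instead of cumulative ones. The sketch assumes the sr3 flowcb hooks (after_accept, on_housekeeping), the sarracenia.nowflt / sarracenia.timestr2flt helpers, and a 'pubTime' field on each message:

```python
# Rough sketch of per-interval lag reporting (not the real sarracenia code).
import logging
import sarracenia
from sarracenia.flowcb import FlowCB

logger = logging.getLogger(__name__)

class IntervalLag(FlowCB):
    def __init__(self, options):
        super().__init__(options)
        self.lag_sum = 0.0
        self.lag_count = 0

    def after_accept(self, worklist):
        now = sarracenia.nowflt()
        for msg in worklist.incoming:
            self.lag_sum += now - sarracenia.timestr2flt(msg['pubTime'])
            self.lag_count += 1

    def on_housekeeping(self):
        if self.lag_count:
            logger.info("lag this interval: average %.2f over %d messages" %
                        (self.lag_sum / self.lag_count, self.lag_count))
        # reset, so the next report covers only the next interval
        # instead of being cumulative from start.
        self.lag_sum = 0.0
        self.lag_count = 0
```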
It happened again: subscribe_rolf_01.log.2022-01-15.gz
You can ignore the 3am slowdown; that is caused by some local processes. But at around 9:00am it got very slow again, and it only sped up after I restarted it.
Looking at the log, it is exactly as you say... I'd like some memory information to be in the housekeeping reports... still waiting on #395
There is something else going on here too. No matter how slow it gets, it never falls more than about 1770 seconds behind; and if I get an outage longer than ~1770 seconds, it resumes about 1770 seconds out (instead of the 4h that I have in my config).
It seems as if the queue is deleting messages after around 1770 seconds.
bunch of different possibilities:
how many updates do you get per hour approx?
There is a max-length of 10000 for anonymous queues.
only ten thousand... yeah that makes more sense... it would do that...
I get ~20k items per hour, which makes the 1700 second limit make sense.
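A quick back-of-the-envelope check using the ~20k/hour figure above:

```python
# roughly how far behind can a queue capped at 10000 messages get?
rate = 20000 / 3600         # ~5.6 messages arriving per second
max_queued = 10000          # max-length for anonymous queues
print(max_queued / rate)    # ~1800 seconds of backlog before old messages are dropped
```

which lines up with the ~1770-second ceiling seen in the logs.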
fwiw @MagikEh is making great progress on #395 ... I think it will solve this.
Now that #395 is merged... @endlisnis you should update your configuration and stop/start. With the new code, if the memory usage gets too large, it should restart on its own, so you don't need to do anything.
I updated my clients. Let's wait a week and see if I have this problem again (I was having it on average about once every 2 days).
The clients ran fine for many days, but they've been acting very strange today. I'm not sure if it's the same problem, but I keep falling behind, and there seems to be some mix of very old and new entries in the log: subscribe_rolf_01.log.gz
I tried restarting it, but it seems to have just fallen behind again.
I also (in the above log) get a bunch of messages like "download incomplete download only 9452 of expected 9455 bytes for 2022-02-23-1740-CVYD-AUTO-minute-swob.xml" when I restart the clients.
There were lots of problems with internal systems yesterday. The feeds to the DMZ systems were affected. I don't think anything yesterday is representative of a problem with your feed.
Is it back to normal today?
Yes, it seems fine today.
I've been suffering problems over the last few days that also cause my client to fall way behind. I restarted them about 15 minutes ago and they quickly caught up. Here is a log from instance 1: subscribe_rolf_01.log.gz
Great log... It isn't clear when you restarted; it doesn't look like it logged anything on shutdown or startup. From your description and the log, I'm guessing 20:26?
it started slowing here:
2022-03-23 12:57:56,825 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 3.00 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CVQZ/2022-03-23-1657-CVQZ-AUTO-minute-swob.xml
2022-03-23 12:57:56,825 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 3.00 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWAQ/2022-03-23-1657-CWAQ-AUTO-minute-swob.xml
2022-03-23 12:57:56,825 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.96 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWZE/2022-03-23-1657-CWZE-AUTO-minute-swob.xml
2022-03-23 12:57:56,825 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.96 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWKH/2022-03-23-1657-CWKH-AUTO-minute-swob.xml
2022-03-23 12:57:56,825 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.45 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWBM/2022-03-23-1657-CWBM-AUTO-minute-swob.xml
2022-03-23 12:59:17,937 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CXKI/2022-03-23-1657-CXKI-AUTO-minute-swob.xml
2022-03-23 12:59:17,937 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CXOX/2022-03-23-1657-CXOX-AUTO-minute-swob.xml
2022-03-23 12:59:17,937 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CXPT/2022-03-23-1657-CXPT-AUTO-minute-swob.xml
2022-03-23 12:59:17,937 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CXKE/2022-03-23-1656-CXKE-AUTO-minute-swob.xml
2022-03-23 12:59:17,937 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CWEW/2022-03-23-1657-CWEW-AUTO-minute-swob.xml
2022-03-23 12:59:17,937 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CVQZ/2022-03-23-1657-CVQZ-AUTO-minute-swob.xml
2022-03-23 12:59:17,938 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CWAQ/2022-03-23-1657-CWAQ-AUTO-minute-swob.xml
2022-03-23 12:59:17,938 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CWZE/2022-03-23-1657-CWZE-AUTO-minute-swob.xml
2022-03-23 12:59:17,938 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CWKH/2022-03-23-1657-CWKH-AUTO-minute-swob.xml
2022-03-23 12:59:17,938 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CWBM/2022-03-23-1657-CWBM-AUTO-minute-swob.xml
2022-03-23 12:59:20,440 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 61.79 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWDV/2022-03-23-1657-CWDV-AUTO-minute-swob.xml
2022-03-23 12:59:20,440 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 61.79 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CAMS/2022-03-23-1657-CAMS-AUTO-minute-swob.xml
2022-03-23 12:59:20,440 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 61.60 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CVXC/2022-03-23-1657-CVXC-AUTO-minute-swob.xml
2022-03-23 12:59:20,440 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 62.87 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWPN/2022-03-23-1657-CWPN-AUTO-minute-swob.xml
Then it was stable for a little while, until:
2022-03-23 12:59:20,452 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 61.10 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CVOQ/2022-03-23-1657-CVOQ-AUTO-minute-swob.xml
2022-03-23 12:59:20,452 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 61.10 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWKX/2022-03-23-1657-CWKX-AUTO-minute-swob.xml
2022-03-23 12:59:20,452 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 61.10 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWOK/2022-03-23-1657-CWOK-AUTO-minute-swob.xml
2022-03-23 12:59:20,452 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 701.81 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CPUN/2022-03-23-1646-CPUN-AUTO-minute-swob.xml
2022-03-23 12:59:20,452 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 701.81 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWDJ/2022-03-23-1646-CWDJ-AUTO-minute-swob.xml
2022-03-23 12:59:20,452 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 701.81 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CVSJ/2022-03-23-1646-CVSJ-AUTO-minute-swob.xml
Then it got gradually worse; the worst it got was here:
2022-03-23 15:50:06,217 [INFO] sarracenia.flowcb.log housekeeping_stats lag: average: 1770.75, maximum: 1772.01
2022-03-23 15:50:06,217 [INFO] sarracenia.flowcb.log on_housekeeping housekeeping
2022-03-23 15:50:08,470 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.66 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CPOX/2022-03-23-1919-CPOX-AUTO-minute-swob.xml
2022-03-23 15:50:08,470 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.66 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CVFN/2022-03-23-1919-CVFN-AUTO-minute-swob.xml
2022-03-23 15:50:08,471 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.66 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWZT/2022-03-23-1919-CWZT-AUTO-minute-swob.xml
2022-03-23 15:50:08,471 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.66 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CPUN/2022-03-23-1919-CPUN-AUTO-minute-swob.xml
2022-03-23 15:50:08,471 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1806.14 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CVYD/2022-03-23-1919-CVYD-AUTO-minute-swob.xml
2022-03-23 15:50:08,471 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1806.14 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CTZN/2022-03-23-1919-CTZN-AUTO-minute-swob.xml
2022-03-23 15:50:08,471 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.66 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CZGH/2022-03-23-1919-CZGH-AUTO-minute-swob.xml
2022-03-23 15:50:08,472 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.66 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWIW/2022-03-23-1919-CWIW-AUTO-minute-swob.xml
2022-03-23 15:50:08,472 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1806.14 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CZKA/2022-03-23-1919-CZKA-AUTO-minute-swob.xml
2022-03-23 15:50:08,472 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1806.14 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CXET/2022-03-23-1919-CXET-AUTO-minute-swob.xml
2022-03-23 15:50:08,472 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.66 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CTPM/2022-03-23-1919-CTPM-AUTO-minute-swob.xml
2022-03-23 15:50:08,472 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.66 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CPQD/2022-03-23-1919-CPQD-AUTO-minute-swob.xml
2022-03-23 15:50:08,472 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1806.14 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/COGE/2022-03-23-1919-COGE-AUTO-minute-swob.xml
2022-03-23 15:50:08,473 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.65 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CTUX/2022-03-23-1919-CTUX-AUTO-minute-swob.xml
2022-03-23 15:50:08,473 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1806.14 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CASF/2022-03-23-1919-CASF-AUTO-minute-swob.xml
2022-03-23 15:50:08,473 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.65 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWNH/2022-03-23-1919-CWNH-AUTO-minute-swob.xml
2022-03-23 15:50:08,473 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1806.14 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CPYQ/2022-03-23-1919-CPYQ-AUTO-minute-swob.xml
Then it was intermittently recovering for a while until this weirdness happened:
2022-03-23 17:11:13,001 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1576.31 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWCO/2022-03-23-2044-CWCO-AUTO-minute-swob.xml
2022-03-23 17:11:13,001 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1576.35 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWPZ/2022-03-23-2044-CWPZ-AUTO-minute-swob.xml
2022-03-23 17:11:13,001 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1576.31 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CXFV/2022-03-23-2044-CXFV-AUTO-minute-swob.xml
2022-03-23 17:11:13,001 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1576.35 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWVT/2022-03-23-2044-CWVT-AUTO-minute-swob.xml
2022-03-23 17:22:02,796 [ERROR] sarracenia.transfer read_writelocal util/writelocal mismatched file length writing 2022-03-23-2039-CWXA-AUTO-minute-swob.xml. Message said to expect 8564 bytes. Got 8563 bytes.
2022-03-23 17:22:02,796 [ERROR] sarracenia.flow download incomplete download only 8563 of expected 8564 bytes for 2022-03-23-2039-CWXA-AUTO-minute-swob.xml
2022-03-23 17:22:02,796 [INFO] sarracenia.flow do_download attempt 1 failed to download https://dd5.weather.gc.ca//observations/swob-ml/20220323/CWXA/2022-03-23-2039-CWXA-AUTO-minute-swob.xml to /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CWXA/2022-03-23-2039-CWXA-AUTO-minute-swob.xml
2022-03-23 17:22:02,796 [WARNING] sarracenia.flow do_download downloading again, attempt 2
2022-03-23 17:22:10,904 [ERROR] sarracenia.transfer read_writelocal util/writelocal mismatched file length writing 2022-03-23-2039-CWXA-AUTO-minute-swob.xml. Message said to expect 8564 bytes. Got 8563 bytes.
2022-03-23 17:22:10,904 [ERROR] sarracenia.flow download incomplete download only 8563 of expected 8564 bytes for 2022-03-23-2039-CWXA-AUTO-minute-swob.xml
2022-03-23 17:22:10,904 [INFO] sarracenia.flow do_download attempt 2 failed to download https://dd5.weather.gc.ca//observations/swob-ml/20220323/CWXA/2022-03-23-2039-CWXA-AUTO-minute-swob.xml to /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CWXA/2022-03-23-2039-CWXA-AUTO-minute-swob.xml
2022-03-23 17:22:10,904 [WARNING] sarracenia.flow do_download downloading again, attempt 3
2022-03-23 17:22:19,008 [ERROR] sarracenia.transfer read_writelocal util/writelocal mismatched file length writing 2022-03-23-2039-CWXA-AUTO-minute-swob.xml. Message said to expect 8564 bytes. Got 8563 bytes.
2022-03-23 17:22:19,008 [ERROR] sarracenia.flow download incomplete download only 8563 of expected 8564 bytes for 2022-03-23-2039-CWXA-AUTO-minute-swob.xml
2022-03-23 17:22:19,008 [INFO] sarracenia.flow do_download attempt 3 failed to download https://dd5.weather.gc.ca//observations/swob-ml/20220323/CWXA/2022-03-23-2039-CWXA-AUTO-minute-swob.xml to /home/rolf/weather/swobMirror
After recovering from that, it was suddenly only 5 minutes behind, instead of 30:
2022-03-23 17:25:27,174 [INFO] sarracenia.diskqueue on_housekeeping post_retry_001 on_housekeeping
2022-03-23 17:25:27,174 [INFO] sarracenia.diskqueue on_housekeeping No retry in list
2022-03-23 17:25:27,174 [INFO] sarracenia.diskqueue on_housekeeping on_housekeeping elapse 0.000157
2022-03-23 17:25:27,174 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current Memory cpu_times: user=3025.45 system=575.89 elapse=17455928.9
2022-03-23 17:25:27,174 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current Memory usage: 75.3 MiB / 224.3 MiB = 33.56%
2022-03-23 17:25:27,174 [INFO] sarracenia.flowcb.log housekeeping_stats messages_received: 100, accepted: 97, rejected: 3 ( bytes: 2134 )
2022-03-23 17:25:27,174 [INFO] sarracenia.flowcb.log housekeeping_stats files transferred: 93, cumulative bytes of data: 858293
2022-03-23 17:25:27,174 [INFO] sarracenia.flowcb.log housekeeping_stats lag: average: 1586.41, maximum: 1613.64
2022-03-23 17:25:27,174 [INFO] sarracenia.flowcb.log on_housekeeping housekeeping
2022-03-23 17:25:29,284 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 297.36 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWXA/2022-03-23-2020-CWXA-AUTO-minute-swob.xml
2022-03-23 17:25:29,284 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 297.36 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CACQ/2022-03-23-2043-CACQ-AUTO-minute-swob.xml
2022-03-23 17:25:29,284 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 296.33 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CVYD/2022-03-23-2043-CVYD-AUTO-minute-swob.xml
2022-03-23 17:25:29,284 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 296.33 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CABF/2022-03-23-2043-CABF-AUTO-minute-swob.xml
2022-03-23 17:25:29,284 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 296.33 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWPN/2022-03-23-2043-CWPN-AUTO-minute-swob.xml
2022-03-23 17:25:29,284 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 296.33 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWJV/2022-03-23-2044-CWJV-AUTO-minute-swob.xml
but the lag lengthened again... gradually... and the transfers went slowly, so it fell behind.
2022-03-23 17:51:45,770 [INFO] sarracenia.flowcb.log housekeeping_stats lag: average: 956.50, maximum: 962.80
2022-03-23 17:51:45,770 [INFO] sarracenia.flowcb.log on_housekeeping housekeeping
2022-03-23 17:51:48,669 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1593.56 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CXKI/2022-03-23-2121-CXKI-AUTO-minute-swob.xml
2022-03-23 17:51:48,669 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1592.77 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWFF/2022-03-23-2121-CWFF-AUTO-minute-swob.xml
2022-03-23 17:51:48,669 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1593.56 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CXRB/2022-03-23-2121-CXRB-AUTO-minute-swob.xml
The lag wandered around again, built up to 30 minutes (1800 seconds) again, peaked at 1900, and gradually recovered to <10... then: a weird grouping of old messages. My guess is that when it was restarted and the re-connection happened, it resent some old messages that had never been acknowledged:
2022-03-23 20:26:09,926 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220324/CNCD/2022-03-24-0025-CNCD-AUTO-minute-swob.xml
2022-03-23 20:26:09,926 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220324/CWCH/2022-03-24-0025-CWCH-AUTO-minute-swob.xml
2022-03-23 20:26:09,926 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220324/CTPM/2022-03-24-0025-CTPM-AUTO-minute-swob.xml
2022-03-23 20:26:09,948 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 13274.32 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWXA/2022-03-23-2039-CWXA-AUTO-minute-swob.xml
2022-03-23 20:26:09,948 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 13274.32 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWXA/2022-03-23-2035-CWXA-AUTO-minute-swob.xml
2022-03-23 20:26:09,948 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 13274.31 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWXA/2022-03-23-2033-CWXA-AUTO-minute-swob.xml
2022-03-23 20:26:09,948 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 13273.26 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWXA/2022-03-23-2043-CWXA-AUTO-minute-swob.xml
2022-03-23 20:26:10,046 [ERROR] sarracenia.transfer read_writelocal util/writelocal mismatched file length writing 2022-03-23-2039-CWXA-AUTO-minute-swob.xml. Message said to expect 8564 bytes. Got 8563 bytes.
2022-03-23 20:26:10,046 [ERROR] sarracenia.flow download incomplete download only 8563 of expected 8564 bytes for 2022-03-23-2039-CWXA-AUTO-minute-swob.xml
followed immediately by <2 second lag...
2022-03-23 20:26:13,498 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.28 ) https://dd5.weather.gc.ca /observations/swob-ml/20220324/CWLJ/2022-03-24-0025-CWLJ-AUTO-minute-swob.xml
2022-03-23 20:26:13,498 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.27 ) https://dd5.weather.gc.ca /observations/swob-ml/20220324/CZEV/2022-03-24-0025-CZEV-AUTO-minute-swob.xml
2022-03-23 20:26:13,498 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.27 ) https://dd5.weather.gc.ca /observations/swob-ml/20220324/CNEK/2022-03-24-0025-CNEK-AUTO-minute-swob.xml
2022-03-23 20:26:13,498 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.27 ) https://dd5.weather.gc.ca /observations/swob-ml/20220324/CWJR/2022-03-24-0025-CWJR-AUTO-minute-swob.xml
2022-03-23 20:26:13,499 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.26 ) https://dd5.weather.gc.ca /observations/swob-ml/20220324/CZGH/2022-03-24-0025-CZGH-AUTO-minute-swob.xml
2022-03-23 20:26:13,499 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.26 ) https://dd5.weather.gc.ca /observations/swob-ml/20220324/CVFN/2022-03-24-0025-CVFN-AUTO-minute-swob.xml
2022-03-23 20:26:13,499 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.26 ) https://dd5.weather.gc.ca /observations/swob-ml/20220324/CMSI/2022-03-24-0025-CMSI-AUTO-minute-swob.xml
It doesn't look like it logged anything on shutdown or startup
When I restarted it, none of the instances responded to the TERM requests. They ended up getting killed after a while.
This is the timing of the restart:
2022-03-23 20:19:25,571 [INFO] root stop_signal signal 15 received
I also noticed that sometimes it saves partially downloaded files. This is really brutal for my application. I have a periodic task that runs once per hour that tries to poll for files that were missed; but since PART of these files was downloaded, I can't easily tell that I should try to download them again.
I'm having to check the file extension, and if it's .xml, then I have to see if ">" is the last character in the file. If it is not, I assume the file is corrupt and download it again.
For my application, it would be much better if partial downloads were thrown away.
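For reference, a minimal sketch of the completeness check described above; the trailing-'>' heuristic applies only to these swob .xml files and is an approximation, not something sarracenia provides:

```python
import os

def looks_complete(path: str) -> bool:
    """Heuristic: a fully downloaded swob .xml file should end with '>'."""
    if not path.endswith('.xml'):
        return True  # the heuristic only applies to the XML files
    try:
        with open(path, 'rb') as f:
            f.seek(-1, os.SEEK_END)
            return f.read(1) == b'>'
    except OSError:
        return False  # empty or unreadable: treat as incomplete

# e.g. re-download anything that looks truncated during the hourly catch-up poll
```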
I wrote my own downloader rather than using the built-in one.
I have a custom plugin where sarra just sends a notification to my downloader. My downloader
FWIW @matthewdarwin ... you can do all that with the built-in downloader:
inflight tmp/
would download files to a subdirectory named tmp, instead of to files with a .tmp suffix, and then move them when complete.
acceptSizeWrong off
will have it treat downloads where the size is wrong as failures, and it will retry them automatically.
instances 4
You can control how many downloaders operate in parallel with instances. We have cases with 40 or more processes munching on the same queue. acceptSizeWrong is perhaps only in sr3, but the others have been there since 2018 at least.
To prioritize things, configure a separate subscription for the high-priority stuff vs. the others (two separate configurations) so that there is a high-priority queue and another queue for everything else.
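Putting those options together, a subscriber configuration might look something like the sketch below; the broker, subtopic, and directory values are placeholders rather than anything taken from this thread:

```
# hypothetical subscribe config combining the options above (values are placeholders)
broker amqps://anonymous@dd.weather.gc.ca
subtopic observations.swob-ml.#

# write in-progress files under tmp/ and move them into place when complete
inflight tmp/

# treat wrong-sized downloads as failures so they get retried
acceptSizeWrong off

# run several downloaders in parallel on the same queue
instances 4

directory /data/swob
accept .*
```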
@endlisnis, for the wrong-size errors giving you headaches, the short version is to add a line like this to your config:
acceptSizeWrong off
Thanks for the hints. I need to prioritize downloads amongst the background scraper jobs too, which don't run via sarra. If the internet connection is down for a few hours, there are many files to download to catch up.
Hey @matthewdarwin, if the download CLI isn't giving any value to you and you are using custom code anyway, you might want to use the (new) Python API and not bother with the CLI at all. Demonstration here:
acceptSizeWrong off
should be the default in v3. For 99% of the use cases I can think of, keeping partial files is wrong and complicated.
@endlisnis yes... looking at the code, I'm confused, because it already is the default... why it would not retry in that case is puzzling... looking... wow... it did reject the file; it wasn't happy. It just failed to queue it up for retry... but it got it half-right, in that the reception was never acknowledged, so that later:
2022-03-23 20:26:09,948 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 13274.31 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWXA/2022-03-23-2033-CWXA-AUTO-minute-swob.xml
So it's a bug where it has kind of half-detected that the transfer failed... AMQP decided to resend the message when the re-connection happened. Why didn't it get queued for retry? Oh... it did:
2022-03-23 19:38:39,792 [INFO] sarracenia.diskqueue on_housekeeping Number of messages in retry list 4
2022-03-23 19:38:39,793 [INFO] sarracenia.diskqueue on_housekeeping on_housekeeping elapse 0.006980
2022-03-23 19:38:39,793 [INFO] sarracenia.diskqueue on_housekeeping post_retry_001 on_housekeeping
2022-03-23 19:38:39,793 [INFO] sarracenia.diskqueue on_housekeeping No retry in list
So getting on the retry list is OK, but then it prints 'No retry in list'... so it's like it cannot read from it. Hey @reidsunderland, this looks like the same case you were seeing in the flow test? It's real! Stuff is getting put on the retry list, but fetching from it isn't working in at least one real case.
So there is a problem in the retry logic... we will work on that. I think @reidsunderland might already be on it in the context of another issue.
Yes! It seems to be related to, or the same problem that I'm working on for #466.
I continue to have daily problems with my feeds falling behind (often with lag exceeding 400 seconds, sometimes getting up near 1500 seconds). Do you need any more logs from me?
I have the fix for #488, but I (again) experienced a slowdown today. subscribe_rolf_01.log.gz
It was going up and down, as bad as 5 minutes but usually only a few seconds, until:
2022-03-28 02:19:19,697 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.26 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CXWG/2022-03-28-0618-CXWG-AUTO-minute-swob.xml
2022-03-28 02:19:19,697 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.26 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWJR/2022-03-28-0618-CWJR-AUTO-minute-swob.xml
2022-03-28 02:19:19,697 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.26 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CTPQ/2022-03-28-0618-CTPQ-AUTO-minute-swob.xml
2022-03-28 02:19:19,697 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.26 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWLJ/2022-03-28-0617-CWLJ-AUTO-minute-swob.xml
2022-03-28 03:25:28,486 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220328/CZPK/2022-03-28-0601-CZPK-AUTO-minute-swob.xml
2022-03-28 03:25:28,487 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220328/CARP/2022-03-28-0618-CARP-AUTO-minute-swob.xml
then it hung for an hour or so...
2022-03-28 03:25:28,495 [INFO] sarracenia.flowcb.log stats lag: average: 4.07, maximum: 19.37
2022-03-28 03:25:28,495 [INFO] sarracenia.flowcb.log on_housekeeping housekeeping
2022-03-28 03:25:31,292 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1603.51 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CVGE/2022-03-28-0653-CVGE-AUTO-minute-swob.xml
2022-03-28 03:25:31,293 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1603.51 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWXB/2022-03-28-0658-CWXB-AUTO-minute-swob.xml
2022-03-28 03:25:31,293 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1597.86 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWPU/2022-03-28-0658-CWPU-AUTO-minute-swob.xml
2022-03-28 03:25:31,293 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1597.86 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/COGE/2022-03-28-0658-COGE-AUTO-minute-swob.xml
2022-03-28 03:25:31,293 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1598.39 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CVRA/2022-03-28-0658-CVRA-AUTO-minute-swob.xml
It caught up in 4 minutes after restoral:
2022-03-28 03:39:55,068 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 0.50 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CVBB/2022-03-28-0739-CVBB-AUTO-minute-swob.xml
2022-03-28 03:39:55,069 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 0.49 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWVQ/2022-03-28-0739-CWVQ-AUTO-minute-swob.xml
2022-03-28 03:39:55,069 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 0.49 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWUS/2022-03-28-0739-CWUS-AUTO-minute-swob.xml
2022-03-28 03:39:55,590 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220328/CVBB/2022-03-28-0739-CVBB-AUTO-minute-swob.xml
The symptom is different. It doesn't show any error message, it just hangs. Did you restart around 3:25? Perhaps a cron job? Anyway, it was then good for about 6 hours, and then it gradually slowed down:
2022-03-28 09:40:50,470 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 8.07 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWCT/2022-03-28-1339-CWCT-AUTO-minute-swob.xml
2022-03-28 09:40:50,471 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 7.91 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWTE/2022-03-28-1339-CWTE-AUTO-minute-swob.xml
2022-03-28 09:40:50,471 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 8.05 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CNZS/2022-03-28-1339-CNZS-AUTO-minute-swob.xml
2022-03-28 09:47:37,401 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220328/CMGZ/2022-03-28-1339-CMGZ-AUTO-minute-swob.xml
.
.
.
2022-03-28 09:47:39,509 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 400.76 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWSZ/2022-03-28-1340-CWSZ-AUTO-minute-swob.xml
2022-03-28 09:47:39,509 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 400.75 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CXCA/2022-03-28-1340-CXCA-AUTO-minute-swob.xml
... then a 13-minute delay while downloading occurs... these 13 minutes represent the time between the start of the download and the completion of the batch of files accepted above.
2022-03-28 09:47:39,512 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 397.56 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWEI/2022-03-28-1340-CWEI-AUTO-minute-swob.xml
2022-03-28 10:01:12,261 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220328/CWSZ/2022-03-28-1340-CWSZ-AUTO-minute-swob.xml
.
.
.
... in the same second, it receives the next batch of messages:
2022-03-28 10:01:12,263 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220328/CWEI/2022-03-28-1340-CWEI-AUTO-minute-swob.xml
2022-03-28 10:01:12,263 [INFO] sarracenia.flow run on_housekeeping pid: 3385964 subscribe/rolf instance: 1
2022-03-28 10:01:12,263 [INFO] sarracenia.flowcb.gather.message on_housekeeping messages: good: 100 bad: 0 bytes: 11.9 KiB average: 121 Bytes
... now it is delayed by about 18 minutes...
2022-03-28 10:01:12,264 [INFO] sarracenia.flowcb.log on_housekeeping housekeeping
2022-03-28 10:01:14,423 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1136.29 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWLV/2022-03-28-1341-CWLV-AUTO-minute-swob.xml
2022-03-28 10:01:14,423 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1136.29 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CERM/2022-03-28-1341-CERM-AUTO-minute-swob.xml
So... that doesn't look like any client-side error, but rather internet congestion... It could be on our end or your end, or somewhere in between. Hmm... looking at stats... yeah, there is heavy load, but it does not look saturated... Are you pulling down a lot of other stuff at the same time (maybe weather model outputs?) that could compete with this?
oh you can also set batch 1 if you want to see the delay accumulate more directly.
That computer is pretty idle during the day. I only do significant downloading once per hour at *:15 when I manually sync up with the SWOB directories. If the AMQP client is far behind, that might be a bunch of downloading.
I do share my home internet with my work connection. I'm often doing Teams video meetings on it, but it's 100M service, so it shouldn't be THAT bad. Also, I would expect the clients to catch up after those sessions.
I restarted it at 15:34:43. That's the event that I was trying to show. I guess the log file I posted didn't show it catch up quickly after that restart. It was back to ~5s lag within 3 minutes.
The reason I don't think it's just congestion is that it caught up really fast after I restarted the service.
swob is tiny... it should not be that big a deal. No... I was thinking piles of model outputs, or satellite imagery or something... Teams is supposed to be < 2 Mb/s... and you say a restart fixed it... hmm... odd.
So there is a byteRateMax feature to do bandwidth limiting... if that is being triggered inappropriately, then with the latest commit you should see an info message:
[INFO] ... throttle exceeded byteRateMax X sleeping for y
I doubt that's it, but might as well.
I just updated and restarted. Let's see what tomorrow looks like...
Another day, another slowdown. After it was more than 1000 seconds behind and not catching up, I restarted it at "21:10:06" and it quickly caught up. subscribe_rolf_01.log.gz
How many instances are you running? I think 5 would be a good number. If you run only 1, then all the downloads are sequential; if you have multiple instances then they overlap (across the various instances). It might just be accumulating delay from making everything sequential. The log above doesn't look that bad; it gets slow and catches up multiple times during the day... it isn't clear that it would not have caught up again. Hmm... but it would be good to see what happens if there is more than one instance. If it's already >1, then I'll need to think of something else.
instances 5
I'm running 4 instances.
If you look just before I reset it, it had sat for literally 12 minutes with no output. The same was true for the other 3 instances. Nobody was downloading anything.
All of them stopped at the same time? It's not like instances communicate with one another... they are just independent processes. It's hard to see how a software problem would affect four processes identically after hours of downloading different stuff. It's starting to look like the broker is involved here, rather than the client software... hmm...
I suggest
batch 5
You see these long gaps because the actual downloader is not printing anything while it works: it receives a batch of messages, accepts them, downloads them all, and only after it has finished downloading the entire batch does it print the download messages. So that ten-minute gap is while it is downloading all the stuff accepted above. If you reduce the batch size you will see quicker alternation between accept and download, and it might not appear stalled.
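A simplified model of that loop (not the real sarracenia flow code) shows why the log can go quiet for the length of a whole batch:

```python
# Simplified model of the subscriber main loop, just to show the logging pattern.
import time

def fetch_batch(batch):
    """Stand-in for gathering up to 'batch' notifications from the broker."""
    return [f"file_{i}.xml" for i in range(batch)]

def download(name):
    time.sleep(0.01)  # stand-in for an HTTP download

def run_once(batch=100):
    accepted = fetch_batch(batch)
    for name in accepted:
        print(f"after_accept accepted: {name}")     # all accept lines appear first
    for name in accepted:                           # then silence while the whole
        download(name)                              # batch downloads...
    for name in accepted:
        print(f"after_work downloaded ok: {name}")  # ...and only then the ok lines

run_once(batch=5)
```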
I changed my config to "batch 1" just to be pedantic.
I took a more detailed look at the logs for March 31st. They did not all stop at the same time. (1) stopped at 20:58:29. (2) stopped at 21:02:58. (3) stopped at 21:07:34. (4) stopped at 20:59:50.
That's scattered enough for me to believe the downloader is just slowing down (as you said).
I still think that something is going on here. Sometimes, the downloading almost stops (I guess it's technically still going, but if it takes more than 10 minutes to download a batch, I feel that's effectively stopped). But if I restart it, it catches up very quickly (indicating that it's not my network connection that's slowing it down).
It will be interesting to see what the slowdowns look like with "batch 1".
It's clustered within the same 9-minute period... if each instance had a certain number of messages pre-fetched, then this kind of spread looks normal to me. So far, I'm thinking the broker is slowing down... but it will be interesting to see.
OK, I had another slowdown today after I switched to "batch 1" and I noticed an interesting pattern. subscribe_rolf_01.log.gz
I assume that the "after_accept accepted" happens before the download and "after_work downloaded" happens after the download, and I also assume that the time between those lines is approximately the download time.
Given those assumptions, around "19:25:35", the time to download each file changes from ~100ms to ~8000ms. And it stays at very close to 8000ms for more than an hour, until I perform a reset at "20:36:31". After that, it immediately goes back to ~100ms and very quickly catches up.
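One way to check that assumption mechanically is to pair each file's after_accept line with its after_work line and measure the gap; here is a rough sketch, with regexes tuned to the log format shown above (the log file name is a placeholder):

```python
# Rough per-file "accept -> downloaded" timing from a subscribe log.
import re
from datetime import datetime

stamp = r'(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d\d\d)'
accept_re = re.compile(stamp + r' .*after_accept accepted:.*/([^/ ]+\.xml)$')
work_re = re.compile(stamp + r' .*after_work downloaded ok: .*/([^/ ]+\.xml)$')

def parse(ts):
    return datetime.strptime(ts, '%Y-%m-%d %H:%M:%S,%f')

accepted = {}
with open('subscribe_rolf_01.log') as log:
    for line in log:
        line = line.rstrip()
        m = accept_re.match(line)
        if m:
            accepted[m.group(2)] = parse(m.group(1))
            continue
        m = work_re.match(line)
        if m and m.group(2) in accepted:
            delta = parse(m.group(1)) - accepted.pop(m.group(2))
            print(f"{m.group(2)}: {delta.total_seconds():.1f}s accept->downloaded")
```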
I had another slowdown today. Same signature. The download time suddenly jumped up to ~8s per file. It remained there for about 2 hours until I noticed it. After I reset the service, it quickly caught back up.
Nothing new to say... hmm... it would be good to note the time of day when you restart... If there is no difference in the logs beyond what you have already provided, I'll look into just reproducing the whole flow here... ah... an idea occurs:
Question... are you still writing file names to a named pipe using a plugin? I remember a while ago... I think you had forgotten about it and the plugin was still there, but you weren't using it anymore? It would be good to remove that if you don't need it, and see if this slowdown goes away...
Yesterday, on instance 1, the downloads jumped to ~8s at 19:11:07 and I initiated a reset at 21:09:29. Instance 2: the downloads jumped to ~8s at 19:11:06. Instance 3: 19:11:07. Instance 4: 19:11:06.
So, clearly, all 4 instances are experiencing some problem at almost exactly the same second -- and just never clearing it.
Could this be another issue with cached connections? Could it be using something that times out after ~8s for each subsequent connection?
But I guess you said earlier that the instances do not talk to each other, and here we can see that they all start slowing down within 1 second of each other; which does seem strange.
I wonder if there was some sort of network issue at that moment in time which caused later downloads to take 8 seconds or something. Maybe a change in name servers? Maybe I'm getting a new IP address with different DNS?
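One way to test that kind of hypothesis would be to time name resolution and connection setup to the datamart host separately; if either step suddenly takes ~8 seconds, that points at DNS or connection establishment rather than the transfers themselves. A minimal sketch (the host name is taken from the logs above):

```python
# Time DNS resolution and HTTPS connection setup separately, to see whether a
# per-connection step (e.g. a name-server timeout) could account for ~8s per file.
import socket
import time
import http.client

host = 'dd5.weather.gc.ca'   # host seen in the logs above

t0 = time.time()
socket.getaddrinfo(host, 443)
t1 = time.time()
conn = http.client.HTTPSConnection(host, timeout=30)
conn.connect()
t2 = time.time()
conn.close()

print(f"DNS lookup: {t1 - t0:.2f}s, TCP+TLS connect: {t2 - t1:.2f}s")
```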
I am still using the named pipe. I write the filename to the pipe, but I actually only use the IATA code from the filename on the other end. I can't remove that plugin without using some OTHER way of notifying my program of a new file ready for processing.
I noticed a problem this morning with my v3 instance. It was more than 20 minutes behind on the feed; and I could see from the logs that it would sometimes just sit there for 10 minutes at a time between downloads (even though I know there are many files per minute that it should be downloading). I restarted the service and it quickly caught up; but the instance did not respond during the shutdown request:
I was running 4 instances at the time, and all 4 instances had similar patterns in their log files: subscribe_rolf_01.log.gz