MetPX / sarracenia

https://MetPX.github.io/sarracenia
GNU General Public License v2.0

v3 gradual slowdown #440

Closed endlisnis closed 1 year ago

endlisnis commented 2 years ago

I noticed a problem this morning with my v3 instance. It was more than 20 minutes behind on the feed, and I could see from the logs that it would sometimes just sit there for 10 minutes at a time between downloads (even though I know there are many files per minute that it should be downloading). I restarted the service and it quickly caught up, but the instance did not respond to the shutdown request:

$ ~/.local/bin/sr3 restart subscribe/rolf
restarting: sending SIGTERM .... ( 4 ) Done
Waiting 1 sec. to check if 4 processes stopped (try: 0)
Waiting 2 sec. to check if 4 processes stopped (try: 1)
Waiting 4 sec. to check if 4 processes stopped (try: 2)
Waiting 8 sec. to check if 4 processes stopped (try: 3)
Waiting 16 sec. to check if 4 processes stopped (try: 4)
doing SIGKILL this time
signal_pid( 2528642, SIGKILL )

.signal_pid( 2528644, SIGKILL )
.signal_pid( 2528645, SIGKILL )
.signal_pid( 2528643, SIGKILL )
.Done
Waiting again...
All stopped after KILL
.( 4 ) Done

I was running 4 instances at the time, and all 4 instances had similar patterns in their log files: subscribe_rolf_01.log.gz

petersilva commented 2 years ago

This is super interesting... here is the general trend of what I see in the log:

Periodic "housekeeping intervals" are supposed to happen every 'housekeeping' interval, which is, by default, 5 minutes. The actual intervals are all over the map... sample interval from beginning of the file: 09,23,37,50,03,07,21,34,48 ... Now it only checks to run housekeeping between transfer batches... so it looks like individual batches are taking 20 minutes...

The processing of messages/files is done in batches. You see after_accept messages happen at 0h09, then it is quiet for 15 minutes (until 0h23) while it downloads all those files. It is not that it was sitting there doing nothing; it's that the success messages are only logged after the entire batch is complete.

It is the transfers themselves from the datamart that are pathologically slow at those times... so the built-in transfers are slowing down... guesses:

hmm... I'm going to look into some periodic plugins to report memory usage. In the meantime, you can restart every couple of days with a cron job, which might prevent a recurrence. If you are willing to tolerate a recurrence instead, I will provide a memory reporter. We already have a bug outstanding where periodic processing is supposed to restart when memory leaks are detected (#395).
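
For example, a crontab entry along these lines would do it (the schedule and path are only an illustration; adjust to your install):

    # restart the subscribe/rolf configuration every other day at 03:00
    0 3 */2 * * $HOME/.local/bin/sr3 restart subscribe/rolf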

conclusion:

endlisnis commented 2 years ago

subscribe_rolf_01.log.2022-01-08.gz

Here's yesterday's log file that shows the slowdown start in the evening.

petersilva commented 2 years ago

b430327dac6bf12ef42cc3db0a92b2a3a6fd4953 337e2632835aa8ea40c10f8836b7b1c0b103dfa1

The average-latency messages produced at housekeeping intervals should be more helpful now... metrics are per housekeeping interval, rather than cumulative from the start.

endlisnis commented 2 years ago

It happened again: subscribe_rolf_01.log.2022-01-15.gz

You can ignore the 3am slowdown; that is caused by some local processes. But at around 9:00am, it got very slow again and only sped up after I restarted it.

petersilva commented 2 years ago

looking at the log, exactly as you say... I'd like some memory information to be in the housekeeping reports... still waiting on #395

endlisnis commented 2 years ago

There is something else going on here too. No matter how slow it gets, it never falls more than about 1770 seconds behind; and if I get an outage longer than ~1770 seconds, it resumes about 1770 seconds out (instead of the 4h that I have in my config).

It seems as if the queue is deleting messages after around 1770 seconds.

petersilva commented 2 years ago

bunch of different possibilities:

How many updates do you get per hour, approximately?

junhu3 commented 2 years ago

max-length:10000 for anonymous queues

petersilva commented 2 years ago

only ten thousand... yeah that makes more sense... it would do that...

endlisnis commented 2 years ago

I get ~20k items per hour, which makes the ~1770-second limit make sense.
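
Rough arithmetic, using the 10,000-message queue cap and ~20,000 messages/hour:

    10,000 messages / (20,000 messages per 3600 s) ~= 1800 s, i.e. about 30 minutes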

petersilva commented 2 years ago

fwiw @MagikEh is making great progress on #395 ... I think it will solve this.

petersilva commented 2 years ago

now that #395 is merged... @endlisnis you should update your configuration and stop/start. With the new code, if the memory usage gets too large, it should restart on its own, so you don't need to do anything.

endlisnis commented 2 years ago

I updated my clients. Let's wait a week and see if I have this problem again (I was having it on average about once every 2 days).

endlisnis commented 2 years ago

The clients ran fine for many days, but they've been acting very strange today. I'm not sure if it's the same problem, but I keep falling behind, and there seems to be a mix of very old and new entries in the log: subscribe_rolf_01.log.gz

I tried restarting it, but it seems to have just fallen behind again.

endlisnis commented 2 years ago

I also (in the above log) get a bunch of messages like "download incomplete download only 9452 of expected 9455 bytes for 2022-02-23-1740-CVYD-AUTO-minute-swob.xml" when I restart the clients.

petersilva commented 2 years ago

There were lots of problems with internal systems yesterday. The feeds to the DMZ systems were affected. I don't think anything yesterday is representative of a problem with your feed.

petersilva commented 2 years ago

Is it back to normal today?

endlisnis commented 2 years ago

Yes, it seems fine today.

endlisnis commented 2 years ago

I've been suffering problems over the last few days that also cause my client to fall way behind. I restarted them about 15 minutes ago and they quickly caught up. Here is a log from instance 1: subscribe_rolf_01.log.gz

petersilva commented 2 years ago

great log... It isn't clear when you restarted... It doesn't look like it logged anything on shutdown or startup. from your description and the log, I'm guessing 20:26?

summary of what is in the log

it started slowing here:

2022-03-23 12:57:56,825 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 3.00 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CVQZ/2022-03-23-1657-CVQZ-AUTO-minute-swob.xml
2022-03-23 12:57:56,825 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 3.00 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWAQ/2022-03-23-1657-CWAQ-AUTO-minute-swob.xml
2022-03-23 12:57:56,825 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.96 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWZE/2022-03-23-1657-CWZE-AUTO-minute-swob.xml
2022-03-23 12:57:56,825 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.96 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWKH/2022-03-23-1657-CWKH-AUTO-minute-swob.xml
2022-03-23 12:57:56,825 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.45 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWBM/2022-03-23-1657-CWBM-AUTO-minute-swob.xml
2022-03-23 12:59:17,937 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CXKI/2022-03-23-1657-CXKI-AUTO-minute-swob.xml
2022-03-23 12:59:17,937 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CXOX/2022-03-23-1657-CXOX-AUTO-minute-swob.xml
2022-03-23 12:59:17,937 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CXPT/2022-03-23-1657-CXPT-AUTO-minute-swob.xml
2022-03-23 12:59:17,937 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CXKE/2022-03-23-1656-CXKE-AUTO-minute-swob.xml
2022-03-23 12:59:17,937 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CWEW/2022-03-23-1657-CWEW-AUTO-minute-swob.xml
2022-03-23 12:59:17,937 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CVQZ/2022-03-23-1657-CVQZ-AUTO-minute-swob.xml
2022-03-23 12:59:17,938 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CWAQ/2022-03-23-1657-CWAQ-AUTO-minute-swob.xml
2022-03-23 12:59:17,938 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CWZE/2022-03-23-1657-CWZE-AUTO-minute-swob.xml
2022-03-23 12:59:17,938 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CWKH/2022-03-23-1657-CWKH-AUTO-minute-swob.xml
2022-03-23 12:59:17,938 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CWBM/2022-03-23-1657-CWBM-AUTO-minute-swob.xml
2022-03-23 12:59:20,440 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 61.79 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWDV/2022-03-23-1657-CWDV-AUTO-minute-swob.xml
2022-03-23 12:59:20,440 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 61.79 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CAMS/2022-03-23-1657-CAMS-AUTO-minute-swob.xml
2022-03-23 12:59:20,440 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 61.60 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CVXC/2022-03-23-1657-CVXC-AUTO-minute-swob.xml
2022-03-23 12:59:20,440 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 62.87 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWPN/2022-03-23-1657-CWPN-AUTO-minute-swob.xml

then it was stable for a little while until


2022-03-23 12:59:20,452 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 61.10 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CVOQ/2022-03-23-1657-CVOQ-AUTO-minute-swob.xml
2022-03-23 12:59:20,452 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 61.10 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWKX/2022-03-23-1657-CWKX-AUTO-minute-swob.xml
2022-03-23 12:59:20,452 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 61.10 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWOK/2022-03-23-1657-CWOK-AUTO-minute-swob.xml
2022-03-23 12:59:20,452 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 701.81 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CPUN/2022-03-23-1646-CPUN-AUTO-minute-swob.xml
2022-03-23 12:59:20,452 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 701.81 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWDJ/2022-03-23-1646-CWDJ-AUTO-minute-swob.xml
2022-03-23 12:59:20,452 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 701.81 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CVSJ/2022-03-23-1646-CVSJ-AUTO-minute-swob.xml

then it got gradually worse; the worst point was here:


2022-03-23 15:50:06,217 [INFO] sarracenia.flowcb.log housekeeping_stats lag: average: 1770.75, maximum: 1772.01
2022-03-23 15:50:06,217 [INFO] sarracenia.flowcb.log on_housekeeping housekeeping
2022-03-23 15:50:08,470 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.66 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CPOX/2022-03-23-1919-CPOX-AUTO-minute-swob.xml
2022-03-23 15:50:08,470 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.66 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CVFN/2022-03-23-1919-CVFN-AUTO-minute-swob.xml
2022-03-23 15:50:08,471 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.66 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWZT/2022-03-23-1919-CWZT-AUTO-minute-swob.xml
2022-03-23 15:50:08,471 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.66 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CPUN/2022-03-23-1919-CPUN-AUTO-minute-swob.xml
2022-03-23 15:50:08,471 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1806.14 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CVYD/2022-03-23-1919-CVYD-AUTO-minute-swob.xml
2022-03-23 15:50:08,471 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1806.14 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CTZN/2022-03-23-1919-CTZN-AUTO-minute-swob.xml
2022-03-23 15:50:08,471 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.66 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CZGH/2022-03-23-1919-CZGH-AUTO-minute-swob.xml
2022-03-23 15:50:08,472 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.66 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWIW/2022-03-23-1919-CWIW-AUTO-minute-swob.xml
2022-03-23 15:50:08,472 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1806.14 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CZKA/2022-03-23-1919-CZKA-AUTO-minute-swob.xml
2022-03-23 15:50:08,472 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1806.14 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CXET/2022-03-23-1919-CXET-AUTO-minute-swob.xml
2022-03-23 15:50:08,472 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.66 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CTPM/2022-03-23-1919-CTPM-AUTO-minute-swob.xml
2022-03-23 15:50:08,472 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.66 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CPQD/2022-03-23-1919-CPQD-AUTO-minute-swob.xml
2022-03-23 15:50:08,472 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1806.14 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/COGE/2022-03-23-1919-COGE-AUTO-minute-swob.xml
2022-03-23 15:50:08,473 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.65 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CTUX/2022-03-23-1919-CTUX-AUTO-minute-swob.xml
2022-03-23 15:50:08,473 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1806.14 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CASF/2022-03-23-1919-CASF-AUTO-minute-swob.xml
2022-03-23 15:50:08,473 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1805.65 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWNH/2022-03-23-1919-CWNH-AUTO-minute-swob.xml
2022-03-23 15:50:08,473 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1806.14 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CPYQ/2022-03-23-1919-CPYQ-AUTO-minute-swob.xml

then it was intermittently recovering for a while until this weirdness happened:


2022-03-23 17:11:13,001 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1576.31 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWCO/2022-03-23-2044-CWCO-AUTO-minute-swob.xml
2022-03-23 17:11:13,001 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1576.35 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWPZ/2022-03-23-2044-CWPZ-AUTO-minute-swob.xml
2022-03-23 17:11:13,001 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1576.31 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CXFV/2022-03-23-2044-CXFV-AUTO-minute-swob.xml
2022-03-23 17:11:13,001 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1576.35 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWVT/2022-03-23-2044-CWVT-AUTO-minute-swob.xml
2022-03-23 17:22:02,796 [ERROR] sarracenia.transfer read_writelocal util/writelocal mismatched file length writing 2022-03-23-2039-CWXA-AUTO-minute-swob.xml. Message said to expect 8564 bytes.  Got 8563 bytes.
2022-03-23 17:22:02,796 [ERROR] sarracenia.flow download incomplete download only 8563 of expected 8564 bytes for 2022-03-23-2039-CWXA-AUTO-minute-swob.xml
2022-03-23 17:22:02,796 [INFO] sarracenia.flow do_download attempt 1 failed to download https://dd5.weather.gc.ca//observations/swob-ml/20220323/CWXA/2022-03-23-2039-CWXA-AUTO-minute-swob.xml to /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CWXA/2022-03-23-2039-CWXA-AUTO-minute-swob.xml
2022-03-23 17:22:02,796 [WARNING] sarracenia.flow do_download downloading again, attempt 2
2022-03-23 17:22:10,904 [ERROR] sarracenia.transfer read_writelocal util/writelocal mismatched file length writing 2022-03-23-2039-CWXA-AUTO-minute-swob.xml. Message said to expect 8564 bytes.  Got 8563 bytes.
2022-03-23 17:22:10,904 [ERROR] sarracenia.flow download incomplete download only 8563 of expected 8564 bytes for 2022-03-23-2039-CWXA-AUTO-minute-swob.xml
2022-03-23 17:22:10,904 [INFO] sarracenia.flow do_download attempt 2 failed to download https://dd5.weather.gc.ca//observations/swob-ml/20220323/CWXA/2022-03-23-2039-CWXA-AUTO-minute-swob.xml to /home/rolf/weather/swobMirror/observations/swob-ml/20220323/CWXA/2022-03-23-2039-CWXA-AUTO-minute-swob.xml
2022-03-23 17:22:10,904 [WARNING] sarracenia.flow do_download downloading again, attempt 3
2022-03-23 17:22:19,008 [ERROR] sarracenia.transfer read_writelocal util/writelocal mismatched file length writing 2022-03-23-2039-CWXA-AUTO-minute-swob.xml. Message said to expect 8564 bytes.  Got 8563 bytes.
2022-03-23 17:22:19,008 [ERROR] sarracenia.flow download incomplete download only 8563 of expected 8564 bytes for 2022-03-23-2039-CWXA-AUTO-minute-swob.xml
2022-03-23 17:22:19,008 [INFO] sarracenia.flow do_download attempt 3 failed to download https://dd5.weather.gc.ca//observations/swob-ml/20220323/CWXA/2022-03-23-2039-CWXA-AUTO-minute-swob.xml to /home/rolf/weather/swobMirror

after recovering from that, it was suddenly only 5 minutes behind, instead of 30:


2022-03-23 17:25:27,174 [INFO] sarracenia.diskqueue on_housekeeping post_retry_001 on_housekeeping
2022-03-23 17:25:27,174 [INFO] sarracenia.diskqueue on_housekeeping No retry in list
2022-03-23 17:25:27,174 [INFO] sarracenia.diskqueue on_housekeeping on_housekeeping elapse 0.000157
2022-03-23 17:25:27,174 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current Memory cpu_times: user=3025.45 system=575.89 elapse=17455928.9
2022-03-23 17:25:27,174 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current Memory usage: 75.3 MiB / 224.3 MiB = 33.56%
2022-03-23 17:25:27,174 [INFO] sarracenia.flowcb.log housekeeping_stats messages_received: 100, accepted: 97, rejected: 3 ( bytes: 2134 )
2022-03-23 17:25:27,174 [INFO] sarracenia.flowcb.log housekeeping_stats files transferred: 93, cumulative bytes of data: 858293
2022-03-23 17:25:27,174 [INFO] sarracenia.flowcb.log housekeeping_stats lag: average: 1586.41, maximum: 1613.64
2022-03-23 17:25:27,174 [INFO] sarracenia.flowcb.log on_housekeeping housekeeping
2022-03-23 17:25:29,284 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 297.36 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWXA/2022-03-23-2020-CWXA-AUTO-minute-swob.xml
2022-03-23 17:25:29,284 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 297.36 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CACQ/2022-03-23-2043-CACQ-AUTO-minute-swob.xml
2022-03-23 17:25:29,284 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 296.33 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CVYD/2022-03-23-2043-CVYD-AUTO-minute-swob.xml
2022-03-23 17:25:29,284 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 296.33 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CABF/2022-03-23-2043-CABF-AUTO-minute-swob.xml
2022-03-23 17:25:29,284 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 296.33 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWPN/2022-03-23-2043-CWPN-AUTO-minute-swob.xml
2022-03-23 17:25:29,284 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 296.33 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWJV/2022-03-23-2044-CWJV-AUTO-minute-swob.xml

but the lag lengthened again... gradually... and the transfers went slowly, so it fell behind.


2022-03-23 17:51:45,770 [INFO] sarracenia.flowcb.log housekeeping_stats lag: average: 956.50, maximum: 962.80
2022-03-23 17:51:45,770 [INFO] sarracenia.flowcb.log on_housekeeping housekeeping
2022-03-23 17:51:48,669 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1593.56 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CXKI/2022-03-23-2121-CXKI-AUTO-minute-swob.xml
2022-03-23 17:51:48,669 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1592.77 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWFF/2022-03-23-2121-CWFF-AUTO-minute-swob.xml
2022-03-23 17:51:48,669 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1593.56 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CXRB/2022-03-23-2121-CXRB-AUTO-minute-swob.xml

lag wandered around again, built up to 30 minutes (1800 seconds) again, peaked at 1900, and gradually recovered to <10... then: a weird grouping of old messages. My guess is that when it was restarted and the re-connection happened, the broker resent some old messages that had never been acknowledged:

2022-03-23 20:26:09,926 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220324/CNCD/2022-03-24-0025-CNCD-AUTO-minute-swob.xml
2022-03-23 20:26:09,926 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220324/CWCH/2022-03-24-0025-CWCH-AUTO-minute-swob.xml
2022-03-23 20:26:09,926 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220324/CTPM/2022-03-24-0025-CTPM-AUTO-minute-swob.xml
2022-03-23 20:26:09,948 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 13274.32 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWXA/2022-03-23-2039-CWXA-AUTO-minute-swob.xml
2022-03-23 20:26:09,948 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 13274.32 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWXA/2022-03-23-2035-CWXA-AUTO-minute-swob.xml
2022-03-23 20:26:09,948 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 13274.31 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWXA/2022-03-23-2033-CWXA-AUTO-minute-swob.xml
2022-03-23 20:26:09,948 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 13273.26 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWXA/2022-03-23-2043-CWXA-AUTO-minute-swob.xml
2022-03-23 20:26:10,046 [ERROR] sarracenia.transfer read_writelocal util/writelocal mismatched file length writing 2022-03-23-2039-CWXA-AUTO-minute-swob.xml. Message said to expect 8564 bytes.  Got 8563 bytes.
2022-03-23 20:26:10,046 [ERROR] sarracenia.flow download incomplete download only 8563 of expected 8564 bytes for 2022-03-23-2039-CWXA-AUTO-minute-swob.xml

followed immediately by <2 second lag...

2022-03-23 20:26:13,498 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.28 ) https://dd5.weather.gc.ca /observations/swob-ml/20220324/CWLJ/2022-03-24-0025-CWLJ-AUTO-minute-swob.xml
2022-03-23 20:26:13,498 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.27 ) https://dd5.weather.gc.ca /observations/swob-ml/20220324/CZEV/2022-03-24-0025-CZEV-AUTO-minute-swob.xml
2022-03-23 20:26:13,498 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.27 ) https://dd5.weather.gc.ca /observations/swob-ml/20220324/CNEK/2022-03-24-0025-CNEK-AUTO-minute-swob.xml
2022-03-23 20:26:13,498 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.27 ) https://dd5.weather.gc.ca /observations/swob-ml/20220324/CWJR/2022-03-24-0025-CWJR-AUTO-minute-swob.xml
2022-03-23 20:26:13,499 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.26 ) https://dd5.weather.gc.ca /observations/swob-ml/20220324/CZGH/2022-03-24-0025-CZGH-AUTO-minute-swob.xml
2022-03-23 20:26:13,499 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.26 ) https://dd5.weather.gc.ca /observations/swob-ml/20220324/CVFN/2022-03-24-0025-CVFN-AUTO-minute-swob.xml
2022-03-23 20:26:13,499 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.26 ) https://dd5.weather.gc.ca /observations/swob-ml/20220324/CMSI/2022-03-24-0025-CMSI-AUTO-minute-swob.xml
endlisnis commented 2 years ago

"It doesn't look like it logged anything on shutdown or startup"

When I restarted it, none of the instances responded to the TERM requests. They ended up getting killed after a while.

This is the timing of the restart: 2022-03-23 20:19:25,571 [INFO] root stop_signal signal 15 received

endlisnis commented 2 years ago

I also noticed that it sometimes saves partially downloaded files. This is really brutal for my application. I have a periodic task that runs once per hour and tries to poll for files that were missed; but since PART of these files were downloaded, I can't easily tell that I should try to download them again.

I'm having to check the file extension: if it's .xml, I check whether ">" is the last character in the file. If it is not, I assume the file is corrupt and download it again.
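
A minimal sketch of that check (assuming, as above, that a complete SWOB XML file ends with ">"):

    from pathlib import Path

    def looks_complete(path: Path) -> bool:
        # Treat an .xml file whose last non-whitespace byte is not '>'
        # as a truncated download; other file types are not checked here.
        if path.suffix != ".xml":
            return True
        return path.read_bytes().rstrip().endswith(b">")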

For my application, it would be much better if partial downloads were thrown away.

matthewdarwin commented 2 years ago

I wrote my own downloader rather than using the built-in one.

I have a custom plugin where sarra just sends a notification to my downloader. My downloader ...

petersilva commented 2 years ago

fwiw @matthewdarwin ... you can do all that with the built-in downloader:

inflight tmp/

would download files to a subdirectory named tmp instead of to files named *.tmp, and then move them when complete.

acceptSizeWrong off

will make it fail downloads when the file size is wrong, and retry them automatically.

instances 4

You can control how many downloaders operate in parallel with instances. We have cases with 40 or more processes munching on the same queue. acceptSizeWrong is perhaps only in sr3, but the others have been there since at least 2018.

To prioritize things, configure separate subscriptions for the high-priority stuff vs. everything else (two separate configurations), so that there is a high-priority queue and another queue for the rest.
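
Putting those options together, a subscription config might contain lines like these (the values are illustrative, not a recommendation):

    # download into a tmp/ subdirectory, then move into place when complete
    inflight tmp/
    # reject size-mismatched downloads so they are retried automatically
    acceptSizeWrong off
    # number of download processes sharing the same queue
    instances 4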

petersilva commented 2 years ago

@endlisnis: for the wrong-size files giving you headaches... the short version is to add a line like this to your config:


acceptSizeWrong off

https://metpx.github.io/sarracenia/Reference/sr3_options.7.html#acceptsizewrong-boolean-default-false

matthewdarwin commented 2 years ago

Thanks for the hints. I need to prioritize downloads amongst the background scraper jobs too, which don't run via sarra. If the internet connection is down for a few hours, there are many files to download to catch up.

petersilva commented 2 years ago

hey @matthewdarwin if the download CLI isn't giving any value to you and you are using custom code anyways, you might want to use the (new) python API, and not bother with the CLI at all. Demonstration here:

endlisnis commented 2 years ago

acceptSizeWrong off should be the default in v3. For 99% of the use cases I can think of, keeping partial files is wrong and complicated.

petersilva commented 2 years ago

@endlisnis yes... looking at the code... I'm confused... because it already is the default... why it would not retry in that case is puzzling... looking.... wow... it did reject the file... it wasn't happy. it just failed to queue it up for retry... but it got it half-right, in that the reception was never acknowledged, so that later:


2022-03-23 20:26:09,948 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 13274.31 ) https://dd5.weather.gc.ca /observations/swob-ml/20220323/CWXA/2022-03-23-2033-CWXA-AUTO-minute-swob.xml

so it's a bug where it has kind of half-detected that the transfer failed... AMQP decided to resend the message when the re-connection happened. why didn't it get queued for retry? oh... it did:

2022-03-23 19:38:39,792 [INFO] sarracenia.diskqueue on_housekeeping Number of messages in retry list 4
2022-03-23 19:38:39,793 [INFO] sarracenia.diskqueue on_housekeeping on_housekeeping elapse 0.006980
2022-03-23 19:38:39,793 [INFO] sarracenia.diskqueue on_housekeeping post_retry_001 on_housekeeping
2022-03-23 19:38:39,793 [INFO] sarracenia.diskqueue on_housekeeping No retry in list

so getting on the retry list is ok. but then it prints 'No Retry in list'... so ... it's like it cannot read from it. Hey @reidsunderland this looks like the same case you were seeing in the flow test? it's real! stuff is getting put on the retry list, but fetching from it isn't working in at least one real case.

So there is a problem in the retry logic... we will work on that. I think @reidsunderland might already be on it in the context of another issue.

reidsunderland commented 2 years ago

Yes! It seems to be related to, or the same problem that I'm working on for #466.

endlisnis commented 2 years ago

I continue to have daily problems with my feeds falling behind (often lag exceeding 400 seconds, sometimes getting up near 1500 seconds). Do you need any more logs from me?

petersilva commented 2 years ago

#488 might fix this.

endlisnis commented 2 years ago

I have the fix for #488, but I (again) experienced a slowdown today. subscribe_rolf_01.log.gz

petersilva commented 2 years ago

Analyzing this one.

It was going up and down, as bad as 5 minutes but usually only a few seconds, until:


2022-03-28 02:19:19,697 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.26 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CXWG/2022-03-28-0618-CXWG-AUTO-minute-swob.xml
2022-03-28 02:19:19,697 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.26 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWJR/2022-03-28-0618-CWJR-AUTO-minute-swob.xml
2022-03-28 02:19:19,697 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.26 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CTPQ/2022-03-28-0618-CTPQ-AUTO-minute-swob.xml
2022-03-28 02:19:19,697 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.26 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWLJ/2022-03-28-0617-CWLJ-AUTO-minute-swob.xml
2022-03-28 03:25:28,486 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220328/CZPK/2022-03-28-0601-CZPK-AUTO-minute-swob.xml
2022-03-28 03:25:28,487 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220328/CARP/2022-03-28-0618-CARP-AUTO-minute-swob.xml
20

then it hung for an hour or so...

2022-03-28 03:25:28,495 [INFO] sarracenia.flowcb.log stats lag: average: 4.07, maximum: 19.37
2022-03-28 03:25:28,495 [INFO] sarracenia.flowcb.log on_housekeeping housekeeping
2022-03-28 03:25:31,292 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1603.51 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CVGE/2022-03-28-0653-CVGE-AUTO-minute-swob.xml
2022-03-28 03:25:31,293 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1603.51 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWXB/2022-03-28-0658-CWXB-AUTO-minute-swob.xml
2022-03-28 03:25:31,293 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1597.86 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWPU/2022-03-28-0658-CWPU-AUTO-minute-swob.xml
2022-03-28 03:25:31,293 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1597.86 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/COGE/2022-03-28-0658-COGE-AUTO-minute-swob.xml
2022-03-28 03:25:31,293 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1598.39 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CVRA/2022-03-28-0658-CVRA-AUTO-minute-swob.xml

caught up in 4 minutes after restoral:

2022-03-28 03:39:55,068 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 0.50 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CVBB/2022-03-28-0739-CVBB-AUTO-minute-swob.xml
2022-03-28 03:39:55,069 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 0.49 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWVQ/2022-03-28-0739-CWVQ-AUTO-minute-swob.xml
2022-03-28 03:39:55,069 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 0.49 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWUS/2022-03-28-0739-CWUS-AUTO-minute-swob.xml
2022-03-28 03:39:55,590 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220328/CVBB/2022-03-28-0739-CVBB-AUTO-minute-swob.xml

The symptom is different. It doesn't show any error message, it just hangs. Did you restart around 3:25? Perhaps a cron job? Anyway, it was then good for about 6 hours, and then it gradually slowed down:

2022-03-28 09:40:50,470 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 8.07 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWCT/2022-03-28-1339-CWCT-AUTO-minute-swob.xml
2022-03-28 09:40:50,471 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 7.91 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWTE/2022-03-28-1339-CWTE-AUTO-minute-swob.xml
2022-03-28 09:40:50,471 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 8.05 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CNZS/2022-03-28-1339-CNZS-AUTO-minute-swob.xml
2022-03-28 09:47:37,401 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220328/CMGZ/2022-03-28-1339-CMGZ-AUTO-minute-swob.xml
.
.
.
2022-03-28 09:47:39,509 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 400.76 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWSZ/2022-03-28-1340-CWSZ-AUTO-minute-swob.xml
2022-03-28 09:47:39,509 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 400.75 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CXCA/2022-03-28-1340-CXCA-AUTO-minute-swob.xml

... then a 13-minute delay while downloading occurs... those 13 minutes are the time between the start of the downloads and the completion of the batch of files accepted above.

2022-03-28 09:47:39,512 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 397.56 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWEI/2022-03-28-1340-CWEI-AUTO-minute-swob.xml
2022-03-28 10:01:12,261 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220328/CWSZ/2022-03-28-1340-CWSZ-AUTO-minute-swob.xml

.
.
.

... in the same second, it receives the next batch of messages:
2022-03-28 10:01:12,263 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220328/CWEI/2022-03-28-1340-CWEI-AUTO-minute-swob.xml
2022-03-28 10:01:12,263 [INFO] sarracenia.flow run on_housekeeping pid: 3385964 subscribe/rolf instance: 1
2022-03-28 10:01:12,263 [INFO] sarracenia.flowcb.gather.message on_housekeeping messages: good: 100 bad: 0 bytes: 11.9 KiB average: 121 Bytes
... now it is delayed by about 18 minutes...
2022-03-28 10:01:12,264 [INFO] sarracenia.flowcb.log on_housekeeping housekeeping
2022-03-28 10:01:14,423 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1136.29 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CWLV/2022-03-28-1341-CWLV-AUTO-minute-swob.xml
2022-03-28 10:01:14,423 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1136.29 ) https://dd5.weather.gc.ca /observations/swob-ml/20220328/CERM/2022-03-28-1341-CERM-AUTO-minute-swob.xml

so... that doesn't look like any client-side error, but rather internet congestion... It could be on our end or your end, or somewhere in between. hmm... looking at stats... yeah, there is heavy load, but it does not look saturated... Are you pulling down a lot of other stuff at the same time (maybe weather model outputs?) that could compete with this?

petersilva commented 2 years ago

oh you can also set batch 1 if you want to see the delay accumulate more directly.

endlisnis commented 2 years ago

That computer is pretty idle during the day. I only do significant downloading once per hour at *:15 when I manually sync up with the SWOB directories. If the AMQP client is far behind, that might be a bunch of downloading.

I do share my home internet with my work connection. I'm often doing Teams video meetings on it, but it's 100M service, so it shouldn't be THAT bad. Also, I would expect the clients to catch up after those sessions.

I restarted it at 15:34:43. That's the event that I was trying to show. I guess the log file I posted didn't show it catch up quickly after that restart. It was back to ~5s lag within 3 minutes.

The reason I don't think it's just congestion is that it caught up really fast after I restarted the service.

petersilva commented 2 years ago

swob is tiny... it should not be that big a deal. no... I was thinking piles of model outputs, or satellite imagery or something... Teams is supposed to be < 2 Mb/s... and you say a restart fixed it... hmm... odd.

petersilva commented 2 years ago

so there is a byteRateMax feature to do bandwidth limiting... if that is being triggered inappropriately, then with the latest commit you should see an info message:


  [INFO] ... throttle exceeded byteRateMax X sleeping for y 

I doubt that's it, but might as well.

endlisnis commented 2 years ago

I just updated and restarted. Let's see what tomorrow looks like...

endlisnis commented 2 years ago

Another day, another slowdown. After it was more than 1000 seconds behind and not catching up, I restarted it at "21:10:06" and it quickly caught up. subscribe_rolf_01.log.gz

petersilva commented 2 years ago

How many instances are you running? I think 5 would be a good number. If you run only 1, all downloads are sequential; with multiple instances they overlap (across the various instances). It might just be accumulating delay from making everything sequential. The log above doesn't look that bad: it gets slow and catches up multiple times during the day, and it isn't clear that it would not have caught up again. hmm... but it would be good to see what happens with more than one instance. If it's already >1, then I'll need to think of something else.

instances 5
endlisnis commented 2 years ago

I'm running 4 instances.

If you look at just before I reset it, it had sat for literally 12 minutes with no output. The same was true for the other 3 instances. Nobody was downloading anything.

petersilva commented 2 years ago

All of them stopped at the same time? It's not like instances communicate with one another... they are just independent processes. It's hard to see how a software problem would affect four processes identically after hours of downloading different stuff. It's starting to look like the broker is involved here, rather than the client software... hmm...

petersilva commented 2 years ago

I suggest

   batch 5

You see these long gaps because the actual downloader is not printing anything: it receives a batch of messages, accepts them, downloads them all, and only after it has finished downloading the entire batch does it print the download messages. So that ten-minute gap is while it is downloading all the stuff accepted above. If you reduce the batch size you will see quicker alternation between accept and download, and it might not appear stalled.
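
Roughly, the loop looks like this (a simplified illustration, not the actual sarracenia source; the function and message names here are made up):

    import time

    def run_one_batch(messages, per_file_seconds=0.0):
        # the whole batch is accepted (and logged) first
        for m in messages:
            print("after_accept accepted:", m)
        # then every file in the batch is downloaded, with no log output
        for m in messages:
            time.sleep(per_file_seconds)   # stand-in for one download
        # only after the whole batch finishes are the successes logged
        for m in messages:
            print("after_work downloaded ok:", m)
        # housekeeping is only considered here, between batches

    run_one_batch(["file-%d.xml" % i for i in range(5)])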

endlisnis commented 2 years ago

I changed my config to "batch 1" just to be pedantic.

I took a more detailed look at the logs for March 31st. They did not all stop at the same time. (1) stopped at 20:58:29. (2) stopped at 21:02:58. (3) stopped at 21:07:34. (4) stopped at 20:59:50.

That's scattered enough for me to believe the downloader is just slowing down (as you said).

I still think that something is going on here. Sometimes, the downloading almost stops (I guess it's technically still going, but if it takes more than 10 minutes to download a batch, I feel that's effectively stopped). But if I restart it, it catches up very quickly (indicating that it's not my network connection that's slowing it down).

It will be interesting to see what the slowdowns look like with "batch 1".

petersilva commented 2 years ago

It's clustered within the same 9-minute period... if each instance had a certain number of messages pre-fetched, then this kind of spread looks normal to me. So far, I'm thinking the broker is slowing down... but it will be interesting to see.

endlisnis commented 2 years ago

OK, I had another slowdown today after I switched to "batch 1" and I noticed an interesting pattern. subscribe_rolf_01.log.gz

I assume that the "after_accept accepted" happens before the download and "after_work downloaded" happens after the download, and I also assume that the time between those lines is approximately the download time.

Given those assumptions, around "19:25:35", the time to download each file changes from ~100 ms to ~8000 ms. It stays very close to 8000 ms for more than an hour, until I perform a reset at "20:36:31". After that, it immediately goes back to ~100 ms and very quickly catches up.
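
Here is roughly how I measure it (a sketch assuming the log format shown in the excerpts above, and batch 1 so each after_accept is paired with the next after_work):

    import re
    from datetime import datetime

    # match the timestamp and whether the line is an accept or a completion
    line_re = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d+) \[INFO\] "
                         r"\S+ (after_accept|after_work)")

    def download_gaps(path):
        last_accept = None
        with open(path) as log:
            for line in log:
                m = line_re.match(line)
                if not m:
                    continue
                stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
                if m.group(2) == "after_accept":
                    last_accept = stamp
                elif last_accept is not None:
                    yield (stamp - last_accept).total_seconds()
                    last_accept = None

    # e.g. print gaps longer than one second:
    # for gap in download_gaps("subscribe_rolf_01.log"):
    #     if gap > 1.0:
    #         print(gap)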

endlisnis commented 2 years ago

I had another slowdown today. Same signature. The download time suddenly jumped up to ~8s per file. It remained there for about 2 hours until I noticed it. After I reset the service, it quickly caught back up.

petersilva commented 2 years ago

nothing new to say... hmm... It would be good to note the time of day when you restart. If there is no difference in the logs beyond what you have already provided, I'll look into just reproducing the whole flow here... ah... an idea occurs:

question... are you still writing the file name to a named pipe using a plugin? I remember a while ago... I think you had forgotten about it and the plugin was still there, but you weren't using it anymore? It would be good to remove that if you don't need it, and see if this slowdown goes away...

endlisnis commented 2 years ago

Yesterday, on instance 1, the downloads jumped to ~8s at "19:11:07" and I initiated a reset at "21:09:29". Instance 2: the downloads jumped to ~8s at "19:11:06". Instance 3: "19:11:07". Instance 4: "19:11:06".

So, clearly, all 4 instances are experiencing some problem at almost exactly the same second -- and just never clearing it.

Could this be another issue with cached connections? Could it be using something that times out after ~8s for each subsequent connection?

But I guess you said earlier that the instances do not talk to each other, and here we can see that they all start slowing down within 1 second of each other, which does seem strange.

I wonder if there was some sort of network issue at that moment in time which caused later downloads to take 8 seconds or something. Maybe a change in name servers? Maybe I'm getting a new IP address with different DNS?

I am still using the named pipe. I write the filename to the pipe, but I actually only use the IATA code from the filename on the other end. I can't remove that plugin without using some OTHER way of notifying my program of a new file ready for processing.