MetPX / sarracenia

https://MetPX.github.io/sarracenia
GNU General Public License v2.0

v3 gradual slowdown #440

Closed: endlisnis closed this issue 1 year ago

endlisnis commented 2 years ago

I noticed a problem this morning with my v3 instance. It was more than 20 minutes behind on the feed, and I could see from the logs that it would sometimes just sit there for 10 minutes at a time between downloads (even though I know there are many files per minute that it should be downloading). I restarted the service and it quickly caught up, but the instances did not respond to the shutdown request:

$ ~/.local/bin/sr3 restart subscribe/rolf
restarting: sending SIGTERM .... ( 4 ) Done
Waiting 1 sec. to check if 4 processes stopped (try: 0)
Waiting 2 sec. to check if 4 processes stopped (try: 1)
Waiting 4 sec. to check if 4 processes stopped (try: 2)
Waiting 8 sec. to check if 4 processes stopped (try: 3)
Waiting 16 sec. to check if 4 processes stopped (try: 4)
doing SIGKILL this time
signal_pid( 2528642, SIGKILL )

.signal_pid( 2528644, SIGKILL )
.signal_pid( 2528645, SIGKILL )
.signal_pid( 2528643, SIGKILL )
.Done
Waiting again...
All stopped after KILL
.( 4 ) Done

I was running 4 instances at the time, and all 4 instances had similar patterns in their log files: subscribe_rolf_01.log.gz

petersilva commented 2 years ago

Aha! A pattern... that is encouraging! That's midnight Z in the winter, but only 23h in the summer.

Um, do you have a cron job running that might be archiving on your disk, or something else competing with the downloader for I/O?

You could put in a cron job at 19:15 to restart it automagically as a work-around... but it would be nice to actually figure it out.

It's probably not the named pipe thing. Or, if it is... maybe that thing dies at 19:00, then the pipe fills up and weird slowdowns happen; when you restart, it restarts the other thing and empties the pipe?

I don't know what sort of other processing you are doing, but we could do a plugin to invoke it directly from sarra, rather than doing IPC? But I don't think that's the problem, based on the timing info.

endlisnis commented 2 years ago

It happened at 19:25 one day, and 19:11 3 days later. That is interesting.

I do have an archiving task, but it runs at 02:00 each day. The pipe will stall for about 40 minutes during that archiving, but (oddly enough), that never seems to bother sr3; it recovers from that just fine.

I do run hourly tasks at 15 minutes after each hour, but that should not really bother sr3 (in theory).

When I restart sr3, it does NOT restart the other end of the pipe, so I don't see how restarting sr3 would somehow recover both ends of that named pipe. The other end of the pipe is idle during those slowdown periods.

There are certainly times when my scripts fail to read from the pipe (they sometimes block for hours and need some human interaction to fix them up again). But when that happens, sr3 just blocks forever (which is fine -- it is NOT what we are seeing here). I would actually love a way to have a much deeper named pipe (think hundreds of megs), but I haven't put too much thought into it.
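For what it's worth, Linux does allow a deeper pipe, though not hundreds of megs by default: a program can enlarge a FIFO's kernel buffer with fcntl(F_SETPIPE_SZ), bounded by /proc/sys/fs/pipe-max-size (1 MiB unless root raises it). A minimal sketch, assuming the reader end of the pipe is already open (a non-blocking write-only open fails with ENXIO otherwise):

    import fcntl, os

    # F_SETPIPE_SZ / F_GETPIPE_SZ are exposed by the fcntl module on
    # Python 3.10+; the raw values from linux/fcntl.h work on older versions.
    F_SETPIPE_SZ = getattr(fcntl, 'F_SETPIPE_SZ', 1031)
    F_GETPIPE_SZ = getattr(fcntl, 'F_GETPIPE_SZ', 1032)

    fd = os.open('/home/rolf/weather/rxpipe', os.O_WRONLY | os.O_NONBLOCK)
    fcntl.fcntl(fd, F_SETPIPE_SZ, 1024 * 1024)  # ask the kernel for a 1 MiB buffer
    print('pipe buffer is now', fcntl.fcntl(fd, F_GETPIPE_SZ), 'bytes')
    os.close(fd)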

petersilva commented 2 years ago

for pipe... we could just write to an hourly file, and your reader could just stay an hour behind all the time... so we write a new file every hour, and your reader reads all the files except the current one...
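A minimal sketch of that rotation scheme (illustrative paths and payload, not an existing plugin):

    import time

    def hourly_path(base='/home/rolf/weather/rx'):
        # one file per hour, e.g. /home/rolf/weather/rx.2022040719
        return base + time.strftime('.%Y%m%d%H')

    # writer side: append each received path to the current hour's file;
    # the reader consumes (and then deletes) every rx.* file except the newest.
    with open(hourly_path(), 'a') as f:
        f.write('observations/swob-ml/some-station/some-file.xml\n')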

but not solving the issue...

petersilva commented 2 years ago

you don't happen to know if it happened an hour later during the winter, do you? That would perhaps indicate a UTC cron job on our side.

petersilva commented 2 years ago

hey... we can do an A:B test to exclude client-side stuff. Can you re-point your job at hpfx? Change the config file to:


broker  https://hpfx.collab.science.gc.ca

subtopic  *.WXO-DD.<whatever you had for dd> 

We try that for a few days and see if the timing stays the same?

endlisnis commented 2 years ago

I run into trouble when I use hpfx: I use mirror True, and that ends up pushing files to a different path on my local setup. Is there a way to strip off some path parts for the mirror path?

petersilva commented 2 years ago
strip 2 

strip first two elements of the relPath?

https://metpx.github.io/sarracenia/Reference/sr3_options.7.html?highlight=strip#strip-count-regexp-default-0
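For illustration, with a hypothetical hpfx relPath (the station and dates are made up):

    # hpfx prefixes datasets with a date and WXO-DD, e.g.:
    #   20220408/WXO-DD/observations/swob-ml/20220408/CXTO/...
    # "strip 2" removes the first two path elements, so with mirror True
    # the file lands under the same path that dd delivers:
    #   observations/swob-ml/20220408/CXTO/...
    strip 2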

endlisnis commented 2 years ago

I can't stop my instances after doing a git pull. I get this error:

$ ~/.local/bin/sr3 stop subscribe/rolf
Traceback (most recent call last):
  File "/home/rolf/.local/bin/sr3", line 33, in <module>
    sys.exit(load_entry_point('metpx-sr3', 'console_scripts', 'sr3')())
  File "/home/rolf/weather/sr3/sarracenia/sr.py", line 2019, in main
    gs = sr_GlobalState(cfg, cfg.configurations)
  File "/home/rolf/weather/sr3/sarracenia/sr.py", line 1020, in __init__
    self._read_configs()
  File "/home/rolf/weather/sr3/sarracenia/sr.py", line 307, in _read_configs
    cfgbody.fill_missing_options(c, cfg)
  File "/home/rolf/weather/sr3/sarracenia/config.py", line 1498, in fill_missing_options
    queuefile += os.sep + component + '.' + cfg + '.' + self.broker.url.username
TypeError: can only concatenate str (not "NoneType") to str
petersilva commented 2 years ago
cd to wherever your source is, then:

    pip install -e .

again... sometimes things get messed up?

endlisnis commented 2 years ago

I had tried that before I posted my previous comment.

petersilva commented 2 years ago

can you cat your config file? this stuff works fine for the configs I have...

petersilva commented 2 years ago

perhaps try adding to ~/.config/sr3/credentials.conf:


amqps://anonymous:anonymous@hpfx.collab.science.gc.ca
endlisnis commented 2 years ago
rolf@endlisnis11 ~/weather
$ cat ~/.config/sr3/credentials.conf
amqps://anonymous:anonymous@dd.weather.gc.ca
amqps://anonymous:anonymous@hpfx.collab.science.gc.ca

rolf@endlisnis11 ~/weather
$ cat ~/.config/sr3/subscribe/rolf.conf
broker amqps://anonymous:anonymous@dd.weather.gc.ca/

queue_name q_anonymous.rolf.20211222
instances 4

directory /home/rolf/weather/swobMirror/
# All stations
topicPrefix v02.post
subtopic observations.swob-ml.#
batch 1

mirror True
reject .*/partners/.*
reject .*/moored-buoys/.*
accept .*
expire 4h
acceptSizeWrong off
inflight tmp/

rxpipe_name /home/rolf/weather/rxpipe
flowCallback rxpipe_gzip.RxPipe_gzip

rolf@endlisnis11 ~/weather
$ ~/.local/bin/sr3 stop subscribe/rolf
Traceback (most recent call last):
  File "/home/rolf/.local/bin/sr3", line 33, in <module>
    sys.exit(load_entry_point('metpx-sr3', 'console_scripts', 'sr3')())
  File "/home/rolf/weather/sr3/sarracenia/sr.py", line 2019, in main
    gs = sr_GlobalState(cfg, cfg.configurations)
  File "/home/rolf/weather/sr3/sarracenia/sr.py", line 1020, in __init__
    self._read_configs()
  File "/home/rolf/weather/sr3/sarracenia/sr.py", line 307, in _read_configs
    cfgbody.fill_missing_options(c, cfg)
  File "/home/rolf/weather/sr3/sarracenia/config.py", line 1498, in fill_missing_options
    queuefile += os.sep + component + '.' + cfg + '.' + self.broker.url.username
TypeError: can only concatenate str (not "NoneType") to str
endlisnis commented 2 years ago

I had to change the broker line to broker https://anonymous:anonymous@hpfx.collab.science.gc.ca to make it work.

endlisnis commented 2 years ago

And then I got:

2022-04-07 19:07:58,798 [CRITICAL] sarracenia.moth ProtocolPresent Protocol scheme https unsupported for communications with message brokers
endlisnis commented 2 years ago

I got it working by switching to broker amqps://anonymous:anonymous@hpfx.collab.science.gc.ca

endlisnis commented 2 years ago

A problem that got me very confused about this exception was that, even though I was only trying to stop the "rolf" config, a problem with the "hpfx" config file was causing the Python to crash. I was not expecting the script to fail on some OTHER config file when asked to stop one specific service. I had to add some debug code into config.py to figure out what was going on.

endlisnis commented 2 years ago

subscribe_hpfx_01.log.gz

After switching over to hpfx, it ran for a little more than 30 minutes before getting stuck (in a different, but familiar way this time).

At "19:45:03", it started getting "[Errno -2] Name or service not known", and never recovered. All the instances started at the exact same time. I wouldn't be surprised to find out that there was a momentary failure in DNS.

At "20:01:52", I noticed the problem and restarted the service. It very quickly caught up.

petersilva commented 2 years ago

It's puzzling, not sure what to do about it.

endlisnis commented 2 years ago

Why not read in all of the configs only when you need to make the consistency checks? When I'm asking to stop a specific service, I assume there's no point in doing global consistency checks.

Or, if there is a need to do global consistency checks even when stopping an individual service, you could warn about configs that are not parsable, but still continue on.
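A sketch of that tolerant behaviour (illustrative names, not the actual sr.py API):

    import logging

    logger = logging.getLogger(__name__)

    def read_configs(paths, parse_one):
        """Parse each config; warn about and skip broken ones, instead of
        letting one bad file abort the whole CLI. parse_one stands in for
        whatever sr.py does per config (e.g. fill_missing_options)."""
        good = {}
        for p in paths:
            try:
                good[p] = parse_one(p)
            except Exception as e:
                logger.warning("skipping unparseable config %s: %s", p, e)
        return good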

endlisnis commented 2 years ago

Well, I think I've figured out what triggers the 8-second download times. I'm using VPN software (Cisco AnyConnect). When the VPN disconnects, the sr3 instances start getting this 8 second delay on each file. If I re-connect to the VPN (without restarting sr3), then the downloading goes back to normal and catches up.

I've tried this a few times now, and it's reproducible.

petersilva commented 2 years ago

Wow! That's great! What's puzzling is why restarting sr3 ever worked. When the VPN goes off, does anything else get slow? DNS lookups? Ping to google or something?

endlisnis commented 2 years ago

Oh, to clarify: Disconnecting from the VPN slows down sr3. Reconnecting to the VPN restores sr3 speed. Restarting sr3 after the VPN disconnect also restores sr3 speed.

No other programs slow down on my system after the disconnect. Some part of sr3 is maintaining state that causes the delay when the VPN disconnects.

I have a feeling that this is going to turn out to be a failed DNS lookup, or trying to route the request across a non-existing network interface. The VPN tool does inject 2 extra DNS servers into /etc/resolv.conf.

Now, the funny thing is that I had no trouble reproducing this last night. 3 times in a row it would start having trouble exactly when my VPN disconnected, but now (this morning) I can't reproduce it.

I wonder if the VPN has to be connected for a while for sr3 to cache information about the alternate network.

petersilva commented 2 years ago

sr3 is a naive user of DNS (from reviewing the source code).

The thing is, most browsers have built-in DNS caching, so you cannot test that way; you need to ping or look up a host from the command line. On Linux, there are things like nscd and systemd-resolved.service that have rendered DNS caching incomprehensible. These days I just test... and then debug from how it actually works, rather than trying to guess.
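For example, to time the resolver call that Python's networking stack makes before connecting, independent of any browser cache (a quick probe, not sr3 code):

    import socket, time

    t0 = time.perf_counter()
    socket.getaddrinfo('hpfx.collab.science.gc.ca', 443)
    print(f"getaddrinfo took {time.perf_counter() - t0:.3f}s")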

when logging in via SSL there might be some additional checks... as an A:B test, you could try adding:

tlsRigour lax

to see if the timing changes at all... it might be TLS doing a reverse lookup or something?

petersilva commented 2 years ago

if running on Linux... when sr3 is slow, try:

    host -C www.cbc.ca

to get a more elaborate host lookup (involving consistency checks).

endlisnis commented 2 years ago

That command outputs literally nothing when I run it on my Linux machine.

rolf@endlisnis11 ~/weather/sr3
$ host -C www.cbc.ca

rolf@endlisnis11 ~/weather/sr3
$ host -C www.google.com
endlisnis commented 2 years ago

sr3 is a naive user of DNS (from reviewing the source code). The thing is, most browsers have built-in DNS caching, so you cannot test that way; you need to ping or look up a host from the command line. On Linux, there are things like nscd and systemd-resolved.service that have rendered DNS caching incomprehensible. These days I just test... and then debug from how it actually works, rather than trying to guess.

I have tried using "curl" to download the exact URL of one of the swob XML files while sr3 was in slow mode. It completed in 114 ms. I assume that if there is any DNS caching going on, it's behaving the same in sr3 as in curl.

Previously, you mentioned something about re-using connections (this was for the bug where it would fail once and then keep failing forever; the bug you already fixed). Could it be related to that type of connection re-use? Maybe the routing table changes when the VPN disconnects, which causes old connections to have lots of packet loss or something. The funny thing is that it does EVENTUALLY work (after ~8s).

petersilva commented 2 years ago

the host -C thing... I had given the info based on the man page... but I have the same result as you... not useful.

endlisnis commented 2 years ago

When I use a batch size larger than 1, the time for the whole batch is about 8 seconds times the batch size. So, yeah, maybe this isn't DNS.

endlisnis commented 2 years ago

I was able to reproduce it tonight. I tried to use the same circumstances as before (I'm not sure how many of these pieces actually matter): (1) I leave my VPN connected for a few hours. (2) I run sr3 during those hours. (3) At around 19:15, I disable the VPN.

Sure enough, I was able to reproduce the problem.

While I was in that state, I ran "host www.cbc.ca" and even:

$ host hpfx.collab.science.gc.ca
hpfx.collab.science.gc.ca has address 142.98.224.27

The output was no different when run again after I restarted the service.

I also ran a traceroute (really an mtr) when the VPN was connected, and after the disconnect. No differences in the routes.

endlisnis commented 2 years ago

Is there some debug that I can enable that would give more output about what it's doing during that 8-second delay?

endlisnis commented 2 years ago

for pipe... we could just write to an hourly file, and your reader could just stay an hour behind all the time... so we write a new file every hour, and your reader reads all the files except the current one...

I actually came up with a better solution for this. I'm including my updated plugin. It uses an SQL database instead of a named pipe. rxqueue_gzip.py.gz
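For a rough idea of the shape of it: a minimal sketch following the sarracenia v3 flowcb convention (hypothetical class name and database path; the attached file is the real thing), loaded the same way as the rxpipe callback in the config above (flowCallback module.Class):

    import sqlite3
    from sarracenia.flowcb import FlowCB

    class RxQueue(FlowCB):
        def __init__(self, options):
            super().__init__(options)
            self.db = sqlite3.connect('/home/rolf/weather/rxqueue.db')
            self.db.execute('CREATE TABLE IF NOT EXISTS rx '
                            '(path TEXT, done INTEGER DEFAULT 0)')
            self.db.commit()

        def after_work(self, worklist):
            # record each successfully downloaded file; the reader marks rows
            # done at its own pace, so a stalled reader never blocks sr3.
            for msg in worklist.ok:
                self.db.execute('INSERT INTO rx(path) VALUES (?)',
                                (msg['new_dir'] + '/' + msg['new_file'],))
            self.db.commit()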

petersilva commented 2 years ago

Very cool! I would accept a PR if you want it added to Sarracenia itself... or you could just add a doc string giving distribution permissions compatible with GPLv2 (used by Sarracenia). Either way, authorship needs to be indicated somehow.

endlisnis commented 2 years ago

The hpfx server configuration hit the same slowdown this morning, when my VPN software disconnected, that the dd configuration was hitting before.

I restarted it at "11:27:55" and it quickly caught up.

Although I guess that hpfx allows deeper queues, because I'm normally limited to ~1700 seconds of data, whereas I got well over 5000s on hpfx.

subscribe_hpfx_01.log.gz

petersilva commented 2 years ago

OK, so that eliminates the server side as the problem. The problem is now very likely within sr3 itself. I will look at the download code more closely.

endlisnis commented 2 years ago

Very cool! I would accept a PR if you want it added to Sarracenia itself... or you could just add a doc string giving distribution permissions compatible with GPLv2 (used by Sarracenia). Either way, authorship needs to be indicated somehow.

I just created a pull request: https://github.com/MetPX/sarracenia/pull/512

endlisnis commented 2 years ago

The problem is now very likely within sr3 itself. I will look at the download code more closely.

Any updates on this? I still experience this slowdown almost every day when my VPN disconnects.

petersilva commented 2 years ago

no, have not had time to look at this...

petersilva commented 1 year ago

OK, I happened to have a good opportunity: debugging on a Windows laptop running Cisco AnyConnect, I disconnected and re-connected... when I did it, sr3 noticed immediately that the connection had died, and it reconnected and continued:

2023-08-29 13:26:35,620 [INFO] 20456 sarracenia.flowcb.log after_accept accepted: (lag: 212.84 ) https://hpfx.collab.science.gc.ca /20230829/WXO-DD/bulletins/alphanumeric/20230829/SX/KWAL/17/SXCN40_KWAL_291722___6464
2023-08-29 13:26:36,792 [WARNING] 20456 sarracenia.moth.amqp ack failed for tag: 562: EOF occurred in violation of protocol (_ssl.c:2426)
2023-08-29 13:26:37,074 [INFO] 20456 sarracenia.moth.amqp getSetup queue declared q_anonymous_subscribe.hpfx_amis.90328667.61260748 (as: amqps://anonymous@hpfx.collab.science.gc.ca/)
2023-08-29 13:26:37,074 [INFO] 20456 sarracenia.moth.amqp getSetup binding q_anonymous_subscribe.hpfx_amis.90328667.61260748 with v02.post.*.WXO-DD.bulletins.alphanumeric.# to xpublic (as: amqps://anonymous@hpfx.collab.science.gc.ca/)
2023-08-29 13:26:37,106 [INFO] 20456 sarracenia.moth.amqp ack Sleeping 2 seconds before re-trying ack...
2023-08-29 13:26:39,116 [INFO] 20456 sarracenia.flowcb.log after_work downloaded ok: c:\Users\silvap2\temp\hpfx_amis/SXCN40_KWAL_291722___6464
2023-08-29 13:26:39,117 [INFO] 20456 sarracenia.flow run current_rate/2 (3.52) above messageRateMax(1.00): throttling
2023-08-29 13:26:39,117 [INFO] 20456 sarracenia.flow run current_rate (3.52) vs. messageRateMax(1.00))
2023-08-29 13:26:39,178 [WARNING] 20456 sarracenia.moth.amqp getNewMessage failed q_anonymous_subscribe.hpfx_amis.90328667.61260748: Basic.ack: (406) PRECONDITION_FAILED - unknown delivery tag 562
2023-08-29 13:26:39,179 [WARNING] 20456 sarracenia.moth.amqp getNewMessage lost connection to broker
2023-08-29 13:26:40,222 [INFO] 20456 sarracenia.flow run current_rate/2 (3.50) above messageRateMax(1.00): throttling
2023-08-29 13:26:40,222 [INFO] 20456 sarracenia.flow run current_rate (3.50) vs. messageRateMax(1.00))
2023-08-29 13:26:40,473 [INFO] 20456 sarracenia.moth.amqp getSetup queue declared q_anonymous_subscribe.hpfx_amis.90328667.61260748 (as: amqps://anonymous@hpfx.collab.science.gc.c

There has been a lot of intervening work regarding dealing with failures of many sorts, so we may have just solved this problem while working on other things.

It might be worth trying again with a current release.

petersilva commented 1 year ago

I confirmed by disconnecting and re-connecting a couple of times while watching the lag, and it's the same regardless: the lag doesn't change based on VPN status.

endlisnis commented 1 year ago

Well, I was originally running the Linux version of Cisco AnyConnect, not the Windows version.

But I no longer run it anyway: I migrated my sr3 installation to a Raspberry Pi (for other reasons), so I can neither confirm nor deny that it's fixed.

petersilva commented 1 year ago

OK... well, it does not look like we will make any further progress on this. Closing it for now... we can always re-open if it becomes relevant again.