mikeshultz opened 4 years ago
Both instances are on the same node (always?). Checked some system-level params on the node and they all seem fine.
Open files are well under the max and don't seem like an unreasonable number:
```
# cat /proc/sys/fs/file-nr
6112 0 3076506
```
Connections didn't seem ridiculous:
```
# netstat -na | grep tcp | wc -l
221
```
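A breakdown by TCP state would be more telling than a raw count; a pile of `CLOSE_WAIT` or `TIME_WAIT` would point at connections not being closed cleanly. A minimal sketch, assuming `netstat -na` output with the state in column 6:

```shell
# conn_states: count TCP connections by state from `netstat -na` output.
# The state is the 6th field of lines beginning with "tcp".
conn_states() {
  netstat -na | awk '/^tcp/ {count[$6]++} END {for (s in count) print s, count[s]}'
}

# Only run if netstat is available on this box
command -v netstat >/dev/null 2>&1 && conn_states || true
```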
Tried making the changes that were made on the single-node instance we have (turning off proxy buffering and removing http2), to no apparent effect.
Testing the file sizes coming from the nodes, from the issuer's point of view, showed no issue:
```
curl -H 'Host: celyn.ogn.app' http://10.8.2.181:8080/app.ace7f8e8.css | wc -c
curl -H 'Host: celyn.ogn.app' http://10.8.0.186:8080/app.ace7f8e8.css | wc -c
curl -H 'Host: celyn.ogn.app' http://prod-ipfs-cluster.prod.svc.cluster.local:8080/app.ace7f8e8.css | wc -c
```
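The one-off checks above can be wrapped in a small helper that flags a truncated response. A sketch only: `EXPECTED` is a hypothetical known-good size for `app.ace7f8e8.css`, not a number from this session.

```shell
# check_backend: fetch a URL with the gateway Host header and compare the
# body size against EXPECTED. EXPECTED (48213) is a placeholder; substitute
# the real size of the asset being checked.
EXPECTED=48213

check_backend() {
  size=$(curl -s -H 'Host: celyn.ogn.app' "$1" | wc -c | tr -d ' ')
  if [ "$size" -eq "$EXPECTED" ]; then
    echo "PASS $1 ($size bytes)"
  else
    echo "FAIL $1 ($size of $EXPECTED bytes)"
  fi
}

# Usage (backends from the checks above):
#   check_backend http://10.8.2.181:8080/app.ace7f8e8.css
#   check_backend http://10.8.0.186:8080/app.ace7f8e8.css
```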
This at least seems to confirm that the OpenResty instances, or the autossl module being used for the issuer, are the likely cause of the issue. Though it's probably not OpenResty itself, since nginx-ingress is built on the same stack. Will try updating the issuer image with the configuration changes made for the ingress (mostly proxy config) and get that ready for testing next time this problem arises.
Will do some further research and perhaps see if I can gather something coherent and approach go-ipfs about it, though there's not a ton of useful information gathered yet.
Next time this happens it will probably be worth capturing some of these conversations, to see the resets happen and who actually initiates them. We might find some other underlying issues with the conversation, and it could give us something concrete to take to go-ipfs. There were issues with nginx + go-ipfs before around file adds, so it wouldn't surprise me here.
```
# Install tcpdump
apk add tcpdump

# Run capture for a while until satisfied some reset requests came through
tcpdump dst port 8080 -w /tmp/reset_capture.pcap

# Exfil the capture for analysis
kubectl -n prod cp prod-ipfs-cluster-issuer-0:/tmp/reset_capture.pcap /tmp/reset_capture.pcap
```
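Once the capture is in hand, a quick way to find who sent the resets is to read it back with a BPF filter that matches only RST segments; the source address on each matching line shows which side initiated the reset. A sketch (standard `tcpdump` read-back, path from the capture step above):

```shell
# show_resets: read a pcap back and print only TCP segments with the RST
# flag set, so the addresses reveal which side sent the reset.
show_resets() {
  if [ -r "$1" ]; then
    tcpdump -nr "$1" 'tcp[tcpflags] & tcp-rst != 0'
  else
    echo "capture not found: $1" >&2
    return 1
  fi
}

# Usage: show_resets /tmp/reset_capture.pcap
```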
Some research from yesterday into the nginx/openresty config:
- `proxy_http_version` defaults to HTTP/1.0, so there should be no HTTP/2 communication upstream. http2 was also disabled for client <-> openresty comms.
- `proxy_request_buffering` was turned off; no change.
- `proxy_socket_keepalive` is off by default, so there shouldn't be any lingering connections between openresty and the go-ipfs gateway.
- `keepalive` is also off by default, so there shouldn't be any lingering connections between the client and openresty.
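For reference, a minimal sketch of how those settings would look in the issuer's proxy config. The directive names are real nginx/OpenResty ones; the `server`/`location` layout and the upstream address are assumptions, not the actual issuer config:

```nginx
# Sketch only: layout and upstream are assumptions from this thread.
server {
    # http2 removed from the listen directive (client <-> openresty)
    listen 443 ssl;

    location / {
        proxy_pass http://prod-ipfs-cluster.prod.svc.cluster.local:8080;
        proxy_http_version 1.0;        # the default; no HTTP/2 to the gateway
        proxy_request_buffering off;   # turned off during debugging; no change
        proxy_socket_keepalive off;    # the default; no lingering upstream conns
    }
}
```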
Was able to successfully reproduce this issue on staging! I pulled all the URLs loaded for a deployed store and laid siege against them. The situation quickly devolved into the defunct state described above. So it seems like load triggers it, which is unfortunate because it wasn't that much load (15-20 rps).
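The load test above can be sketched roughly as below. This is a guess at the invocation, not the exact command used: `urls.txt` is a hypothetical file of the store's asset URLs, and 20 clients with a 1s delay lands roughly in the 15-20 rps range mentioned.

```shell
# run_siege: hypothetical sketch of the repro load test. "$1" is a file of
# asset URLs pulled from the deployed store; concurrency/delay approximate
# the 15-20 rps that triggered the failure.
run_siege() {
  if ! command -v siege >/dev/null 2>&1; then
    echo "siege not installed" >&2
    return 1
  fi
  siege -c 20 -d 1 -f "$1" -t 5M
}

# Usage: run_siege urls.txt
```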
Got a couple of packet captures from the PoV of the issuer.
The reset appears to be coming from the go-ipfs gateway (port 8080). I would also like to get a capture from the PoV of go-ipfs and make sure the packets aren't arriving at that pod all chunked up. Unlikely, but good to be thorough. Nothing jumps out at me in these captures; everything appears nominal until the RST.
I'm going to leave this system in its defunct state for the night, but it doesn't appear that it will recover on its own, even when the load goes away.
Awesome we can reproduce it 😊
After bouncing the IPFS pods to reset them back to a working state, I can't seem to repro today. I'm even sending 3x the traffic as yesterday. It seems perfectly happy at ~80rps. :confused:
There's been a new IPFS release (0.6.0). Gonna test and roll that out to staging today, with prod following after it looks good. There have been a bunch of changes, so one can hope there's a non-obvious relevant fix.
Also want to get an IPFS debugging container ready (something we can install packages on, Alpine perhaps) in the meantime, so when I can repro this again I can get a dump from the gateway's PoV.
Just when we thought we had it cornered 🦠😬
Darn :( Hmmm... the fact that we can't consistently reproduce it despite using the same synthetic load makes me wonder if the issue is somehow related to the level of network activity with peers? Not very helpful - I'm just thinking out loud here...
A new avenue I'll be exploring:
The IPFS folks seem to want to blame Kubernetes as well (though there's no evidence either way yet). Worth considering. This also suggests a potential "out": build an IPFS cluster outside of Kubernetes, or on a new cluster with a newer version. Might not be a bad idea to move off the old cluster anyway.
Will update if anything comes from this.
Nice detective work 🕵️‍♂️
So, after the 0.6.0 upgrade the resets came back, and no amount of recreating pods would fix the issue. I did notice one Kubernetes node (there may be others; I only checked a couple that had these pods) had a fuckload of file handles in use:
```
# Normal?
$ cat /proc/sys/fs/file-nr
7552 0 3076506

# Pretty high?
$ cat /proc/sys/fs/file-nr
328352 0 3076512
```
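Checking nodes for this by hand gets old; a small helper can flag abnormal counts. A sketch only: the 100000 threshold is a guess based on the healthy (~7.5k) vs. bad (~328k) nodes above, not a kernel-derived limit.

```shell
# fd_check: report allocated vs. max file handles from a file-nr style input
# (defaults to /proc/sys/fs/file-nr: "allocated  free  max").
# The 100000 threshold is an assumption, chosen from the numbers in this issue.
fd_check() {
  file="${1:-/proc/sys/fs/file-nr}"
  alloc=$(awk '{print $1}' "$file")
  max=$(awk '{print $3}' "$file")
  echo "$alloc of $max file handles allocated"
  if [ "$alloc" -ge 100000 ]; then
    echo "WARNING: unusually high"
  fi
}
```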
This was quite an old node (612d old). The old nodes, the fact that I couldn't coax things back into a working state, and the above-linked issue (which was fixed in Kubernetes 1.15) led me to do an upgrade this evening. It went smoothly, and everything should be back to normal. No resets seen (yet).
I'm not ready to call this solved, but I'm guardedly optimistic. I'll continue getting that go-ipfs debug docker image ready to go just in case.
I think the upgrade to 1.15 took care of this issue. Either it was due to the known bug in 1.14, or rebuilding the underlying nodes helped. Not sure, but let's call this solved. It hasn't shown up in over a week. Will reopen if it shows up again.
Appears to be back with #463. Looking...
Showed back up again this morning. Coincidentally (or probably not?), a GKE node stopped responding at about the same time I started investigating this issue. That's a bit more evidence this is a Kubernetes networking issue.
Continuing to bring the Kubernetes nodes up to modernity is probably a good call but not exactly top priority right now.
I thought we already had an issue for this but maybe not.
Our prod IPFS node cluster has been having issues with traffic to the IPFS gateway that's routed through an OpenResty reverse proxy that handles AutoSSL stuff (ipfs-issuer and ipfs-cluster-issuer). Traffic that flows through the "standard" nginx-ingress (also OpenResty) is unaffected. Rebuilding the issuer never has an effect on the error state. Rebuilding the IPFS node(s) does, usually, restore things to a functional state.
Example error log entries from the issuer:
When fetching a specific CSS file and checking file sizes, the resets happen at seemingly random points in the file transfer. Here's a selection of response sizes seen:
Changes made (to no effect):
- Removed @origin/ipfs-proxy from the loop
- `proxy_buffering off`
I think there were more, but I'm not remembering them. Will keep this issue updated.
Related: #198