mikeshultz opened 4 years ago
Both instances are on the same node (always?). Checked some system-level params on the node and they all seem fine.
Open files are well under the max and don't seem like an unreasonable number:
```
# cat /proc/sys/fs/file-nr
6112 0 3076506
```
Connections didn't seem ridiculous:
```
# netstat -na | grep tcp | wc -l
221
```
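A breakdown by TCP state would be more telling than a raw count; a pile of `CLOSE_WAIT` or `TIME_WAIT` would point at connections not being closed cleanly. A minimal sketch, assuming `netstat -na` output with the state in column 6:

```shell
# conn_states: count TCP connections by state from `netstat -na` output.
# The state is the 6th field of lines beginning with "tcp".
conn_states() {
  netstat -na | awk '/^tcp/ {count[$6]++} END {for (s in count) print s, count[s]}'
}

# Only run if netstat is available on this box
command -v netstat >/dev/null 2>&1 && conn_states || true
```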
Tried making the changes that were made on the single-node instance we have (turning off proxy buffering and removing http2), to no apparent effect.
Testing the file sizes coming from the nodes, from the issuer's point of view, showed no issue:
```
curl -H 'Host: celyn.ogn.app' http://10.8.2.181:8080/app.ace7f8e8.css | wc -c
curl -H 'Host: celyn.ogn.app' http://10.8.0.186:8080/app.ace7f8e8.css | wc -c
curl -H 'Host: celyn.ogn.app' http://prod-ipfs-cluster.prod.svc.cluster.local:8080/app.ace7f8e8.css | wc -c
```
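The one-off checks above can be wrapped in a small helper that flags a truncated response. A sketch only: `EXPECTED` is a hypothetical known-good size for `app.ace7f8e8.css`, not a number from this session.

```shell
# check_backend: fetch a URL with the gateway Host header and compare the
# body size against EXPECTED. EXPECTED (48213) is a placeholder; substitute
# the real size of the asset being checked.
EXPECTED=48213

check_backend() {
  size=$(curl -s -H 'Host: celyn.ogn.app' "$1" | wc -c | tr -d ' ')
  if [ "$size" -eq "$EXPECTED" ]; then
    echo "PASS $1 ($size bytes)"
  else
    echo "FAIL $1 ($size of $EXPECTED bytes)"
  fi
}

# Usage (backends from the checks above):
#   check_backend http://10.8.2.181:8080/app.ace7f8e8.css
#   check_backend http://10.8.0.186:8080/app.ace7f8e8.css
```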
This at least seems to confirm that the OpenResty instances, or the autossl module being used for the issuer, are the likely cause of the issue. Though it's probably not OpenResty itself, since nginx-ingress is built on the same stack. Will try updating the issuer image with the configuration changes made for the ingress (mostly proxy config) and get that ready for testing next time this problem arises.
Will do some further research and perhaps see if I can gather something coherent and approach go-ipfs about it, though there's not a ton of useful information gathered yet.
Next time this happens it will probably be worth capturing some of these conversations, to see the resets happen and who actually initiates them. We might find some other underlying issues with the conversation, and it could give us something concrete to take to go-ipfs. There were issues with nginx + go-ipfs before around file adds, so it wouldn't surprise me here.
```
# Install tcpdump
apk add tcpdump

# Run capture for a while until satisfied some reset requests came through
tcpdump dst port 8080 -w /tmp/reset_capture.pcap

# Exfil the capture for analysis
kubectl -n prod cp prod-ipfs-cluster-issuer-0:/tmp/reset_capture.pcap /tmp/reset_capture.pcap
```
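Once the capture is in hand, a quick way to find who sent the resets is to read it back with a BPF filter that matches only RST segments; the source address on each matching line shows which side initiated the reset. A sketch (standard `tcpdump` read-back, path from the capture step above):

```shell
# show_resets: read a pcap back and print only TCP segments with the RST
# flag set, so the addresses reveal which side sent the reset.
show_resets() {
  if [ -r "$1" ]; then
    tcpdump -nr "$1" 'tcp[tcpflags] & tcp-rst != 0'
  else
    echo "capture not found: $1" >&2
    return 1
  fi
}

# Usage: show_resets /tmp/reset_capture.pcap
```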
Some research from yesterday into the nginx/openresty config:
- `proxy_http_version` defaults to HTTP/1.0, so there should be no HTTP/2 communication upstream. http2 was also disabled for client <-> openresty comms.
- `proxy_request_buffering` was turned off; no change.
- `proxy_socket_keepalive` is off by default, so there shouldn't be any lingering connections between openresty and the go-ipfs gateway.
- `keepalive` is also off by default, so there shouldn't be any lingering connections between the client and openresty.
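For reference, a minimal sketch of how those settings would look in the issuer's proxy config. The directive names are real nginx/OpenResty ones; the `server`/`location` layout and the upstream address are assumptions, not the actual issuer config:

```nginx
# Sketch only: layout and upstream are assumptions from this thread.
server {
    # http2 removed from the listen directive (client <-> openresty)
    listen 443 ssl;

    location / {
        proxy_pass http://prod-ipfs-cluster.prod.svc.cluster.local:8080;
        proxy_http_version 1.0;        # the default; no HTTP/2 to the gateway
        proxy_request_buffering off;   # turned off during debugging; no change
        proxy_socket_keepalive off;    # the default; no lingering upstream conns
    }
}
```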
Was able to successfully reproduce this issue on staging! I pulled all the URLs loaded for a deployed store and laid siege against them. The situation quickly devolved into the defunct state described above. So it seems like load triggers it, which is unfortunate because it wasn't that much load (15-20 rps).
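The load test above can be sketched roughly as below. This is a guess at the invocation, not the exact command used: `urls.txt` is a hypothetical file of the store's asset URLs, and 20 clients with a 1s delay lands roughly in the 15-20 rps range mentioned.

```shell
# run_siege: hypothetical sketch of the repro load test. "$1" is a file of
# asset URLs pulled from the deployed store; concurrency/delay approximate
# the 15-20 rps that triggered the failure.
run_siege() {
  if ! command -v siege >/dev/null 2>&1; then
    echo "siege not installed" >&2
    return 1
  fi
  siege -c 20 -d 1 -f "$1" -t 5M
}

# Usage: run_siege urls.txt
```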
Got a couple of packet captures from the PoV of the issuer.
The reset appears to be coming from the go-ipfs gateway (port 8080). I would also like to get a capture from the PoV of go-ipfs and make sure the packets aren't arriving at that pod all chunked up. Unlikely, but good to be thorough. Nothing jumps out at me in these captures; everything appears nominal until the RST.
I'm going to leave this system in its defunct state for the night, but it doesn't appear that it will recover on its own, even when the load goes away.
Awesome we can reproduce it 😊
After bouncing the IPFS pods to reset them back to a working state, I can't seem to repro today. I'm even sending 3x the traffic as yesterday. It seems perfectly happy at ~80rps. :confused:
There's been a new IPFS release (0.6.0). Gonna test and roll that out to staging today, with prod following after it looks good. There have been a bunch of changes, so one can hope there's a non-obvious relevant fix.
Also want to get an IPFS debugging container ready (something we can install packages on, Alpine perhaps) in the meantime, so when I can repro this again I can get a dump from the gateway's PoV.
Just when we thought we had it cornered 🦠😬
Darn :( Hmmm... the fact that we can't consistently reproduce it despite using the same synthetic load makes me wonder if the issue is somehow related to the level of network activity with peers? Not very helpful - I'm just thinking out loud here...
A new avenue I'll be exploring:
The IPFS folks seem to want to blame Kubernetes as well (though there's no evidence either way yet). Worth considering. This also suggests a potential "out": build an IPFS cluster outside of Kubernetes, or on a new cluster with a newer version. Might not be a bad idea to move off the old cluster anyway.
Will update if anything comes from this.
Nice detective work 🕵️‍♂️
So, after the 0.6.0 upgrade the resets came back, and no amount of recreating pods would fix the issue. I did notice one Kubernetes node (there may be others; I only checked a couple that had these pods) had a fuckload of file handles in use:
```
# Normal?
$ cat /proc/sys/fs/file-nr
7552 0 3076506

# Pretty high?
$ cat /proc/sys/fs/file-nr
328352 0 3076512
```
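Checking nodes for this by hand gets old; a small helper can flag abnormal counts. A sketch only: the 100000 threshold is a guess based on the healthy (~7.5k) vs. bad (~328k) nodes above, not a kernel-derived limit.

```shell
# fd_check: report allocated vs. max file handles from a file-nr style input
# (defaults to /proc/sys/fs/file-nr: "allocated  free  max").
# The 100000 threshold is an assumption, chosen from the numbers in this issue.
fd_check() {
  file="${1:-/proc/sys/fs/file-nr}"
  alloc=$(awk '{print $1}' "$file")
  max=$(awk '{print $3}' "$file")
  echo "$alloc of $max file handles allocated"
  if [ "$alloc" -ge 100000 ]; then
    echo "WARNING: unusually high"
  fi
}
```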
This was quite an old node (612d old). The old nodes, the fact that I couldn't coax things back into a working state, and the above-linked issue (which was fixed in Kubernetes 1.15) led me to do an upgrade this evening. It went smoothly, and everything should be back to normal. No resets seen (yet).
I'm not ready to call this solved, but I'm guardedly optimistic. I'll continue getting that go-ipfs debug docker image ready to go just in case.
I think the upgrade to 1.15 took care of this issue. Either it was due to the known bug in 1.14, or rebuilding the underlying nodes helped. Not sure, but let's call this solved. It hasn't shown up in over a week. Will reopen if it shows up again.
Appears to be back with #463. Looking...
Showed back up again this morning. Coincidentally (or probably not?), a GKE node stopped responding at about the same time I started investigating this issue. That's a bit more evidence this is a Kubernetes networking issue.
Continuing to bring the Kubernetes nodes up to modernity is probably a good call but not exactly top priority right now.
I thought we already had an issue for this but maybe not.
Our prod IPFS node cluster has been having issues with traffic to the IPFS gateway that's routed through an OpenResty reverse proxy that handles AutoSSL stuff (ipfs-issuer and ipfs-cluster-issuer). Traffic that flows through the "standard" nginx-ingress (also OpenResty) is unaffected. Rebuilding the issuer never has an effect on the error state. Rebuilding the IPFS node(s) does, usually, restore things to a functional state.
Example error log entries from the issuer:
When fetching a specific CSS file and checking file sizes, the resets happen at seemingly random points in the file transfer. Here's a selection of response sizes seen:
Changes made (to no effect):
- Removed @origin/ipfs-proxy from the loop
- `proxy_buffering off`
I think there were more, but I'm not remembering them. Will keep this issue updated.
Related: #198