appsembler / configuration

a simple, but flexible, way for anyone to stand up an instance of the edX platform that is fully configured and ready-to-go
GNU Affero General Public License v3.0

don't restart nginx, block ports to remove from lb #411

Closed OmarIthawi closed 2 years ago

OmarIthawi commented 2 years ago

This is a proposal to remove nginx from the load balancer without risking breaking it.

Status

This pull request is mostly to discuss the idea. We've had a ton of proposals and discussions so far on

This is my last attempt to find low-hanging fruit to reduce the impact of GSFuse hanging during deploys.

The only other viable solution currently is: https://appsembler.atlassian.net/l/cp/JMywXNn2

Checklist

thraxil commented 2 years ago

The main problem with this approach is that it kills in-flight requests. If you stop the nginx process, it does a graceful shutdown: it immediately stops accepting new requests, waits for existing ones to complete, and then actually shuts down. Cutting things off at the firewall level just blocks packets, so anything currently in flight can be cut off partially complete, or it can complete on the backend/django side but not be able to return a response.
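To make the "stop accepting, then drain" sequence concrete, here is a toy model in Python. It is only a sketch of the general graceful-shutdown pattern, not nginx's actual implementation; the class name `GracefulServer` and its methods are hypothetical names invented for this illustration.

```python
import threading


class GracefulServer:
    """Toy model of graceful shutdown (hypothetical, not nginx's code):
    refuse new requests, but let in-flight requests run to completion."""

    def __init__(self):
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)
        self._accepting = True
        self._in_flight = 0

    def begin_request(self) -> bool:
        """Return True if the request was admitted, False if we are draining."""
        with self._lock:
            if not self._accepting:
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        """Mark one in-flight request as finished."""
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def shutdown(self, timeout=None) -> bool:
        """Stop admitting requests, then wait for in-flight ones to finish.
        Returns True if fully drained, False if the timeout expired first."""
        with self._idle:
            self._accepting = False
            return self._idle.wait_for(lambda: self._in_flight == 0, timeout)
```

A firewall-level block has no equivalent of the waiting step in `shutdown`: packets simply stop flowing, whether or not a request is still being served.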

thraxil commented 2 years ago

We'd also need to test what the LB actually does. When it tries to healthcheck a VM where nginx is stopped, it gets an immediate failure response. Dropping packets with the firewall usually looks more like a slow server, and you have to wait for the TCP packets to time out (or for some other timeout that the LB might have) before you can definitively say "this backend is unhealthy and needs to be removed from the pool".
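The difference between the two failure modes can be observed with a plain TCP probe. The sketch below is an assumption about how one might check it by hand, not what the LB's health check actually does; the function name `probe` and the 2-second default timeout are made up for this example.

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify a TCP connection attempt (hypothetical helper).

    "healthy": something accepted the connection.
    "refused": the host answered immediately with a reset, as happens
               when nothing is listening on the port.
    "timeout": no answer arrived within `timeout` seconds, as happens
               when a firewall silently drops the packets.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "healthy"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"
```

With nginx stopped, a probe like this would be expected to return `"refused"` almost instantly, whereas a dropped-packet firewall rule would make it block for the full timeout before returning `"timeout"`. How quickly the LB removes the backend in the second case depends on its own health-check settings, which would need to be tested.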

OmarIthawi commented 2 years ago

> The main problem with this approach is that it kills in-flight requests. If you stop the nginx process, it does a graceful shutdown: it immediately stops accepting new requests, waits for existing ones to complete, and then actually shuts down. Cutting things off at the firewall level just blocks packets, so anything currently in flight can be cut off partially complete, or it can complete on the backend/django side but not be able to return a response.

This is the expected behavior, and yes, it's not acceptable.

I spent 30 minutes trying to verify this behavior by blocking port 443 on staging-tahoe-us-juniper-edxapp-1, without luck.

I'm going to stop experimenting here because this is no longer the low-hanging fruit I was hoping for.

OmarIthawi commented 2 years ago

🏳️