cloudfoundry / gorouter

CF Router
Apache License 2.0

Enhance gorouter retry logic #437

Open b1tamara opened 2 months ago

b1tamara commented 2 months ago

### Current behavior

Deploy a Node.js test app with two instances; each instance holds every request for 5 seconds and caps itself at 1000 concurrent connections:

```js
const express = require('express');
const http = require('http');
const { setTimeout } = require('timers/promises');

const app = express();

app.get('/', async (req, res) => { await setTimeout(5000); res.send('Hello, World!'); });

const server = http.createServer(app);
server.maxConnections = 1000;

const PORT = process.env.PORT || 3000;
server.listen(PORT, () => {
  console.log(`Server is running on http://localhost:${PORT}`);
  console.log(`Max connections: ${server.maxConnections}`);
});
```


Start sending requests:
`hey -c 1000 -n 1000 -H "X-Cf-App-Instance: 1edeb42a-076a-41a2-84a6-a70056e9938e:0" "https://retry-test.<domain>"`
After 1-2 seconds:
`hey -c 1000 -n 1000 "https://retry-test.<domain>"`

All 1000 requests with the X-Cf-App-Instance header are successful.
Roughly a quarter of the requests without the X-Cf-App-Instance header finish with 502 because gorouter sends retries to the same app instance.

Observations:
- Every RequestInfo has a pointer to the [EndpointPool](https://github.com/cloudfoundry/gorouter/blob/6331451d30dbf996cc9132ab617f07b25707b06b/handlers/lookup.go#L108). For the same URI (route) it points to the same EndpointPool.
- The issue is related to the iteration [over pool endpoints](https://github.com/cloudfoundry/gorouter/blob/6331451d30dbf996cc9132ab617f07b25707b06b/route/roundrobin.go#L74).
- Gorouter gets a lot of concurrent requests to the same route.
- max_retries = number of app instances.
- For a couple of seconds the app instance with instance_id = 0 cannot accept new connections because of the max connection limit in the app code.

Found logs (written by the same gorouter):

```
"timestamp":"2024-09-10T09:46:47.350182234Z","message":"backend","data":{"instance_id":"0"},"attempt":1,"vcap_request_id":"703d8ce7-b187-41b6-4a02-a122d1ee8b60"
"timestamp":"2024-09-10T09:46:47.350406060Z","message":"backend","data":{"instance_id":"1"},"attempt":1,"vcap_request_id":"ffed058e-a289-4ca9-7d22-0e345c2bd54f"
"timestamp":"2024-09-10T09:46:47.357750671Z","message":"backend","data":{"instance_id":"0"},"attempt":2,"vcap_request_id":"703d8ce7-b187-41b6-4a02-a122d1ee8b60"
"timestamp":"2024-09-10T09:46:47.357588430Z","message":"backend-endpoint-failed","data":{"instance_id":"0"},"error":"EOF (via idempotent request)","attempt":1,"vcap_request_id":"703d8ce7-b187-41b6-4a02-a122d1ee8b60"
"timestamp":"2024-09-10T09:46:47.363893483Z","message":"backend-endpoint-failed","data":{"instance_id":"0"},"error":"EOF (via idempotent request)","attempt":2,"vcap_request_id":"703d8ce7-b187-41b6-4a02-a122d1ee8b60"
```

The request with vcap_request_id=703d8ce7-b187-41b6-4a02-a122d1ee8b60 was sent to the same instance_id=0 twice, reached the max_retries value, and finished with 502.

In other words, according to the [round robin logic for selecting the next endpoint](https://github.com/cloudfoundry/gorouter/blob/6331451d30dbf996cc9132ab617f07b25707b06b/route/roundrobin.go#L91), the sequence is as follows (a Go sketch of this interleaving follows the walkthrough):
The Gorouter gets the first request (703d8ce7-b187-41b6-4a02-a122d1ee8b60) ->
- pool.currentIndex=0
- pool.nextIdx = 1  
- attempt = 1
- selectedEndpoint: app instance with instance_id=0
- Result: backend-endpoint-failed (EOF) & It should be retried. 

The Gorouter gets a second request (ffed058e-a289-4ca9-7d22-0e345c2bd54f) ->
- pool.currentIndex=1
- pool.nextIdx = 0
- attempt=1
- selectedEndpoint: app instance with instance_id=1
- Result: Success

The Gorouter retries the first request (703d8ce7-b187-41b6-4a02-a122d1ee8b60) ->
- pool.currentIndex=0
- pool.nextIdx = 1  
- attempt=2
- selectedEndpoint: app instance with instance_id=0
- Result: backend-endpoint-failed (EOF) & 502 as max_retries reached. 
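To make the interleaving concrete, here is a minimal, self-contained Go sketch. The types and names are simplified stand-ins, not the actual gorouter code; the point is only that both per-request iterators advance the same pool-level counter, so request A's retry lands on the endpoint it already failed on.

```go
package main

import (
	"fmt"
	"sync"
)

// pool models the shared per-route state: every iterator for the same route
// reads and advances the same nextIdx (simplified stand-in for EndpointPool).
type pool struct {
	mu        sync.Mutex
	endpoints []string
	nextIdx   int
}

// iterator models a per-request round robin iterator backed by the shared counter.
type iterator struct{ p *pool }

func (it *iterator) Next() string {
	it.p.mu.Lock()
	defer it.p.mu.Unlock()
	e := it.p.endpoints[it.p.nextIdx%len(it.p.endpoints)]
	it.p.nextIdx++
	return e
}

func main() {
	p := &pool{endpoints: []string{"instance_id=0", "instance_id=1"}}

	reqA := &iterator{p} // request 703d8ce7-...
	reqB := &iterator{p} // request ffed058e-...

	fmt.Println("A attempt 1:", reqA.Next()) // instance_id=0 -> fails with EOF
	fmt.Println("B attempt 1:", reqB.Next()) // instance_id=1 -> succeeds, but advances the shared counter
	fmt.Println("A attempt 2:", reqA.Next()) // instance_id=0 again -> max_retries reached, 502
}
```

With a per-request notion of "which endpoints have I already tried", A's second attempt would have been routed to instance_id=1 instead.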

EOF is classified as a retriable backend error, but it does not belong to the FailableClassifiers.
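A rough sketch of that distinction (the function names and sets below are purely illustrative, not gorouter's actual fails package): an error can be retriable without being failable, so under that assumption the endpoint that produced the EOF is not flagged in the pool and remains a valid pick for the retry.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// Illustrative split only -- these are not gorouter's real classifier sets.
// Assumption: a retriable error triggers another attempt, while a failable
// error would additionally mark the endpoint as failed in the pool.
func isRetriable(err error, idempotent bool) bool {
	return idempotent && errors.Is(err, io.EOF)
}

func isFailable(err error) bool {
	// In the scenario described above, EOF is not in this set.
	return false
}

func main() {
	err := io.EOF
	fmt.Println("retriable:", isRetriable(err, true)) // true  -> a second attempt happens
	fmt.Println("failable: ", isFailable(err))        // false -> the endpoint stays eligible
}
```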

### Desired behavior

- All requests should be successful, since the two app instances together can accept and process 2000 requests.
- Gorouter's retry should not go to the same app instance twice.

### Affected Version

routing-release: 0.301
maxmoehl commented 1 month ago

We discussed this further offline: the underlying issue is that the retry mechanism makes assumptions about the round robin iterator that are not true. It assumes the iterator will go over every endpoint exactly once as it calls next (we documented that, but apparently never properly verified it 😞). That is not the case, because the iterator uses a global counter from the referenced pool, which may be incremented by parallel requests.

The only simple solution I can think of right now is to introduce local round robin in each iterator, but this will most likely make it less round robin, if you get what I mean 😄

A middle ground would be to use the pool index to "seed" each iterator, i.e. give each iterator only its first index (since we assume that most requests will pass), and subsequent next calls on that iterator would then loop over the entire pool exactly once.
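A rough sketch of that middle ground under the assumptions above (illustrative names, not a patch; locking omitted for brevity): the shared pool counter is consulted only once to seed the iterator, and every subsequent Next call advances a request-local offset, so a request's attempts walk the whole pool exactly once regardless of what concurrent requests do to the shared counter.

```go
package main

import "fmt"

// seededPool hands out only a starting index; per-request iteration is local.
type seededPool struct {
	endpoints []string
	nextIdx   int
}

type seededIterator struct {
	endpoints []string
	start     int // taken once from the shared counter when the iterator is created
	offset    int // local to this request, advanced only by its own attempts
}

func (p *seededPool) newIterator() *seededIterator {
	it := &seededIterator{endpoints: p.endpoints, start: p.nextIdx}
	p.nextIdx++ // overall traffic distribution stays round robin
	return it
}

func (it *seededIterator) Next() string {
	e := it.endpoints[(it.start+it.offset)%len(it.endpoints)]
	it.offset++
	return e
}

func main() {
	p := &seededPool{endpoints: []string{"instance_id=0", "instance_id=1"}}

	reqA := p.newIterator() // seeded with index 0
	reqB := p.newIterator() // seeded with index 1

	fmt.Println("A attempt 1:", reqA.Next()) // instance_id=0 (fails with EOF)
	fmt.Println("B attempt 1:", reqB.Next()) // instance_id=1
	fmt.Println("A attempt 2:", reqA.Next()) // instance_id=1 -> the retry reaches the other instance
}
```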