cloudfoundry / routing-release

This is the BOSH release for cloud foundry routers
Apache License 2.0
42 stars 107 forks source link

Deleting a tcp route did not trigger a config update for the tcp router #189

Closed ameowlia closed 1 year ago

ameowlia commented 4 years ago

Issue

Sometimes when a tcp route is deleted, the haproxy config (/var/vcap/jobs/tcp_router/config/haproxy.conf) does not change. This means that the HAProxy that is still running can still connect to an app using the deleted route. In Routing Release 0.204.0 the logic for how often the config is updated was changed. This change likely introduced this bug. Before version 0.204.0, the config reloaded every 1 minute (or so).

Affected Versions

Routing Release 0.204.0 and above.

Context

Setup

First we get the delete event for 10.0.8.20

{"timestamp":"2020-11-18T03:46:56.017777260Z","message":"tcp-router.handle-event.starting","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.20","port":1040,"modification_tag":{"guid":"029ca272-120a-49a5-6383-96c1d9e5271b","index":47},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"615"}} {"timestamp":"2020-11-18T03:46:56.017840688Z","message":"tcp-router.handle-event.converting-tcp-route-mapping","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.20","port":1040,"modification_tag":{"guid":"029ca272-120a-49a5-6383-96c1d9e5271b","index":47},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"615","tcp-route":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.20","port":1040,"modification_tag":{"guid":"029ca272-120a-49a5-6383-96c1d9e5271b","index":47},"ttl":10,"isolation_segment":""}}}

Correctly remove 10.0.8.20 from the routing table

{"timestamp":"2020-11-18T03:46:56.017865346Z","message":"tcp-router.routing-table.delete-backend.removing-from-table","data":{"info":{"Address":"10.0.8.20","Port":3306,"ModificationTag":{"guid":"029ca272-120a-49a5-6383-96c1d9e5271b","index":47},"TTL":10},"key":{"Port":1040},"new":{"ModificationTag":{"guid":"029ca272-120a-49a5-6383-96c1d9e5271b","index":47},"TTL":10,"UpdatedTime":"2020-11-18T03:46:56.017862046Z"},"old":{"ModificationTag":{"guid":"029ca272-120a-49a5-6383-96c1d9e5271b","index":47},"TTL":10,"UpdatedTime":"2020-11-18T03:46:40.884354603Z"},"session":"2.424"}}

Correctly reconfigure the tcp-router to remove 10.0.8.20

{"timestamp":"2020-11-18T03:46:56.017896523Z","message":"tcp-router.handle-event.calling-configurer","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.20","port":1040,"modification_tag":{"guid":"029ca272-120a-49a5-6383-96c1d9e5271b","index":47},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"615"}} {"timestamp":"2020-11-18T03:46:56.017909478Z","message":"tcp-router.reading-config-file","data":{"config-file":"/var/vcap/data/tcp_router/config/haproxy.conf"}} {"timestamp":"2020-11-18T03:46:56.018122346Z","message":"tcp-router.writing-config","data":{"num-bytes":722}} {"timestamp":"2020-11-18T03:46:56.018225965Z","message":"tcp-router.running-script","data":{}} {"timestamp":"2020-11-18T03:46:56.035922948Z","message":"tcp-router.running-script","data":{"output":""}}

correctly finish delete event for 10.0.8.20

{"timestamp":"2020-11-18T03:46:56.035967937Z","message":"tcp-router.handle-event.finished","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.20","port":1040,"modification_tag":{"guid":"029ca272-120a-49a5-6383-96c1d9e5271b","index":47},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"615"}}

At this point there are two routes. One for each ip: 10.0.8.19 and 10.0.8.21

Then we get the delete event for 10.0.8.19

{"timestamp":"2020-11-18T03:46:56.072056065Z","message":"tcp-router.handle-event.starting","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.19","port":1040,"modification_tag":{"guid":"901c38a0-da76-47bd-755c-4dc2ca4c2550","index":73},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"618"}} {"timestamp":"2020-11-18T03:46:56.072076056Z","message":"tcp-router.handle-event.converting-tcp-route-mapping","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.19","port":1040,"modification_tag":{"guid":"901c38a0-da76-47bd-755c-4dc2ca4c2550","index":73},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"618","tcp-route":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.19","port":1040,"modification_tag":{"guid":"901c38a0-da76-47bd-755c-4dc2ca4c2550","index":73},"ttl":10,"isolation_segment":""}}}

Correctly remove 10.0.8.19 from the routing table

{"timestamp":"2020-11-18T03:46:56.072097421Z","message":"tcp-router.routing-table.delete-backend.removing-from-table","data":{"info":{"Address":"10.0.8.19","Port":3306,"ModificationTag":{"guid":"901c38a0-da76-47bd-755c-4dc2ca4c2550","index":73},"TTL":10},"key":{"Port":1040},"new":{"ModificationTag":{"guid":"901c38a0-da76-47bd-755c-4dc2ca4c2550","index":73},"TTL":10,"UpdatedTime":"2020-11-18T03:46:56.072093808Z"},"old":{"ModificationTag":{"guid":"901c38a0-da76-47bd-755c-4dc2ca4c2550","index":73},"TTL":10,"UpdatedTime":"2020-11-18T03:46:39.774615697Z"},"session":"2.427"}}

Correctly reconfigure the tcp-router to remove 10.0.8.19

{"timestamp":"2020-11-18T03:46:56.072117404Z","message":"tcp-router.handle-event.calling-configurer","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.19","port":1040,"modification_tag":{"guid":"901c38a0-da76-47bd-755c-4dc2ca4c2550","index":73},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"618"}} {"timestamp":"2020-11-18T03:46:56.072133421Z","message":"tcp-router.reading-config-file","data":{"config-file":"/var/vcap/data/tcp_router/config/haproxy.conf"}} {"timestamp":"2020-11-18T03:46:56.072273034Z","message":"tcp-router.writing-config","data":{"num-bytes":536}} {"timestamp":"2020-11-18T03:46:56.072369613Z","message":"tcp-router.running-script","data":{}} {"timestamp":"2020-11-18T03:46:56.089599918Z","message":"tcp-router.running-script","data":{"output":""}}

correctly finish delete event for 10.0.8.19

{"timestamp":"2020-11-18T03:46:56.089643035Z","message":"tcp-router.handle-event.finished","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.19","port":1040,"modification_tag":{"guid":"901c38a0-da76-47bd-755c-4dc2ca4c2550","index":73},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"618"}}

At this point the routing table has one route, and it is for the ip 10.0.8.21

Then a pruning cycle starts and finishes

{"timestamp":"2020-11-18T03:46:59.978084626Z","message":"tcp-router.prune-stale-routes.starting","data":{"session":"619"}} {"timestamp":"2020-11-18T03:46:59.978172568Z","message":"tcp-router.prune-stale-routes.completed","data":{"session":"619"}}

Then a bulk sync cycle starts

{"timestamp":"2020-11-18T03:46:59.979586070Z","message":"tcp-router.bulk-sync.starting","data":{"session":"620"}} {"timestamp":"2020-11-18T03:46:59.979623864Z","message":"tcp-router.uaa-client.started-fetching-token","data":{"endpoint":"https://uaa.service.cf.internal:8443/oauth/token","force-update":false,"session":"621"}} {"timestamp":"2020-11-18T03:46:59.979649253Z","message":"tcp-router.uaa-client.using-cached-token","data":{"session":"621"}} {"timestamp":"2020-11-18T03:46:59.979951782Z","message":"tcp-router.get-stats.start","data":{"session":"622"}} {"timestamp":"2020-11-18T03:46:59.980031711Z","message":"tcp-router.get-stats.connection-successful","data":{"session":"622"}} {"timestamp":"2020-11-18T03:46:59.980047206Z","message":"tcp-router.get-stats.sent-stats-command","data":{"session":"622"}} {"timestamp":"2020-11-18T03:46:59.980356539Z","message":"tcp-router.get-stats.num-bytes-read","data":{"count":1489,"session":"622"}} {"timestamp":"2020-11-18T03:46:59.980449271Z","message":"tcp-router.get-stats.completed","data":{"session":"622"}}

At this point the bulk sync indicates that there should be ZERO routes

{"timestamp":"2020-11-18T03:46:59.982061998Z","message":"tcp-router.bulk-sync.fetched-tcp-routes","data":{"num-routes":0,"session":"620"}} {"timestamp":"2020-11-18T03:46:59.982093986Z","message":"tcp-router.bulk-sync.applying-cached-events","data":{"cache_size":0,"session":"620"}} {"timestamp":"2020-11-18T03:46:59.982104953Z","message":"tcp-router.bulk-sync.applied-cached-events","data":{"session":"620"}}

Then a bulk sync cycle finishes WITHOUT changing the routing table or updating the haproxy config, even though it found out that is supposed to have ZERO routes.

{"timestamp":"2020-11-18T03:46:59.982113253Z","message":"tcp-router.bulk-sync.completed","data":{"session":"620"}}

Then we get the delete event for 10.0.8.21

{"timestamp":"2020-11-18T03:47:06.017581867Z","message":"tcp-router.handle-event.starting","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.21","port":1040,"modification_tag":{"guid":"5ac38837-5e52-4f4e-7cef-68b570e09a9f","index":22},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"623"}} {"timestamp":"2020-11-18T03:47:06.017631673Z","message":"tcp-router.handle-event.converting-tcp-route-mapping","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.21","port":1040,"modification_tag":{"guid":"5ac38837-5e52-4f4e-7cef-68b570e09a9f","index":22},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"623","tcp-route":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.21","port":1040,"modification_tag":{"guid":"5ac38837-5e52-4f4e-7cef-68b570e09a9f","index":22},"ttl":10,"isolation_segment":""}}}

Then the delete event for 10.0.8.21 finishes incorrectly WITHOUT changing the routing table or updating the haproxy config

{"timestamp":"2020-11-18T03:47:06.017651057Z","message":"tcp-router.handle-event.finished","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.21","port":1040,"modification_tag":{"guid":"5ac38837-5e52-4f4e-7cef-68b570e09a9f","index":22},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"623"}}



## Steps to Reproduce

TODO

## Expected result

When a tcp route is deleted, then the tcp router is updated to reflect that.

## Current result

When a tcp route is deleted, the tcp router is not always updated.

## Workarounds

1. Monit restart the tcp-router
1. "Jiggling" the tcp-router by adding/deleting another route.

## Possible Fix

There is likely a race condition with [this code](https://github.com/cloudfoundry/cf-tcp-router/commit/f0e04b3bda6db572cbd4a26cae7558a25883a39a). and the bulk sync.

## More context

This is seen on a somewhat frequent basis in tests that have a relatively low number of routes (It's a test env not a prod env) and yet we are seeing this fairly frequently. I worry about how often this happens with a larger number of moving pieces. 

cc: @ssapra 
geofffranks commented 1 year ago

This has been completed as of routing-release 0.260.0.