Sometimes when a tcp route is deleted, the haproxy config (/var/vcap/jobs/tcp_router/config/haproxy.conf) does not change. This means that the HAProxy that is still running can still connect to an app using the deleted route. In Routing Release 0.204.0 the logic for how often the config is updated was changed. This change likely introduced this bug. Before version 0.204.0, the config reloaded every 1 minute (or so).
We have 3 BOSH VMs that share a tcp route mapping (backend ips: 10.0.8.19, 10.0.8.20, and 10.0.8.21, backend port 3306 and tcp port 1040)
The tcp_router logs show the 2 Delete TcpEvents arrive (for 10.0.8.20 and 10.0.8.19) and cause /var/vcap/data/tcp_router/config/haproxy.conf to be re-configured appropriately. Then the scheduled sync is triggered and it logs that there are 0 tcp routes returned by the routing API. I don't see anything to indicate that the 3rd backend ip 10.0.8.21 is removed from the haproxy.conf.
Soon after, the third Delete TcpEvent arrives for 10.0.8.21 but the event is handled and the lack of logs indicate the haproxy.conf is not re-configured.
When we look at the haxproxy.conf, we see there is still the listen section for 10.0.8.21:
listen listen_cfg_1040
mode tcp
bind :1040
server server_10.0.8.21_3306 10.0.8.21:3306
Logs
# At the start there are three routes. One for each ip: 10.0.8.19, 10.0.8.20, 10.0.8.21
Then a bulk sync cycle finishes WITHOUT changing the routing table or updating the haproxy config, even though it found out that is supposed to have ZERO routes.
## Steps to Reproduce
TODO
## Expected result
When a tcp route is deleted, then the tcp router is updated to reflect that.
## Current result
When a tcp route is deleted, the tcp router is not always updated.
## Workarounds
1. Monit restart the tcp-router
1. "Jiggling" the tcp-router by adding/deleting another route.
## Possible Fix
There is likely a race condition with [this code](https://github.com/cloudfoundry/cf-tcp-router/commit/f0e04b3bda6db572cbd4a26cae7558a25883a39a). and the bulk sync.
## More context
This is seen on a somewhat frequent basis in tests that have a relatively low number of routes (It's a test env not a prod env) and yet we are seeing this fairly frequently. I worry about how often this happens with a larger number of moving pieces.
cc: @ssapra
Issue
Sometimes when a tcp route is deleted, the haproxy config (
/var/vcap/jobs/tcp_router/config/haproxy.conf
) does not change. This means that the HAProxy that is still running can still connect to an app using the deleted route. In Routing Release 0.204.0 the logic for how often the config is updated was changed. This change likely introduced this bug. Before version 0.204.0, the config reloaded every 1 minute (or so).Affected Versions
Routing Release 0.204.0 and above.
Context
Setup
Logs
First we get the delete event for 10.0.8.20
{"timestamp":"2020-11-18T03:46:56.017777260Z","message":"tcp-router.handle-event.starting","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.20","port":1040,"modification_tag":{"guid":"029ca272-120a-49a5-6383-96c1d9e5271b","index":47},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"615"}} {"timestamp":"2020-11-18T03:46:56.017840688Z","message":"tcp-router.handle-event.converting-tcp-route-mapping","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.20","port":1040,"modification_tag":{"guid":"029ca272-120a-49a5-6383-96c1d9e5271b","index":47},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"615","tcp-route":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.20","port":1040,"modification_tag":{"guid":"029ca272-120a-49a5-6383-96c1d9e5271b","index":47},"ttl":10,"isolation_segment":""}}}
Correctly remove 10.0.8.20 from the routing table
{"timestamp":"2020-11-18T03:46:56.017865346Z","message":"tcp-router.routing-table.delete-backend.removing-from-table","data":{"info":{"Address":"10.0.8.20","Port":3306,"ModificationTag":{"guid":"029ca272-120a-49a5-6383-96c1d9e5271b","index":47},"TTL":10},"key":{"Port":1040},"new":{"ModificationTag":{"guid":"029ca272-120a-49a5-6383-96c1d9e5271b","index":47},"TTL":10,"UpdatedTime":"2020-11-18T03:46:56.017862046Z"},"old":{"ModificationTag":{"guid":"029ca272-120a-49a5-6383-96c1d9e5271b","index":47},"TTL":10,"UpdatedTime":"2020-11-18T03:46:40.884354603Z"},"session":"2.424"}}
Correctly reconfigure the tcp-router to remove 10.0.8.20
{"timestamp":"2020-11-18T03:46:56.017896523Z","message":"tcp-router.handle-event.calling-configurer","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.20","port":1040,"modification_tag":{"guid":"029ca272-120a-49a5-6383-96c1d9e5271b","index":47},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"615"}} {"timestamp":"2020-11-18T03:46:56.017909478Z","message":"tcp-router.reading-config-file","data":{"config-file":"/var/vcap/data/tcp_router/config/haproxy.conf"}} {"timestamp":"2020-11-18T03:46:56.018122346Z","message":"tcp-router.writing-config","data":{"num-bytes":722}} {"timestamp":"2020-11-18T03:46:56.018225965Z","message":"tcp-router.running-script","data":{}} {"timestamp":"2020-11-18T03:46:56.035922948Z","message":"tcp-router.running-script","data":{"output":""}}
correctly finish delete event for 10.0.8.20
{"timestamp":"2020-11-18T03:46:56.035967937Z","message":"tcp-router.handle-event.finished","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.20","port":1040,"modification_tag":{"guid":"029ca272-120a-49a5-6383-96c1d9e5271b","index":47},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"615"}}
At this point there are two routes. One for each ip: 10.0.8.19 and 10.0.8.21
Then we get the delete event for 10.0.8.19
{"timestamp":"2020-11-18T03:46:56.072056065Z","message":"tcp-router.handle-event.starting","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.19","port":1040,"modification_tag":{"guid":"901c38a0-da76-47bd-755c-4dc2ca4c2550","index":73},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"618"}} {"timestamp":"2020-11-18T03:46:56.072076056Z","message":"tcp-router.handle-event.converting-tcp-route-mapping","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.19","port":1040,"modification_tag":{"guid":"901c38a0-da76-47bd-755c-4dc2ca4c2550","index":73},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"618","tcp-route":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.19","port":1040,"modification_tag":{"guid":"901c38a0-da76-47bd-755c-4dc2ca4c2550","index":73},"ttl":10,"isolation_segment":""}}}
Correctly remove 10.0.8.19 from the routing table
{"timestamp":"2020-11-18T03:46:56.072097421Z","message":"tcp-router.routing-table.delete-backend.removing-from-table","data":{"info":{"Address":"10.0.8.19","Port":3306,"ModificationTag":{"guid":"901c38a0-da76-47bd-755c-4dc2ca4c2550","index":73},"TTL":10},"key":{"Port":1040},"new":{"ModificationTag":{"guid":"901c38a0-da76-47bd-755c-4dc2ca4c2550","index":73},"TTL":10,"UpdatedTime":"2020-11-18T03:46:56.072093808Z"},"old":{"ModificationTag":{"guid":"901c38a0-da76-47bd-755c-4dc2ca4c2550","index":73},"TTL":10,"UpdatedTime":"2020-11-18T03:46:39.774615697Z"},"session":"2.427"}}
Correctly reconfigure the tcp-router to remove 10.0.8.19
{"timestamp":"2020-11-18T03:46:56.072117404Z","message":"tcp-router.handle-event.calling-configurer","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.19","port":1040,"modification_tag":{"guid":"901c38a0-da76-47bd-755c-4dc2ca4c2550","index":73},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"618"}} {"timestamp":"2020-11-18T03:46:56.072133421Z","message":"tcp-router.reading-config-file","data":{"config-file":"/var/vcap/data/tcp_router/config/haproxy.conf"}} {"timestamp":"2020-11-18T03:46:56.072273034Z","message":"tcp-router.writing-config","data":{"num-bytes":536}} {"timestamp":"2020-11-18T03:46:56.072369613Z","message":"tcp-router.running-script","data":{}} {"timestamp":"2020-11-18T03:46:56.089599918Z","message":"tcp-router.running-script","data":{"output":""}}
correctly finish delete event for 10.0.8.19
{"timestamp":"2020-11-18T03:46:56.089643035Z","message":"tcp-router.handle-event.finished","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.19","port":1040,"modification_tag":{"guid":"901c38a0-da76-47bd-755c-4dc2ca4c2550","index":73},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"618"}}
At this point the routing table has one route, and it is for the ip 10.0.8.21
Then a pruning cycle starts and finishes
{"timestamp":"2020-11-18T03:46:59.978084626Z","message":"tcp-router.prune-stale-routes.starting","data":{"session":"619"}} {"timestamp":"2020-11-18T03:46:59.978172568Z","message":"tcp-router.prune-stale-routes.completed","data":{"session":"619"}}
Then a bulk sync cycle starts
{"timestamp":"2020-11-18T03:46:59.979586070Z","message":"tcp-router.bulk-sync.starting","data":{"session":"620"}} {"timestamp":"2020-11-18T03:46:59.979623864Z","message":"tcp-router.uaa-client.started-fetching-token","data":{"endpoint":"https://uaa.service.cf.internal:8443/oauth/token","force-update":false,"session":"621"}} {"timestamp":"2020-11-18T03:46:59.979649253Z","message":"tcp-router.uaa-client.using-cached-token","data":{"session":"621"}} {"timestamp":"2020-11-18T03:46:59.979951782Z","message":"tcp-router.get-stats.start","data":{"session":"622"}} {"timestamp":"2020-11-18T03:46:59.980031711Z","message":"tcp-router.get-stats.connection-successful","data":{"session":"622"}} {"timestamp":"2020-11-18T03:46:59.980047206Z","message":"tcp-router.get-stats.sent-stats-command","data":{"session":"622"}} {"timestamp":"2020-11-18T03:46:59.980356539Z","message":"tcp-router.get-stats.num-bytes-read","data":{"count":1489,"session":"622"}} {"timestamp":"2020-11-18T03:46:59.980449271Z","message":"tcp-router.get-stats.completed","data":{"session":"622"}}
At this point the bulk sync indicates that there should be ZERO routes
{"timestamp":"2020-11-18T03:46:59.982061998Z","message":"tcp-router.bulk-sync.fetched-tcp-routes","data":{"num-routes":0,"session":"620"}} {"timestamp":"2020-11-18T03:46:59.982093986Z","message":"tcp-router.bulk-sync.applying-cached-events","data":{"cache_size":0,"session":"620"}} {"timestamp":"2020-11-18T03:46:59.982104953Z","message":"tcp-router.bulk-sync.applied-cached-events","data":{"session":"620"}}
Then a bulk sync cycle finishes WITHOUT changing the routing table or updating the haproxy config, even though it found out that is supposed to have ZERO routes.
{"timestamp":"2020-11-18T03:46:59.982113253Z","message":"tcp-router.bulk-sync.completed","data":{"session":"620"}}
Then we get the delete event for 10.0.8.21
{"timestamp":"2020-11-18T03:47:06.017581867Z","message":"tcp-router.handle-event.starting","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.21","port":1040,"modification_tag":{"guid":"5ac38837-5e52-4f4e-7cef-68b570e09a9f","index":22},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"623"}} {"timestamp":"2020-11-18T03:47:06.017631673Z","message":"tcp-router.handle-event.converting-tcp-route-mapping","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.21","port":1040,"modification_tag":{"guid":"5ac38837-5e52-4f4e-7cef-68b570e09a9f","index":22},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"623","tcp-route":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.21","port":1040,"modification_tag":{"guid":"5ac38837-5e52-4f4e-7cef-68b570e09a9f","index":22},"ttl":10,"isolation_segment":""}}}
Then the delete event for 10.0.8.21 finishes incorrectly WITHOUT changing the routing table or updating the haproxy config
{"timestamp":"2020-11-18T03:47:06.017651057Z","message":"tcp-router.handle-event.finished","data":{"event":{"TcpRouteMapping":{"router_group_guid":"0fc8cf2b-4916-4088-73c1-ec285edb812f","backend_port":3306,"backend_ip":"10.0.8.21","port":1040,"modification_tag":{"guid":"5ac38837-5e52-4f4e-7cef-68b570e09a9f","index":22},"ttl":10,"isolation_segment":""},"Action":"Delete"},"session":"623"}}