Closed KesavanKing closed 4 years ago
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/172529992
The labels on this github issue will be updated when the story is started.
Details of our scalability tests can be found at https://github.com/perf-cfk8s/docs/wiki/cf-for-k8s-v0.1.0
@KesavanKing thanks for sharing this! I have a few questions.
Were the unavailable routes for newly pushed apps? Were they available at one point and became unavailable or did they never become available to begin with?
Also, how healthy were the Cloud Controller APIs at the point when you started seeing routes become unavailable? Based on the link you shared above, it sounds like they were starting to return 500s. If this is the case, I have a feeling that our new design involving the Route CRD and RouteController may help some here.
The current (cfroutesync) implementation was good for getting the MVP established, but it starts to have issues at scale because it makes expensive requests to Cloud Controller every few seconds to fetch information about all routes in the system. I've seen those requests take upwards of 3 seconds in cf-for-k8s environments with more than 1000 routes/apps, and if the Cloud Controllers are unhealthy the requests may not succeed at all, so we wouldn't discover new routes.
We touched on this a bit in this issue: https://github.com/cloudfoundry/capi-k8s-release/issues/17
Our new Route CRD / RouteController approach should help with this by enabling Cloud Controller to send individual route updates directly to the Kubernetes API as changes come in. In other words, a cf map-route will cause CAPI to update a Route resource on k8s that we can act on immediately.
Once we have the RouteController work completed and in a release of cf-for-k8s, it would be awesome if y'all could run these tests again!
@tcdowney Thanks for the response.
The route became available after some time. We had a timeout of 5 minutes set to break the loop and continue with the next iteration. When I checked after around 10 minutes, the route was available. CAPI was definitely healthy. Overall, it's an inconsistent experience for pushing apps.
We are aware of the new Route CRD implementation. It looks promising at high scale. We are happy to re-run the tests once the RouteController is completed.
Hi @KesavanKing,
cf-for-k8s now has our new implementation using RouteController and the new Route CRD. Would you like to re-run your test and post back your results?
Hi @christianang
Thanks for reaching out to us again. We performed the tests with the routecontroller implementation from cf-for-k8s. We no longer see the route unavailability issue, and we were able to successfully deploy around 1000 applications.
We are facing two new issues:
1. A lot of apps get "Staging time Expired" (#140), which is already reported.
2. The following error from cf-api-server:
Failed to create/update/delete Route resource with guid 'f4cc6bd0-4242-4fa1-bc88-57ea8814049c' on Kubernetes\", \"error_code\"=>\"CF-KubernetesRouteResourceError
{"timestamp":1592899641.142837,"message":"Failed to Update Route CRD: HTTP status code 409, Operation cannot be fulfilled on routes.networking.cloudfoundry.org \"c0e42e0d-e6cc-4cce-b008-b8e976be6dea\": the object has been modified; please apply your changes to the latest version and try again for PUT https://kubernetes.default/apis/networking.cloudfoundry.org/v1alpha1/namespaces/cf-workloads/routes/c0e42e0d-e6cc-4cce-b008-b8e976be6dea","log_level":"info","source":"cc.action.route_update","data":{"request_guid":"257c79aa-76f1-4a88-a3a7-7f91c6fdc1f2::8e837f91-8cb2-45ed-ab06-54c40e3221a0"},"thread_id":47384772400140,"fiber_id":47384772089500,"process_id":1,"file":"/cloud_controller_ng/lib/kubernetes/route_crd_client.rb","lineno":53,"method":"rescue in update_destinations"} {"timestamp":1592899641.1436174,"message":"Request failed: 422: {\"description\"=>\"Failed to create/update/delete Route resource with guid 'c0e42e0d-e6cc-4cce-b008-b8e976be6dea' on Kubernetes\", \"error_code\"=>\"CF-KubernetesRouteResourceError\", \"code\"=>400001, \"test_mode_info\"=>{\"description\"=>\"Failed to create/update/delete Route resource with guid 'c0e42e0d-e6cc-4cce-b008-b8e976be6dea' on Kubernetes\", \"error_code\"=>\"CF-KubernetesRouteResourceError\", \"backtrace\"=>[\"/usr/local/lib/ruby/gems/2.5.0/gems/kubeclient-4.5.0/lib/kubeclient/common.rb:130:in `rescue in handle_exception'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/kubeclient-4.5.0/lib/kubeclient/common.rb:120:in `handle_exception'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/kubeclient-4.5.0/lib/kubeclient/common.rb:391:in `update_entity'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/kubeclient-4.5.0/lib/kubeclient/common.rb:240:in `block (2 levels) in define_entity_methods'\", \"/cloud_controller_ng/lib/kubernetes/route_crd_client.rb:50:in `update_destinations'\", \"/cloud_controller_ng/app/actions/v2/route_mapping_create.rb:52:in `add'\", \"/cloud_controller_ng/app/controllers/runtime/routes_controller.rb:262:in `add_app'\", 
\"/cloud_controller_ng/app/controllers/base/base_controller.rb:84:in `dispatch'\", \"/cloud_controller_ng/lib/cloud_controller/rest_controller/routes.rb:16:in `block in define_route'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:1634:in `call'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:1634:in `block in compile!'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:992:in `block (3 levels) in route!'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:1011:in `route_eval'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:992:in `block (2 levels) in route!'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:1040:in `block in process_route'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:1038:in `catch'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:1038:in `process_route'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:990:in `block in route!'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:989:in `each'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:989:in `route!'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:1097:in `block in dispatch!'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:1076:in `block in invoke'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:1076:in `catch'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:1076:in `invoke'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:1094:in `dispatch!'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:924:in `block in call!'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:1076:in `block in invoke'\", 
\"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:1076:in `catch'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:1076:in `invoke'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:924:in `call!'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:913:in `call'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/rack-protection-2.0.5/lib/rack/protection/xss_header.rb:18:in `call'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/rack-protection-2.0.5/lib/rack/protection/path_traversal.rb:16:in `call'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/rack-protection-2.0.5/lib/rack/protection/json_csrf.rb:26:in `call'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/rack-protection-2.0.5/lib/rack/protection/base.rb:50:in `call'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/rack-protection-2.0.5/lib/rack/protection/base.rb:50:in `call'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/rack-protection-2.0.5/lib/rack/protection/frame_options.rb:31:in `call'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/rack-2.2.2/lib/rack/null_logger.rb:11:in `call'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/rack-2.2.2/lib/rack/head.rb:12:in `call'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:194:in `call'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/sinatra-2.0.5/lib/sinatra/base.rb:1957:in `call'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/rack-2.2.2/lib/rack/urlmap.rb:74:in `block in call'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/rack-2.2.2/lib/rack/urlmap.rb:58:in `each'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/rack-2.2.2/lib/rack/urlmap.rb:58:in `call'\", \"/cloud_controller_ng/middleware/request_logs.rb:38:in `call'\", \"/cloud_controller_ng/middleware/security_context_setter.rb:19:in `call'\", \"/cloud_controller_ng/middleware/vcap_request_id.rb:15:in `call'\", \"/cloud_controller_ng/middleware/cors.rb:49:in `call_app'\", \"/cloud_controller_ng/middleware/cors.rb:14:in `call'\", 
\"/cloud_controller_ng/middleware/request_metrics.rb:12:in `call'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/rack-2.2.2/lib/rack/builder.rb:244:in `call'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/thin-1.7.2/lib/thin/connection.rb:86:in `block in pre_process'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/thin-1.7.2/lib/thin/connection.rb:84:in `catch'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/thin-1.7.2/lib/thin/connection.rb:84:in `pre_process'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/thin-1.7.2/lib/thin/connection.rb:50:in `block in process'\", \"/usr/local/lib/ruby/gems/2.5.0/gems/eventmachine-1.0.9.1/lib/eventmachine.rb:1067:in `block in spawn_threadpool'\"]}}","log_level":"info","source":"cc.api","data":{"request_guid":"257c79aa-76f1-4a88-a3a7-7f91c6fdc1f2::8e837f91-8cb2-45ed-ab06-54c40e3221a0"},"thread_id":47384772400140,"fiber_id":47384772089500,"process_id":1,"file":"/cloud_controller_ng/lib/sinatra/vcap.rb","lineno":44,"method":"block in registered"}
When we check, the relevant Route CRD has been created, but CAPI still reports this error. If you are sure that this is a CAPI problem, we will raise an issue over there.
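For context on the 409 in the log above: the message "the object has been modified; please apply your changes to the latest version and try again" is what the Kubernetes API returns when an update carries a stale resourceVersion. A common way for a client to handle it is to re-read the object and retry. The following is only a minimal sketch of that pattern in Python, not what CAPI actually does; `ConflictError`, `update_with_retry`, and the `fetch`/`mutate`/`put` callables are hypothetical names used for illustration.

```python
import time


class ConflictError(Exception):
    """Hypothetical stand-in for an HTTP 409 from the Kubernetes API."""


def update_with_retry(fetch, mutate, put, attempts=5, backoff=0.1, sleep=time.sleep):
    """Re-read, re-apply, and re-send an update until it stops conflicting.

    fetch()  -> returns the latest copy of the object (with its resourceVersion)
    mutate() -> applies the desired change to that copy in place
    put()    -> sends the copy back; raises ConflictError on a stale version
    """
    for attempt in range(attempts):
        obj = fetch()      # always start from the latest version
        mutate(obj)        # re-apply the change on top of it
        try:
            return put(obj)
        except ConflictError:
            if attempt == attempts - 1:
                raise      # give up after the last attempt
            sleep(backoff * 2 ** attempt)  # exponential backoff between tries
```

The key point is that the change is re-applied to a freshly fetched object on every attempt, rather than re-sending the same stale payload.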
It is CAPI failing, but it isn't clear from the log message why CAPI can't delete the route. Before reaching out to CAPI, I am curious to know: when this error occurs, is it transient? Are you able to eventually delete the route if you try again? Another theory I have is related to a previous issue where the finalizer wasn't being removed, which prevented the route from being completely deleted. I'm not sure whether CAPI actually waits for finalizers to be removed, but I'd be curious to know if you see any errors in the routecontroller pod, or if you describe the route that is failing to delete and see that a deletionTimestamp is set but the finalizer still exists.
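The finalizer check described above can be sketched as a small Python helper. It assumes the Route has been dumped to JSON first, for example with `kubectl get routes.networking.cloudfoundry.org <guid> -n cf-workloads -o json` (the resource and namespace names are taken from the URL in the log above). The function name and the finalizer value in the usage example are hypothetical.

```python
import json


def stuck_terminating(resource):
    """Return True if deletion was requested but finalizers still block it.

    `resource` is the parsed JSON of a Kubernetes object. A non-empty
    metadata.deletionTimestamp together with a non-empty metadata.finalizers
    list means the object is waiting for a controller to remove a finalizer.
    """
    meta = resource.get("metadata", {})
    return bool(meta.get("deletionTimestamp")) and bool(meta.get("finalizers"))


# Usage with a hand-written example object (the finalizer name is made up):
example = json.loads("""
{
  "metadata": {
    "name": "c0e42e0d-e6cc-4cce-b008-b8e976be6dea",
    "deletionTimestamp": "2020-06-23T08:07:21Z",
    "finalizers": ["example.finalizer/hypothetical"]
  }
}
""")
print(stuck_terminating(example))
```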
There were no error logs in routecontroller, and we are sure that the error message is not about deleting the route. We got this error from CAPI when we pushed the apps. The Route object was created successfully, and I am able to get the Route object by its ID, but the app was not deployed.
I see. It may be worth asking CAPI.
Sure. I'll raise the issue in CAPI. I am closing this issue as the routecontroller implementation solved it. Thanks for the nice work and support.
We deployed cf-for-k8s and performed scalability tests. We pushed source-code-based apps concurrently, 10 at a time. To measure push, start, and route availability times, we designed a stack that emits those metrics. During the tests, at around 600 apps, we saw that some routes were unavailable even after waiting 5 minutes. (We set the script to check for the route for 5 minutes, then time out and continue with the next push.)
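The route check described above (poll for up to 5 minutes, then give up and move on) could look roughly like the Python sketch below. This is not the actual test script; `wait_for_route`, `http_ok`, the 5-second polling interval, and the "any 2xx response counts as available" rule are all assumptions made for illustration.

```python
import time
import urllib.request


def http_ok(url):
    """Assumed availability probe: True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError/HTTPError and socket-level failures
        return False


def wait_for_route(url, timeout=300.0, interval=5.0, probe=None,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll `url` until it is available or `timeout` seconds have passed.

    Returns the seconds elapsed until the route answered, or None on timeout
    so the caller can record the failure and continue with the next push.
    """
    probe = probe or http_ok
    start = clock()
    while True:
        if probe(url):
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

Returning the elapsed time makes it easy to emit a route availability metric per push, while a `None` result marks a route that never came up within the window.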
In the image above, you can see continuous dips in route availability after 19:00, even though the apps' staging completed and they went into the started state.