cloudfoundry-attic / hm9000


hm9000.desired_state_batch_size default value too high #22

Closed jsievers closed 7 years ago

jsievers commented 7 years ago

We recently analyzed an HM9000 failure which turned out to be caused by timeouts in hm9000_analyzer.log like

{"timestamp":"1481728235.277132988","source":"fetcher","message":"fetcher.HTTP request failed with error","log_level":2,"data":{"error":"Get https://api.REDACTED/bulk/apps?batch_size=5000\u0026bulk_token={}: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"}}

This led to an HM9000 failure (which went unnoticed by monit or BOSH).

We found that reducing the hm9000.desired_state_batch_size from 5000 to 500 works around the problem.
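
For cf-release deployments, the workaround is a plain manifest property override; a minimal sketch (the exact manifest layout varies per deployment):

properties:
  hm9000:
    desired_state_batch_size: 500   # lowered from the current cf-release default of 5000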

Measurements with curl for the above URL show that the request time scales linearly with the batch size (and with the total number of apps):

$ time curl -s -u REDACTED:REDACTED 'https://api.REDACTED/bulk/apps?batch_size=50&bulk_token=%7B%7D' -o /dev/null
real    0m1.369s

$ time curl -s -u REDACTED:REDACTED 'https://api.REDACTED/bulk/apps?batch_size=500&bulk_token=%7B%7D' -o /dev/null
real    0m10.928s

$ time curl -s -u REDACTED:REDACTED 'https://api.REDACTED/bulk/apps?batch_size=1000&bulk_token=%7B%7D' -o /dev/null
real    0m19.789s

You can see that, extrapolating linearly at roughly 20 ms per app, a batch size of about 2000 would exceed the 30-second default timeout of our CF installation. So for a CF installation with a steadily growing number of apps, at some point you hit the timeout.

We found that the batch size used to be 500 but was changed, for reasons unknown to us, to 5000 in commit https://github.com/cloudfoundry/cf-release/commit/c6b784820debd24911218e89bb68b35efe7e0a81

However, the same commit did not increase the timeout by the same factor of 10 (assuming the linear behaviour we observed above).

Also, the docs still state "desired_state_batch_size: The batch size when fetching desired state information from the CC. Set to 500." The default config.json also still uses batch size 500 and timeout 10s.

Should the default batch size be lowered and/or the default timeout be increased?

cf-gitbot commented 7 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/136173353

The labels on this github issue will be updated when the story is started.

fraenkel commented 7 years ago

There should be no issue retrieving 5000 items in 30 s. If it is timing out, I would look at where your actual problem lies: is the problem retrieving the data from the database (in which case, is it misconfigured for proper usage), or is your Cloud Controller undersized?

This is the first report of any issue since that change was made over 1 1/2 years ago.

jsievers commented 7 years ago

Our problem may not be related to commit c6b7848; that was just a guess.

We checked our api VMs and they do not seem overloaded with respect to CPU, memory, or job queue size.

cloud_controller_ng.log.txt

I have attached the cloud_controller_ng.log for a single bulk/apps request (see above), i.e. filtered for a single vcap-request-id. This CF installation has about 1200 DEA apps.

You can see from the query timings in the attached log that roughly 20 seconds of the request are spent on the database.

questions:

fraenkel commented 7 years ago

Looks like when the models were changed in CC, they made the queries inefficient. We will investigate ways to improve this.

jsievers commented 7 years ago

FYI, we are using Cloud Foundry v247.

fraenkel commented 7 years ago

I would think this problem would have surfaced after July.

jsievers commented 7 years ago

It's possible this problem has been dormant for a while already and we simply had not hit the timeout threshold yet. The problem turned up in production, and we suspect that the steadily growing total number of DEA apps (or some other growing model entity) eventually made the request execution slower than the timeout, which brought the problem to the surface.

jsievers commented 7 years ago

Not sure if it is related, but I reported a similar issue with CF v240 where the Cloud Controller database queries are very slow (in that case causing out-of-memory errors): https://github.com/cloudfoundry/cloud_controller_ng/issues/694

fraenkel commented 7 years ago

Could be. All of these are Cloud Controller issues that appeared when they switched to support v3.

jsievers commented 7 years ago

In the meantime we had to configure a batch size of 250, and we still hit the timeout in one of our production installations. Am I right that the only two workaround options we have are to lower hm9000.desired_state_batch_size and to increase hm9000.fetcher_network_timeout_in_seconds?
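
For reference, in the deployment manifest those two knobs would look roughly like this (a sketch; the timeout value below is only illustrative, the default config.json ships 10 s):

properties:
  hm9000:
    desired_state_batch_size: 250            # lowered further, as described above
    fetcher_network_timeout_in_seconds: 30   # illustrative value; default is 10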

Would migrating apps to Diego also work around the problem, i.e. does Diego not use the bulk API that is causing the problem here?

I'm afraid that at some point we will not be able to lower the batch size or increase the timeout any further if the performance keeps degrading.

fraenkel commented 7 years ago

We have submitted a PR, https://github.com/cloudfoundry/cloud_controller_ng/pull/761, to fix two of the issues we have found. I will be closing this issue, since you will need to track the PR for further resolution.