Poor resource utilisation of TSDs

switchtrue commented 3 years ago

We are running OpenTSDB against Google Bigtable and are having an issue where multiple parallel requests appear to lock up the TSD to the point where it is unable to accept new requests despite available resources.

We are running the TSDs as an auto-scaling instance group on Google Cloud although I don't think this is relevant to the issue.

Specifically, what we are seeing is:

Low CPU (~10%) and memory usage on the TSDs
40% load on Bigtable with reasonable reads/writes (well within expected bounds)
Low SSD utilisation on Bigtable
Highly inconsistent latencies (5 to 70 seconds) for identical TSDB queries that return identical responses
Failing health checks (timeouts on /api/stats) despite all of the above
Increasing the CPU count per TSD improves the issue despite low CPU utilisation

It almost feels like we are hitting a limit on the number of concurrent requests: once X long-running (5-10s) queries are in flight, the TSD is simply unable to accept new requests despite having the capacity to do so. I am unable to find anything in the documentation that explains how the OpenTSDB web server works or which tuning parameters apply here.
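The suspicion above (every worker thread is tied up by a slow query, so nothing is left to serve a new request) can be modelled in a few lines of Java. This is only a sketch of the general thread-starvation pattern, not of how the TSD's server is actually built: the class name WorkerStarvationDemo, the pool sizes, and the timings are all made up for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of a server with a fixed number of worker threads.
public class WorkerStarvationDemo {

    /**
     * Fills a fixed pool of `workers` threads with `slowQueries` blocking tasks,
     * then submits a trivial "health check" and reports whether it completes
     * within `healthTimeoutMs`.
     */
    public static boolean healthCheckPasses(int workers, int slowQueries,
                                            long queryMs, long healthTimeoutMs)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            for (int i = 0; i < slowQueries; i++) {
                pool.submit(() -> {
                    try {
                        // Stand-in for a worker blocked on a slow storage read:
                        // the thread is occupied but uses no CPU.
                        Thread.sleep(queryMs);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The health check is cheap, but it must wait for a free worker.
            Future<String> health = pool.submit(() -> "ok");
            try {
                health.get(healthTimeoutMs, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false;
            }
        } finally {
            pool.shutdownNow(); // interrupt the sleeping "queries"
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("4 workers, 4 slow queries: health check passes = "
                + healthCheckPasses(4, 4, 2000, 200));
        System.out.println("8 workers, 4 slow queries: health check passes = "
                + healthCheckPasses(8, 4, 2000, 200));
    }
}
```

With four workers and four slow queries the health check times out even though the process is almost idle; with eight workers it passes. If something inside the TSD behaves like this and its pool size scales with the number of cores, that would be consistent with more CPUs helping despite low CPU utilisation, but that is an assumption to verify, not a diagnosis.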

We compiled TSDB ourselves and are simply running the TSDs with tsdb tsd, directing all traffic straight to port 4242 via a GCP load balancer, i.e. we are not running nginx or anything else in front of the TSDs.

We have tsd.network.async_io set to True and have attempted to increase tsd.network.worker_threads to 4 * CPU Cores in an attempt to get the TSDs to do more work but this seems to have had no effect.

Our other thought is that we could be stuck behind stop-the-world garbage collection. We currently have insufficient monitoring to prove this but are working on adding it.
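One way to get a first signal on the garbage collection theory is the JVM's standard GarbageCollectorMXBean, which exposes accumulated collection counts and times. The sketch below samples those counters once per second inside its own JVM; to observe a running TSD the same beans would have to be read over JMX, or GC logging enabled on the TSD's JVM (-Xlog:gc* on JDK 9+, -XX:+PrintGCDetails on JDK 8). The class name GcPauseProbe and the sampling interval are illustrative choices, and the reported time is approximate elapsed collection time, which for concurrent collectors is not purely pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical probe: prints how much GC activity happened in each interval.
public class GcPauseProbe {

    /** Sum of accumulated collection time (ms) across all collectors in this JVM. */
    public static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Sum of collection counts across all collectors in this JVM. */
    public static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount(); // -1 when the collector does not report it
            if (c > 0) {
                total += c;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long prevTime = totalGcTimeMs();
        long prevCount = totalGcCount();
        for (int i = 0; i < 5; i++) {
            Thread.sleep(1000);
            long time = totalGcTimeMs();
            long count = totalGcCount();
            System.out.println("last 1s: collections=" + (count - prevCount)
                    + " gcTimeMs=" + (time - prevTime));
            prevTime = time;
            prevCount = count;
        }
    }
}
```

If the GC time per interval approaches the interval length while queries are stalling, long collections become a plausible culprit; if it stays near zero, the cause is more likely elsewhere.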

Does anyone have any idea as to why we might be seeing these issues? Are our assumptions possible or are we looking in the wrong place? Is there something additional we can look to tune?

Thanks, Mike

manolama commented 3 years ago

I need to contact GCP and see if I can get my Bigtable instance back. The last time I tried it out I noticed some odd behavior around the Bigtable gRPC client: a single test query was taking a long time (as you said, 5 seconds to over a minute). I think the "prod" TSDB code was still working OK with the driver that was set at the time I tried it, but I haven't checked since.

Some other things to try:

If you stand up multiple TSD instances behind a load balancer, does the situation improve? It could, because there is likely some thread starvation happening somewhere in a single TSD (which would explain the low CPU utilization).
Are you querying through the same TSD that's writing data? If so, do the writes drop during that period?
The failing stats checks would be due to being unable to reach Bigtable as that call checks the UID assignment row.
Could you capture a thread dump when the queries are stalled please? I'm guessing they're all waiting for data or one or two may be looping on something.
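For the thread dump itself, the usual JDK tools are jstack <pid> or kill -3 <pid> against the TSD process. When reading the result, a quick count of threads per state is often enough to tell "everything is waiting" apart from "a couple of threads are spinning". The sketch below shows that summary using the standard ThreadMXBean, applied to its own JVM; the class name ThreadStateSummary is hypothetical, and pointing it at a TSD would require a JMX connection that is not shown here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: summarises the live threads of this JVM by state.
public class ThreadStateSummary {

    /** Returns the number of live threads in each Thread.State. */
    public static Map<Thread.State, Integer> countByState() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip lock details, only thread states are needed here.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        countByState().forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```

A dump dominated by WAITING or TIMED_WAITING threads would fit the "all waiting for data" guess, while a few RUNNABLE threads stuck in the same frames across successive dumps would point to looping.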

switchtrue commented 3 years ago

Increasing the number of TSDs doesn't seem to have much impact on the situation. Similarly, neither does changing the number of workers via tsd.network.worker_threads although I have not pushed the latter aggressively.
We have a separate cluster of TSDs for writes - the write cluster seems unaffected. The read cluster does have tsd.mode set to rw although it's not clear if changing this would lead to any performance benefit.
It's difficult to catch it in the act but I have included a thread dump (below) from a period where it's not performing at its best.

threaddump.txt

Poor resource utilisation of TSDs #2080