OpenTSDB / opentsdb

A scalable, distributed Time Series Database.
http://opentsdb.net
GNU Lesser General Public License v2.1

Poor resource utilisation of TSDs #2080

Open switchtrue opened 3 years ago

switchtrue commented 3 years ago

We are running OpenTSDB against Google Bigtable and are having an issue where multiple parallel requests appear to lock up the TSD to the point where it is unable to accept new requests despite available resources.

We are running the TSDs as an auto-scaling instance group on Google Cloud, although I don't think this is relevant to the issue.

Specifically, what we are seeing is:

It almost feels like we are hitting a limit on the number of concurrent requests: if we have X long-running (5-10s) queries in flight, we are simply unable to accept new requests despite having the capacity to do so. I am unable to find anything in the documentation that explains how the OpenTSDB webserver works or describes any tuning parameters around this.

We compiled TSDB ourselves and are simply running the TSDs with tsdb tsd, directing all traffic straight to port 4242 via a GCP load balancer, i.e. we are not running nginx or anything else in front of the TSDs.

We have tsd.network.async_io set to true and have tried increasing tsd.network.worker_threads to 4 * CPU cores to get the TSDs to do more work, but this seems to have had no effect.

Our other thought is that we could be stuck behind stop-the-world garbage collection. We currently have insufficient monitoring to prove this but are working on adding it.
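
As a rough illustration of the kind of monitoring mentioned above, the following Java sketch polls the JVM's standard GarbageCollectorMXBean counters and prints how many collections ran and how much collection time accrued in each interval. The class name GcPauseProbe, the sample count and the one-second interval are assumptions made for illustration and are not part of OpenTSDB. The probe only sees the JVM it runs in, so observing a TSD would mean running equivalent code inside the TSD process or reading the same MXBeans over JMX. For some collectors the reported time includes concurrent phases, so it may overstate actual pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of OpenTSDB: samples cumulative GC counters
// of the current JVM and reports the change per interval.
class GcPauseProbe {

    // Sum of cumulative collection time across all collectors, in milliseconds.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Sum of cumulative collection counts across all collectors.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount(); // -1 when the collector does not report it
            if (c > 0) {
                total += c;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        final long intervalMillis = 1000; // assumed sampling interval
        final int samples = 5;            // assumed number of samples

        long prevTime = totalGcTimeMillis();
        long prevCount = totalGcCount();

        for (int i = 0; i < samples; i++) {
            Thread.sleep(intervalMillis);
            long time = totalGcTimeMillis();
            long count = totalGcCount();
            // A large time delta relative to the interval suggests the JVM
            // spent much of that interval collecting garbage.
            System.out.printf("collections=%d gcTimeMs=%d intervalMs=%d%n",
                    count - prevCount, time - prevTime, intervalMillis);
            prevTime = time;
            prevCount = count;
        }
    }
}
```

If the GC time delta regularly approaches the interval length while requests are stalling, that would support the garbage collection theory; if it stays near zero, the stall is more likely elsewhere.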

Does anyone have any idea as to why we might be seeing these issues? Are our assumptions possible or are we looking in the wrong place? Is there something additional we can look to tune?

Thanks, Mike

manolama commented 3 years ago

I need to contact GCP and see if I can get my Bigtable instance back. The last time I tried it out, I noticed some odd behavior in the Bigtable gRPC client: it looked like it was taking a long time (as you said, 5 seconds to over a minute) with just a single test query. I think the "prod" TSDB code was still working OK with the driver version that was set at the time I tried it, but I haven't checked since.

Some other things to try:

switchtrue commented 3 years ago

threaddump.txt