Open switchtrue opened 3 years ago
I need to contact GCP and see if I can get my Bigtable instance back. The last time I tried it out I noticed some odd behavior around the Bigtable gRPC client: it looked like it was taking a long time (as you said, 5 seconds to over a minute) with just a single test query. I think the "prod" TSDB code was still working OK with the driver that was set at the time I tried it, but I haven't checked since.
Some other things to try:
tsd.network.worker_threads, although I have not pushed the latter aggressively.
tsd.mode set to rw, although it's not clear if changing this would lead to any performance benefit.
We are running OpenTSDB against Google Bigtable and are having an issue where multiple parallel requests appear to lock up the TSD to the point where it is unable to accept new requests despite available resources.
We are running the TSDs as an auto-scaling instance group on Google Cloud although I don't think this is relevant to the issue.
Specifically, what we are seeing is:
It almost feels like we are reaching a threshold on the number of concurrent requests: if we have X long-running (5-10s) queries in flight, we are simply unable to accept new requests despite having the capacity to do so. I am unable to find anything in the documentation that explains how the OpenTSDB web server works or what tuning parameters exist for it.
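One way to pin down where that threshold sits is to fire a fixed number of identical queries in parallel and watch how the slowest latency changes as the number grows. The following is only a minimal sketch under assumptions: the host name, the metric, and the helper names (`http_fetch`, `probe`, `sweep`) are hypothetical and would need to be replaced with real values.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Hypothetical TSD endpoint and metric; substitute a real host and query.
TSD_URL = "http://tsd.example.internal:4242/api/query?start=1h-ago&m=sum:some.metric"


def http_fetch(url, timeout=120):
    """Issue one GET, read the whole body, and return the HTTP status code."""
    with urlopen(url, timeout=timeout) as resp:
        resp.read()
        return resp.status


def probe(concurrency, fetch=http_fetch, url=TSD_URL):
    """Fire `concurrency` requests at once; return sorted latencies in seconds."""
    def timed(_):
        start = time.monotonic()
        fetch(url)
        return time.monotonic() - start

    # One thread per request so every request is in flight at the same time.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return sorted(pool.map(timed, range(concurrency)))


def sweep(levels, fetch=http_fetch, url=TSD_URL):
    """Run probe() at each concurrency level; return {level: (fastest, slowest)}."""
    results = {}
    for level in levels:
        latencies = probe(level, fetch=fetch, url=url)
        results[level] = (latencies[0], latencies[-1])
    return results
```

Calling `sweep([1, 2, 4, 8, 16, 32])` against a single TSD (bypassing the load balancer) and comparing the slowest latency per level should show whether requests start queueing at a particular concurrency. A failed request raises out of `probe`, which in this sketch is treated as a signal in its own right rather than handled.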
We compiled TSDB ourselves and are simply running the TSDs with tsdb tsd and directing all traffic directly to port 4242 via a GCP load balancer, i.e. we are not running nginx or anything in front of the TSDs. We have tsd.network.async_io set to True and have attempted to increase tsd.network.worker_threads to 4 * CPU Cores in an attempt to get the TSDs to do more work, but this seems to have had no effect.
Our other thought is that we could be stuck behind stop-the-world garbage collection. We currently have insufficient monitoring to prove this but are working to add this in.
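For reference, the two network settings described above can be written out as opentsdb.conf lines. This is a rough sketch only: the helper name `tsd_network_settings` and its parameters are illustrative, and the 8-core machine in the example call is an assumption, not our actual instance size.

```python
import os


def tsd_network_settings(cpu_cores=None, threads_per_core=4):
    """Return opentsdb.conf lines for the two network settings described above."""
    # Fall back to the local core count; os.cpu_count() can return None.
    cores = cpu_cores or os.cpu_count() or 1
    return [
        "tsd.network.async_io = true",
        f"tsd.network.worker_threads = {cores * threads_per_core}",
    ]


# Assumed 8-core machine, purely for illustration.
print("\n".join(tsd_network_settings(cpu_cores=8)))
```

With the assumed 8 cores this gives a worker thread count of 32, i.e. the "4 * CPU Cores" value mentioned above.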
Does anyone have any idea as to why we might be seeing these issues? Are our assumptions possible or are we looking in the wrong place? Is there something additional we can look to tune?
Thanks, Mike