blakelead / couchbase_exporter

Export metrics from Couchbase Server for Prometheus consumption

Errors when running against couchbase 3.0.1 -- timeout exceeded while awaiting headers #20

Closed LordJeffrey closed 5 years ago

LordJeffrey commented 5 years ago

Hello,

Thanks for the help earlier -- it can run now without crashing :) One issue though: it runs, but it doesn't actually get any stats. I get the errors below every time it scrapes. When I go to http://:8091/pools/default/buckets/presence/stats/replications I do see JSON, so the endpoints seem to be there. Any ideas? Thanks!

ERRO[0098] Get http://:8091/pools/default/buckets/mwi/stats: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
ERRO[0098] Get http://:8091/pools/default/buckets/presence/stats: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
ERRO[0098] Could not unmarshal bucketstats data for bucket mwi
LordJeffrey commented 5 years ago

Other errors involving "replications":

ERRO[0098] Get http://:8091/pools/default/buckets/presence/stats/replications%2Fa67eb4ce35e01b1573cc1e2261b1d2f2%2Fpresence%2Fpresence%2Fdocs_opt_repd: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
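
For readers puzzled by the `%2F` sequences in that URL: they are percent-encoded slashes. Judging from the logs in this thread, the stat name looks like `replications/<uuid>/<source bucket>/<target bucket>/<stat>`, escaped so that it travels as a single path segment. Below is a minimal Go sketch of that encoding. The function name `replicationStatPath` and its parameter names are made up for illustration, and this is not necessarily how the exporter builds the URL.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// replicationStatPath builds the REST path for one XDCR replication stat.
// The name layout (uuid/source/target/stat) is inferred from the logs in
// this thread, not taken from the exporter's source.
func replicationStatPath(bucket, uuid, source, target, stat string) string {
	name := strings.Join([]string{"replications", uuid, source, target, stat}, "/")
	// url.PathEscape encodes "/" as %2F, so the whole name stays one segment.
	return "/pools/default/buckets/" + url.PathEscape(bucket) + "/stats/" + url.PathEscape(name)
}

func main() {
	fmt.Println(replicationStatPath(
		"presence",
		"a67eb4ce35e01b1573cc1e2261b1d2f2",
		"presence",
		"presence",
		"docs_opt_repd",
	))
}
```

Running this prints the same path that appears in the error above, which can be handy for requesting a single stat by hand and timing it.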

blakelead commented 5 years ago

Hi @LordJeffrey ,

I'll look into this issue ASAP.

In the meantime, I have a question: when you say that you don't get any stats, do you mean no XDCR stats or no metrics at all?

blakelead commented 5 years ago

I had time to actually test the exporter with Couchbase community 3.0.1 using vagrant, and unfortunately I did not reproduce your errors.

Here's what I did:

I installed Couchbase 3.0.1 on a vagrant ubuntu/trusty64 VM with Docker:

sudo docker run -d --name cb -p 8091-8094:8091-8094 -p 11210:11210 couchbase:community-3.0.1

I connected to the Couchbase node in the browser and used the default configuration. I then created a remote cluster and started a replication between two of the example buckets (beer-sample and default) to test XDCR metrics collection.

I downloaded version 0.5.2 of the exporter and started it on my machine with the following configuration file:

web:
  listenAddress: :9191
  telemetryPath: /metrics

db:
  user: admin
  password: mypassword
  uri: http://192.168.10.10:8091

log:
  level: debug
  format: text

scrape:
  cluster: true
  node: true
  bucket: true
  xdcr: true

The logs I get when requesting the exporter are as follows:

> ./couchbase_exporter
DEBU[0000] Get http://192.168.10.10:8091/pools (6.018162ms)
INFO[0000] Couchbase version: 3.0.1-1444-rel-community
INFO[0000] Community version: true
WARN[0000] Version 3.0.1-1444-rel-community may not be supported by this exporter
DEBU[0000] /Users/aabdelhak/Projets/go/src/github.com/blakelead/couchbase_exporter/metrics/cluster-default.json loaded
DEBU[0000] Cluster exporter registered
DEBU[0000] /Users/aabdelhak/Projets/go/src/github.com/blakelead/couchbase_exporter/metrics/node-default.json loaded
DEBU[0000] Node exporter registered
DEBU[0000] /Users/aabdelhak/Projets/go/src/github.com/blakelead/couchbase_exporter/metrics/bucket-default.json loaded
DEBU[0000] Bucket exporter registered
DEBU[0000] /Users/aabdelhak/Projets/go/src/github.com/blakelead/couchbase_exporter/metrics/bucketstats-default.json loaded
DEBU[0000] Bucketstats exporter registered
DEBU[0000] /Users/aabdelhak/Projets/go/src/github.com/blakelead/couchbase_exporter/metrics/xdcr-default.json loaded
DEBU[0000] XDCR exporter registered
INFO[0000] Listening at :9191
DEBU[0004] Get http://192.168.10.10:8091/pools/default/tasks (4.702954ms)
DEBU[0004] Get http://192.168.10.10:8091/nodes/self (15.828967ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default (21.458141ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets (51.181884ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets (54.048682ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats/replications%2F6735d4da4ef0f0758e89ea83f322f3a5%2Fbeer-sample%2Fdefault%2Fdocs_written (59.044683ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats/replications%2F6735d4da4ef0f0758e89ea83f322f3a5%2Fbeer-sample%2Fdefault%2Fchanges_left (61.957526ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats/replications%2F6735d4da4ef0f0758e89ea83f322f3a5%2Fbeer-sample%2Fdefault%2Frate_received_from_dcp (65.587845ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats/replications%2F6735d4da4ef0f0758e89ea83f322f3a5%2Fbeer-sample%2Fdefault%2Fdocs_filtered (69.969096ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats/replications%2F6735d4da4ef0f0758e89ea83f322f3a5%2Fbeer-sample%2Fdefault%2Fdocs_failed_cr_source (73.142982ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats/replications%2F6735d4da4ef0f0758e89ea83f322f3a5%2Fbeer-sample%2Fdefault%2Fbandwidth_usage (76.05548ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats/replications%2F6735d4da4ef0f0758e89ea83f322f3a5%2Fbeer-sample%2Fdefault%2Fdocs_rep_queue (80.47983ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats/replications%2F6735d4da4ef0f0758e89ea83f322f3a5%2Fbeer-sample%2Fdefault%2Fwtavg_meta_latency (82.655471ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats/replications%2F6735d4da4ef0f0758e89ea83f322f3a5%2Fbeer-sample%2Fdefault%2Fnum_checkpoints (85.58867ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats/replications%2F6735d4da4ef0f0758e89ea83f322f3a5%2Fbeer-sample%2Fdefault%2Fdata_replicated (91.066479ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats/replications%2F6735d4da4ef0f0758e89ea83f322f3a5%2Fbeer-sample%2Fdefault%2Fdocs_checked (100.958467ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats/replications%2F6735d4da4ef0f0758e89ea83f322f3a5%2Fbeer-sample%2Fdefault%2Fwtavg_docs_latency (103.820349ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats/replications%2F6735d4da4ef0f0758e89ea83f322f3a5%2Fbeer-sample%2Fdefault%2Ftime_committing (106.791178ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats/replications%2F6735d4da4ef0f0758e89ea83f322f3a5%2Fbeer-sample%2Fdefault%2Fdocs_received_from_dcp (111.049308ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats/replications%2F6735d4da4ef0f0758e89ea83f322f3a5%2Fbeer-sample%2Fdefault%2Fdocs_opt_repd (113.51766ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats/replications%2F6735d4da4ef0f0758e89ea83f322f3a5%2Fbeer-sample%2Fdefault%2Frate_replicated (115.636506ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats/replications%2F6735d4da4ef0f0758e89ea83f322f3a5%2Fbeer-sample%2Fdefault%2Fsize_rep_queue (118.916487ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats/replications%2F6735d4da4ef0f0758e89ea83f322f3a5%2Fbeer-sample%2Fdefault%2Fnum_failedckpts (123.003632ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/beer-sample/stats (114.714329ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/default/stats (119.81162ms)
DEBU[0004] Get http://192.168.10.10:8091/pools/default/buckets/gamesim-sample/stats (124.265292ms)

There is something you could check: the exporter has a 10-second HTTP timeout. Is there any lag in communication between the exporter and the Couchbase cluster?

LordJeffrey commented 5 years ago

For me, I wasn't getting any stats. This is very good that you tested this and got stats -- hope I didn't waste your time. I'm going to try to redo all my steps to make sure I have everything right and try again. Thanks so much!

LordJeffrey commented 5 years ago

Also, thanks for deleting my comments :)

blakelead commented 5 years ago

I'm thankful that you are using my exporter so don't worry, you're not wasting my time :)

Don't hesitate if you have more info about your issue.

LordJeffrey commented 5 years ago

Doing some testing today. It seems it IS reporting the new stats, as well as other stats, but it looks like I'm actually just timing out. I get this occasionally, at random:

(hostname)8091/pools/default/buckets/presence/stats: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
ERRO[0336] Could not unmarshal bucketstats data for bucket mwi

I get this error for various different stats at random. The most common type involves "replications":

/pools/default/buckets/presence/stats/replications%2Fa67eb4ce35e01b1573cc1e2261b1d2f2%2Fpresence%2Fpresence%2Fdocs_filtered: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

I'm going to look through the documentation to see if there is a way to increase the timeout time. If there isn't, could you post a way? It could be my system is heavily used enough that I'll need more time to deliver the data.

LordJeffrey commented 5 years ago

One thing that seems to sort of work is setting this in "couchbase_exporter.go":

// custom server used to set timeouts
httpSrv := &http.Server{
	Addr:         listenAddr,
	ReadTimeout:  20 * time.Second,
	WriteTimeout: 30 * time.Second,
}

I say it only "sort of" works because I still get the timeout errors (after 10 seconds), but the client request I do in my browser doesn't go on forever. With the default settings, if I get the timeout alerts, the client tries again and again causing a loop of errors every 10 seconds. With these settings, it stops after 10 seconds, prints the stats it has, and yes I still get errors. Not sure how all this works, any insight appreciated.

blakelead commented 5 years ago

I'll look into that, and you are right: I should make the timeouts configurable. I'll do that ASAP!

LordJeffrey commented 5 years ago

Sweet. I'm going to be out all next week, so no rush (if you were rushing, haha). Cheers.

blakelead commented 5 years ago

Hi @LordJeffrey,

I can't reproduce the timeouts you have but I added 2 new parameters in version 0.6.0:

I hope this will solve your problem when you get back :)