context deadline exceeded despite high timeout

Tom-Fawcett commented 5 years ago

I'm seeing an issue where scrapes of this exporter are failing with error context deadline exceeded.

I can't see how this is happening when:

My timeout is high.
The error occurs very soon after the scrape GET.

SQL Exporter logs

Mar 22 14:20:20 host sql_exporter[3668]: I0322 14:20:20.893037    3668 main.go:52] Starting SQL exporter (version=0.4, branch=master, revision=6dfbe982f1e42340e2a87e0d30599383c02f0bed) (go=go1.9.1, user=alin@wheat, date=2017 1128-15:26:05)
Mar 22 14:20:20 host sql_exporter[3668]: I0322 14:20:20.923174    3668 config.go:18] Loading configuration from /etc/sql_exporter/sql_exporter.yml
Mar 22 14:20:20 host sql_exporter[3668]: I0322 14:20:20.937702    3668 config.go:131] Loaded collector "foo-collector" from /etc/sql_exporter/collectors/foo-collector.collector.yml
Mar 22 14:20:20 host sql_exporter[3668]: I0322 14:20:20.937795    3668 collector.go:67] [, collector="foo-collector"] Non-zero min_interval (30s), using cached collector.
Mar 22 14:20:20 host sql_exporter[3668]: I0322 14:20:20.937849    3668 main.go:67] Listening on :9399
Mar 22 14:22:09 host sql_exporter[3668]: I0322 14:22:09.040451    3668 sql.go:88] Database handle successfully opened with driver postgres.
Mar 22 14:22:09 host sql_exporter[3668]: I0322 14:22:09.040451    3668 sql.go:88] Database handle successfully opened with driver postgres.
Mar 22 14:22:09 host sql_exporter[3668]: I0322 14:22:09.108490    3668 collector.go:124] [, collector="foo-collector"] Collecting fresh metrics: min_interval=30.000s cache_age=9223372036.855s
Mar 22 14:22:18 host sql_exporter[3668]: I0322 14:22:18.986887    3668 promhttp.go:38] Error gathering metrics: [from Gatherer #1] [, collector="foo-collector"] context deadline exceeded
Mar 22 14:22:52 host sql_exporter[3668]: I0322 14:22:52.166343    3668 promhttp.go:38] Error gathering metrics: 4 error(s) occurred:
Mar 22 14:22:52 host sql_exporter[3668]: * [from Gatherer #1] [, collector="foo-collector", query="query1"] context deadline exceeded
Mar 22 14:22:52 host sql_exporter[3668]: * [from Gatherer #1] [, collector="foo-collector", query="query2"] sql: Rows are closed
Mar 22 14:22:52 host sql_exporter[3668]: * [from Gatherer #1] [, collector="foo-collector", query="query2"] context deadline exceeded
Mar 22 14:22:52 host sql_exporter[3668]: * [from Gatherer #1] [, collector="foo-collector", query="query3"] context deadline exceeded

Postgres logs:

2019-03-22 14:22:18.779 GMT [3824] LOG:  duration: 9660.997 ms  execute 1: query4
2019-03-22 14:22:19.090 GMT [3825] LOG:  duration: 9971.390 ms  execute 1: query 1
2019-03-22 14:22:19.333 GMT [3826] LOG:  duration: 10206.832 ms  execute 1: query2
2019-03-22 14:22:52.032 GMT [3822] LOG:  duration: 42916.951 ms  execute 1: query3

SQL Exporter config:

global:
  scrape_timeout_offset: 3s
  min_interval: 30s
  max_connections: 16
  max_idle_connections: 16

Prometheus config:

  scrape_interval: 2m
  scrape_timeout: 115s

SQL exporter logs show that the database queries fail after 9 seconds. How can this happen when SQL exporter has 112 seconds to return in?

free commented 5 years ago

Ouch, that would appear to be my fault.

Initially SQL Exporter used a configuration-provided query timeout and when I found out about Prometheus' X-Prometheus-Scrape-Timeout-Seconds header (which informs targets about its scrape_timeout) I decided to use that instead (in order to allow Prometheus to control the timeout).

But while I did remove the field from the sample config, I still kept the old scrape_timeout configuration field around (in order to allow further limiting of the timeout) and (the cause of your issues) its default value of 10 seconds. So unless one specifies a global.scrape_timeout value in the configuration, the actual timeout SQL Exporter uses is:

min(prometheus_scrape_timeout - scrape_timeout_offset, 10s)

Hence the ~10s timeouts you're seeing.

The workaround is to explicitly provide a global.scrape_timeout in your sql_exporter.yml and set it to some very large value. Or at least larger than 120s, in your case.

Tom-Fawcett commented 5 years ago

Thank-you for the fast reply.

I can confirm setting global.scrape_timeout higher than Prometheus' scrape_timeout solves it.

I guess global.scrape_timeout is potentially still a useful setting to have if you want to set a master timeout exporter side. Perhaps having it default to unlimited, so Prometheus' scrape_timeout is the master would be more intuitive?

free commented 5 years ago

Perhaps having it default to unlimited, so Prometheus' scrape_timeout is the master would be more intuitive?

That's what I thought too, but it may not be the best idea ever. If something else than Prometheus queries SQL Exporter's /metrics page, it may hang forever, together with the DB connection. So I'm thinking to maybe just set it to some large value (5 minutes, as Prometheus doesn't deal well with scrape intervals over 5 minutes) and make sure to document it again (at least in the sample config).

Tom-Fawcett commented 5 years ago

Good point, that sounds reasonable

arkyhuang commented 5 years ago

@free same issue and take the WA to resolve it. So could you add below config and comments into default sql_exporter.yml in tarball? Otherwise it's easy to ignore it. Thx.

scrape_timeout: 10s

free commented 5 years ago

10 seconds is actually the default hardcoded into SQL Exporter. There's no need to put it in the YAML config.

free / sql_exporter