SuperQ / chrony_exporter

Exporter for Chrony NTP

Socket: i/o timeout and bind: already in use #103

Open spaullilly opened 1 week ago

spaullilly commented 1 week ago

I switched to running the exporter in the same container as the chronyd server. I'm scraping it every 60s.

What I am seeing:

time=2024-11-04T18:43:46.590Z level=DEBUG source=collector.go:149 msg="Scrape completed" scrape_id=40 seconds=0.004325332
time=2024-11-04T18:44:21.389Z level=DEBUG source=collector.go:146 msg="Scrape starting" scrape_id=41
time=2024-11-04T18:44:21.389Z level=DEBUG source=collector.go:146 msg="Scrape starting" scrape_id=42
time=2024-11-04T18:44:21.389Z level=DEBUG source=collector.go:155 msg="Couldn't connect to chrony" scrape_id=42 address=unix:///run/chrony/chronyd.sock err="dial unixgram /run/chrony/chrony_exporter.7.sock->/run/chrony/chronyd.sock: bind: address already in use"
time=2024-11-04T18:44:21.389Z level=DEBUG source=collector.go:149 msg="Scrape completed" scrape_id=42 seconds=9.6575e-05
time=2024-11-04T18:44:21.389Z level=DEBUG source=tracking.go:176 msg="Got 'tracking' response" scrape_id=41 tracking_packet=SUCCESS
time=2024-11-04T18:44:21.390Z level=DEBUG source=collector.go:191 msg="DNS lookup took" scrape_id=41 seconds=0.000999318
time=2024-11-04T18:44:21.390Z level=DEBUG source=tracking.go:186 msg="Tracking Last Offset" scrape_id=41 offset=0.00017267093062400818
time=2024-11-04T18:44:21.390Z level=DEBUG source=tracking.go:189 msg="Tracking Ref Time" scrape_id=41 ref_time=2024-11-04T18:43:30.632Z
time=2024-11-04T18:44:21.390Z level=DEBUG source=tracking.go:192 msg="Tracking System Time" scrape_id=41 system_time=-0.0014299183385446668
time=2024-11-04T18:44:21.390Z level=DEBUG source=tracking.go:199 msg="Tracking is remote" scrape_id=41 bool_value=1
time=2024-11-04T18:44:21.390Z level=DEBUG source=tracking.go:202 msg="Tracking RMS Offset" scrape_id=41 rms_offset=0.0017726216465234756
time=2024-11-04T18:44:21.390Z level=DEBUG source=tracking.go:205 msg="Tracking Root delay" scrape_id=41 root_delay=0.0168450977653265
time=2024-11-04T18:44:21.390Z level=DEBUG source=tracking.go:208 msg="Tracking Root dispersion" scrape_id=41 root_dispersion=0.0026964773423969746
time=2024-11-04T18:44:21.390Z level=DEBUG source=tracking.go:211 msg="Tracking Frequency" scrape_id=41 frequency=-0.6470763087272644
time=2024-11-04T18:44:21.390Z level=DEBUG source=tracking.go:214 msg="Tracking Residual Frequency" scrape_id=41 residual_frequency=0.017963755875825882
time=2024-11-04T18:44:21.391Z level=DEBUG source=tracking.go:217 msg="Tracking Skew" scrape_id=41 skew=1.2864940166473389
time=2024-11-04T18:44:21.391Z level=DEBUG source=tracking.go:220 msg="Tracking Last Update Interval" scrape_id=41 update_interval=65.0595703125
time=2024-11-04T18:44:21.391Z level=DEBUG source=tracking.go:223 msg="Tracking Stratum" scrape_id=41 stratum=2
time=2024-11-04T18:44:26.393Z level=DEBUG source=collector.go:182 msg="Couldn't get serverstats" scrape_id=41 err="read unixgram /run/chrony/chrony_exporter.7.sock->/run/chrony/chronyd.sock: i/o timeout"
time=2024-11-04T18:44:26.394Z level=DEBUG source=collector.go:149 msg="Scrape completed" scrape_id=41 seconds=5.005050362
time=2024-11-04T18:44:46.586Z level=DEBUG source=collector.go:146 msg="Scrape starting" scrape_id=43

It works sometimes though. Any idea what could be the issue?
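
In case it narrows things down, the bind error itself can be reproduced outside the exporter. This is only a rough sketch of my understanding of the client side (placeholder paths, not the exporter's actual code): chrony's command protocol over the unix socket is datagram-based, so the client has to bind its own socket path for chronyd to reply to, and a second dial that reuses the same client path fails the same way scrape_id=42 does.

package main

import (
    "fmt"
    "net"
    "os"
)

// dialChrony binds a client-side unixgram socket and connects it to chronyd.
// unixgram is connectionless, so the client must own a socket path that
// chronyd can send replies back to.
func dialChrony(localPath, serverPath string) (*net.UnixConn, error) {
    laddr := &net.UnixAddr{Name: localPath, Net: "unixgram"}
    raddr := &net.UnixAddr{Name: serverPath, Net: "unixgram"}
    return net.DialUnix("unixgram", laddr, raddr)
}

func main() {
    local := "/run/chrony/chrony_exporter_demo.sock" // placeholder client path
    server := "/run/chrony/chronyd.sock"

    c1, err := dialChrony(local, server)
    if err != nil {
        fmt.Println("first dial:", err)
        return
    }
    defer c1.Close()
    defer os.Remove(local)

    // A second dial that races the first and reuses the same client path
    // fails just like scrape_id=42: "bind: address already in use".
    if _, err := dialChrony(local, server); err != nil {
        fmt.Println("second dial:", err)
    }
}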

spaullilly commented 1 week ago

Important note:

/usr/bin/chrony_exporter --chrony.address="unix:///run/chrony/chronyd.sock" --log.level=debug --collector.serverstats --collector.chmod-socket

spaullilly commented 1 week ago

I also set the timeout to 30s and am still getting the same errors.

spaullilly commented 1 week ago

I figured out what was going on with the bind: address already in use issue, so no need to look further into that. A caching layer would still be helpful here, though: we have two scrapers running, and the second request fails with this error.
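
Something like this is what I have in mind for the caching layer (a hypothetical sketch, not exporter code; the names are made up): serialize scrapes behind a mutex and hand any scrape arriving within a short TTL the previous result, so only one request actually hits chronyd.

package main

import (
    "fmt"
    "sync"
    "time"
)

// cachedScraper is a hypothetical wrapper: concurrent scrapes are serialized
// by the mutex, and a scrape arriving within ttl of the last successful one
// is served the cached result instead of opening a second chronyd connection.
type cachedScraper struct {
    mu     sync.Mutex
    ttl    time.Duration
    last   time.Time
    cached string // stand-in for the real scrape result
    scrape func() (string, error)
}

func (c *cachedScraper) Get() (string, error) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.cached != "" && time.Since(c.last) < c.ttl {
        return c.cached, nil // the second scraper gets this, no second socket
    }
    out, err := c.scrape()
    if err != nil {
        return "", err
    }
    c.cached, c.last = out, time.Now()
    return out, nil
}

func main() {
    cs := &cachedScraper{
        ttl:    10 * time.Second,
        scrape: func() (string, error) { return "tracking+serverstats", nil },
    }
    // Two back-to-back scrapes: only the first one talks to "chronyd".
    fmt.Println(cs.Get())
    fmt.Println(cs.Get())
}

Since the two scrapes land at effectively the same timestamp in the log above, even a TTL of a few seconds would absorb the second one.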

But the i/o timeout is happening randomly (and more persistently). Some calls complete within a second, others get tracking and then time out fetching serverstats, and some simply fail to get both within the timeout. I have the timeout set to 30s (previously 5s), and when the scrape finally completes after that duration it shows this error.

Maybe this is an issue with chronyd resources?
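
For what it's worth, the i/o timeout itself just means nothing came back on the datagram socket before the read deadline expired. A minimal sketch of where that error comes from (placeholder paths, not exporter code; no request is sent here, so the read is guaranteed to time out):

package main

import (
    "errors"
    "fmt"
    "net"
    "os"
    "time"
)

func main() {
    local := "/run/chrony/chrony_exporter_demo.sock" // placeholder client path
    server := "/run/chrony/chronyd.sock"
    defer os.Remove(local)

    conn, err := net.DialUnix("unixgram",
        &net.UnixAddr{Name: local, Net: "unixgram"},
        &net.UnixAddr{Name: server, Net: "unixgram"})
    if err != nil {
        fmt.Println("dial:", err)
        return
    }
    defer conn.Close()

    // No request is written here, so chronyd never replies; the read blocks
    // until the deadline and fails with "... i/o timeout", the same error
    // the serverstats request hits when chronyd is slow to answer.
    conn.SetReadDeadline(time.Now().Add(5 * time.Second))
    buf := make([]byte, 1024)
    if _, err := conn.Read(buf); err != nil {
        var ne net.Error
        if errors.As(err, &ne) && ne.Timeout() {
            fmt.Println("no reply before deadline:", err)
        }
    }
}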

SuperQ commented 1 week ago

I guess that's just how the unix socket works: you can only make one connection at a time.

We could change the unix socket code to make one shared connection.
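
Roughly something like this, just to sketch the idea (not a patch, names are made up): dial the unixgram socket once at startup and guard each command/response round trip with a mutex, so concurrent scrapes share the single connection and never try to bind a second client path.

package chronyconn

import (
    "net"
    "sync"
    "time"
)

// sharedConn is a sketch of a single shared unixgram connection to chronyd.
// All scrapes reuse it, so only one client socket path is ever bound.
type sharedConn struct {
    mu   sync.Mutex
    conn *net.UnixConn
}

func dialShared(localPath, serverPath string) (*sharedConn, error) {
    conn, err := net.DialUnix("unixgram",
        &net.UnixAddr{Name: localPath, Net: "unixgram"},
        &net.UnixAddr{Name: serverPath, Net: "unixgram"})
    if err != nil {
        return nil, err
    }
    return &sharedConn{conn: conn}, nil
}

// roundTrip sends one command packet and reads one reply. The mutex keeps two
// concurrent scrapes from interleaving their packets on the shared socket.
func (s *sharedConn) roundTrip(req []byte, timeout time.Duration) ([]byte, error) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if err := s.conn.SetDeadline(time.Now().Add(timeout)); err != nil {
        return nil, err
    }
    if _, err := s.conn.Write(req); err != nil {
        return nil, err
    }
    buf := make([]byte, 4096)
    n, err := s.conn.Read(buf)
    if err != nil {
        return nil, err
    }
    return buf[:n], nil
}

That would also take care of the "address already in use" case with two scrapers, since only one client socket is ever bound.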