czerwonk / ping_exporter

Prometheus exporter for ICMP echo requests using https://github.com/digineo/go-ping
MIT License

Updating to v0.4.8 Caused Loss of Connectivity #63

Closed mari-arondeus closed 2 years ago

mari-arondeus commented 2 years ago

Hello, We're using ping_exporter in a limited capacity to test uptime to certain equipment. The software runs in a container through Docker Swarm Mode and, after the update from v0.4.7 to v0.4.8, connectivity was lost. ping_exporter did not provide metrics for the devices it was pinging, but it did continue to provide its own status. From Grafana, through Prometheus, it looked like this:

[Screenshot: "2021-12-10 Ping Exporter" Grafana panel]

The yellow and green show the ping_up metric. Green is v0.4.7 and yellow is v0.4.8. As you can see, for the time it ran on v0.4.8, the other metrics were not provided to Prometheus. The instant we swapped back to v0.4.7, these metrics began logging again. No configuration change was made between updates; the only thing that changed was the version number in the compose file. No new log lines were generated in ping_exporter's logs beyond the usual startup output.

Here's our running config:

# ping_exporter/config.yml
targets:
  - "[REDACTED IP]"
  - "[REDACTED IP]"
  - "[REDACTED IP]"
  - "[REDACTED IP]"
dns:
  refresh: 30m
  nameserver: [REDACTED IP]
ping:
  interval: 5s
  timeout: 2s
  history-size: 100
  payload-size: 10

This is obviously non-urgent as we're fine with running v0.4.7 for the time being. Please let me know if we can provide any additional information to help you troubleshoot, and thank you for making such a handy tool.

soapiestwaffles commented 2 years ago

@cassidy3 I've been running 0.4.8 and can't reproduce this. Can you paste any additional command line parameters?

Hmm, if it wasn't generating logs, it sounds like it wasn't actually starting? Have you tried building from source/master?

Oh, and can you paste your yaml manifest file for deploying it to docker swarm?

mari-arondeus commented 2 years ago

Absolutely. It was starting, mind you, but it was just generating the same logs it usually does (letting us know it's running on v0.4.8). No errors, warnings, etc. I'll be onsite tomorrow, so I'm happy to attempt the upgrade again in the morning and provide the output.

We don't use any additional command line parameters, so it should be using the image default (/bin/sh -c ./ping_exporter --config.path $CONFIG_FILE). Also, it's in Docker Swarm Mode, so it should have several network interfaces - local, ingress, plus the two user-provided networks (backend - isolated but contains ping targets, database - entirely isolated and can only link between prometheus and ping_exporter). It works great with this config on v0.4.7.

Anyway, here are our config files:

# ping_exporter/config.yml
targets:
  - "[REDACTED IP]"
  - "[REDACTED IP]"
  - "[REDACTED IP]"
  - "[REDACTED IP]"
dns:
  refresh: 30m
  nameserver: [REDACTED IP]
ping:
  interval: 5s
  timeout: 2s
  history-size: 100
  payload-size: 10

# prometheus/config.yml
scrape_configs:
- job_name: 'prometheus'
  static_configs:
  - targets: ['localhost:9090']
- job_name: 'uptime'
  static_configs:
  - targets: ['uptime:9427']
...more scrape targets...

# deploy/stack-monitoring.yml
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:v2.31.1
    hostname: prometheus
    networks:
      - database
      - backend
    volumes:
      - prometheus_conf:/etc/prometheus
      - prometheus_data:/prometheus
    command: '--storage.tsdb.retention.time=1y --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/prometheus --web.console.libraries=/usr/share/prometheus/console_libraries --web.console.templates=/usr/share/prometheus/consoles'
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.platform.os == linux
      restart_policy:
        condition: on-failure

  uptime:
    image: czerwonk/ping_exporter:v0.4.8
    hostname: uptime
    networks:
      - backend
      - database
    volumes:
      - prometheus_conf:/config:ro
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.platform.os == linux
      restart_policy:
        condition: on-failure
networks:
  backend:
    external: true
  database:
    external: true

volumes:
  prometheus_conf:
    external: true
  prometheus_data:
    external: true

soapiestwaffles commented 2 years ago

@cassidy3 Hmmm, okay! I'll see if I can reproduce locally!

soapiestwaffles commented 2 years ago

@cassidy3 So I set up a test environment very similar to yours, used your config files, etc., and everything was working. I had a long write-up showing the outputs of everything and how it all worked fine, but then I looked back at your metrics graph screenshot and realized why yours wasn't working: you are using ping_rtt_seconds_ms{...} for your query!

I believe that metric was probably deprecated in this newer version (I do seem to recall seeing a commit about that). If we look at the /metrics output from my test setup of ping_exporter using your configs, we can see that ping_rtt_seconds_ms is no longer returned.

If you update your query and dashboards to use ping_rtt_ms instead, you should be good to go for using version >= 0.4.8!
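One quick way to confirm whether a queried metric name is actually exposed is to collect the names declared on the `# TYPE` lines of the /metrics output and test the query name against them. The sketch below is only an illustration: the sample text is a handful of lines copied from the v0.4.8 output in this thread, and `metric_names` is a made-up helper name, not part of ping_exporter or Prometheus.

```python
import re

# A few lines copied from the v0.4.8 /metrics output in this thread.
SAMPLE = """\
# HELP ping_loss_percent Packet loss in percent
# TYPE ping_loss_percent gauge
ping_loss_percent{ip="172.217.5.110",ip_version="4",target="google.com"} 0
# HELP ping_rtt_ms Round trip time in millis (deprecated)
# TYPE ping_rtt_ms gauge
ping_rtt_ms{ip="172.217.5.110",ip_version="4",target="google.com",type="mean"} 1.756668210029602
# HELP ping_up ping_exporter version
# TYPE ping_up gauge
ping_up{version="0.4.8"} 1
"""

# Matches lines such as "# TYPE ping_rtt_ms gauge" and captures the name.
TYPE_LINE = re.compile(r"^# TYPE (\S+) \S+$")


def metric_names(exposition_text):
    """Return the set of metric names declared by '# TYPE' lines."""
    names = set()
    for line in exposition_text.splitlines():
        match = TYPE_LINE.match(line)
        if match:
            names.add(match.group(1))
    return names


names = metric_names(SAMPLE)
for queried in ("ping_rtt_ms", "ping_rtt_seconds_ms"):
    status = "exposed" if queried in names else "NOT exposed"
    print(f"{queried}: {status}")
```

Running the same check against the full output of each exporter version before upgrading would show which dashboard queries need to change.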


/metrics output

# HELP ping_loss_percent Packet loss in percent
# TYPE ping_loss_percent gauge
ping_loss_percent{ip="142.250.191.78",ip_version="4",target="youtube.com"} 0
ping_loss_percent{ip="151.101.130.219",ip_version="4",target="speedtest.net"} 0
ping_loss_percent{ip="151.101.194.219",ip_version="4",target="speedtest.net"} 0
ping_loss_percent{ip="151.101.2.219",ip_version="4",target="speedtest.net"} 0
ping_loss_percent{ip="151.101.66.219",ip_version="4",target="speedtest.net"} 0
ping_loss_percent{ip="172.217.5.110",ip_version="4",target="google.com"} 0
ping_loss_percent{ip="2607:f8b0:4005:80c::200e",ip_version="6",target="google.com"} 1
ping_loss_percent{ip="2607:f8b0:4005:813::200e",ip_version="6",target="youtube.com"} 1
ping_loss_percent{ip="2a04:4e42:200::731",ip_version="6",target="speedtest.net"} 1
ping_loss_percent{ip="2a04:4e42:400::731",ip_version="6",target="speedtest.net"} 1
ping_loss_percent{ip="2a04:4e42:600::731",ip_version="6",target="speedtest.net"} 1
ping_loss_percent{ip="2a04:4e42::731",ip_version="6",target="speedtest.net"} 1
# HELP ping_rtt_best_ms Best round trip time in millis (deprecated)
# TYPE ping_rtt_best_ms gauge
ping_rtt_best_ms{ip="142.250.191.78",ip_version="4",target="youtube.com"} 1.6027560234069824
ping_rtt_best_ms{ip="151.101.130.219",ip_version="4",target="speedtest.net"} 1.6907709836959839
ping_rtt_best_ms{ip="151.101.194.219",ip_version="4",target="speedtest.net"} 0.9022539854049683
ping_rtt_best_ms{ip="151.101.2.219",ip_version="4",target="speedtest.net"} 0.8461319804191589
ping_rtt_best_ms{ip="151.101.66.219",ip_version="4",target="speedtest.net"} 0.8880199790000916
ping_rtt_best_ms{ip="172.217.5.110",ip_version="4",target="google.com"} 1.6140429973602295
# HELP ping_rtt_mean_ms Mean round trip time in millis (deprecated)
# TYPE ping_rtt_mean_ms gauge
ping_rtt_mean_ms{ip="142.250.191.78",ip_version="4",target="youtube.com"} 1.8511054515838623
ping_rtt_mean_ms{ip="151.101.130.219",ip_version="4",target="speedtest.net"} 1.9293391704559326
ping_rtt_mean_ms{ip="151.101.194.219",ip_version="4",target="speedtest.net"} 1.7410287857055664
ping_rtt_mean_ms{ip="151.101.2.219",ip_version="4",target="speedtest.net"} 2.418238878250122
ping_rtt_mean_ms{ip="151.101.66.219",ip_version="4",target="speedtest.net"} 9.6593599319458
ping_rtt_mean_ms{ip="172.217.5.110",ip_version="4",target="google.com"} 1.756668210029602
# HELP ping_rtt_ms Round trip time in millis (deprecated)
# TYPE ping_rtt_ms gauge
ping_rtt_ms{ip="142.250.191.78",ip_version="4",target="youtube.com",type="best"} 1.6027560234069824
ping_rtt_ms{ip="142.250.191.78",ip_version="4",target="youtube.com",type="mean"} 1.8511054515838623
ping_rtt_ms{ip="142.250.191.78",ip_version="4",target="youtube.com",type="std_dev"} 0.40969622135162354
ping_rtt_ms{ip="142.250.191.78",ip_version="4",target="youtube.com",type="worst"} 5.528079986572266
ping_rtt_ms{ip="151.101.130.219",ip_version="4",target="speedtest.net",type="best"} 1.6907709836959839
ping_rtt_ms{ip="151.101.130.219",ip_version="4",target="speedtest.net",type="mean"} 1.9293391704559326
ping_rtt_ms{ip="151.101.130.219",ip_version="4",target="speedtest.net",type="std_dev"} 0.46504613757133484
ping_rtt_ms{ip="151.101.130.219",ip_version="4",target="speedtest.net",type="worst"} 5.709164142608643
ping_rtt_ms{ip="151.101.194.219",ip_version="4",target="speedtest.net",type="best"} 0.9022539854049683
ping_rtt_ms{ip="151.101.194.219",ip_version="4",target="speedtest.net",type="mean"} 1.7410287857055664
ping_rtt_ms{ip="151.101.194.219",ip_version="4",target="speedtest.net",type="std_dev"} 4.0132222175598145
ping_rtt_ms{ip="151.101.194.219",ip_version="4",target="speedtest.net",type="worst"} 33.03122329711914
ping_rtt_ms{ip="151.101.2.219",ip_version="4",target="speedtest.net",type="best"} 0.8461319804191589
ping_rtt_ms{ip="151.101.2.219",ip_version="4",target="speedtest.net",type="mean"} 2.418238878250122
ping_rtt_ms{ip="151.101.2.219",ip_version="4",target="speedtest.net",type="std_dev"} 6.5633063316345215
ping_rtt_ms{ip="151.101.2.219",ip_version="4",target="speedtest.net",type="worst"} 45.34522247314453
ping_rtt_ms{ip="151.101.66.219",ip_version="4",target="speedtest.net",type="best"} 0.8880199790000916
ping_rtt_ms{ip="151.101.66.219",ip_version="4",target="speedtest.net",type="mean"} 9.6593599319458
ping_rtt_ms{ip="151.101.66.219",ip_version="4",target="speedtest.net",type="std_dev"} 54.366641998291016
ping_rtt_ms{ip="151.101.66.219",ip_version="4",target="speedtest.net",type="worst"} 475.7384948730469
ping_rtt_ms{ip="172.217.5.110",ip_version="4",target="google.com",type="best"} 1.6140429973602295
ping_rtt_ms{ip="172.217.5.110",ip_version="4",target="google.com",type="mean"} 1.756668210029602
ping_rtt_ms{ip="172.217.5.110",ip_version="4",target="google.com",type="std_dev"} 0.08481151610612869
ping_rtt_ms{ip="172.217.5.110",ip_version="4",target="google.com",type="worst"} 2.3685851097106934
# HELP ping_rtt_std_deviation_ms Standard deviation in millis (deprecated)
# TYPE ping_rtt_std_deviation_ms gauge
ping_rtt_std_deviation_ms{ip="142.250.191.78",ip_version="4",target="youtube.com"} 0.40969622135162354
ping_rtt_std_deviation_ms{ip="151.101.130.219",ip_version="4",target="speedtest.net"} 0.46504613757133484
ping_rtt_std_deviation_ms{ip="151.101.194.219",ip_version="4",target="speedtest.net"} 4.0132222175598145
ping_rtt_std_deviation_ms{ip="151.101.2.219",ip_version="4",target="speedtest.net"} 6.5633063316345215
ping_rtt_std_deviation_ms{ip="151.101.66.219",ip_version="4",target="speedtest.net"} 54.366641998291016
ping_rtt_std_deviation_ms{ip="172.217.5.110",ip_version="4",target="google.com"} 0.08481151610612869
# HELP ping_rtt_worst_ms Worst round trip time in millis (deprecated)
# TYPE ping_rtt_worst_ms gauge
ping_rtt_worst_ms{ip="142.250.191.78",ip_version="4",target="youtube.com"} 5.528079986572266
ping_rtt_worst_ms{ip="151.101.130.219",ip_version="4",target="speedtest.net"} 5.709164142608643
ping_rtt_worst_ms{ip="151.101.194.219",ip_version="4",target="speedtest.net"} 33.03122329711914
ping_rtt_worst_ms{ip="151.101.2.219",ip_version="4",target="speedtest.net"} 45.34522247314453
ping_rtt_worst_ms{ip="151.101.66.219",ip_version="4",target="speedtest.net"} 475.7384948730469
ping_rtt_worst_ms{ip="172.217.5.110",ip_version="4",target="google.com"} 2.3685851097106934
# HELP ping_up ping_exporter version
# TYPE ping_up gauge
ping_up{version="0.4.8"} 1
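In the output above, ping_rtt_ms carries the best, mean, std_dev and worst values on a single metric, distinguished by the `type` label. The following sketch shows one way to group those samples per IP address. It assumes simple label values without escaped quotes, uses four lines copied from the output above as input, and `rtt_by_type` is a hypothetical helper name chosen for this example.

```python
import re
from collections import defaultdict

# Four samples copied from the /metrics output above.
SAMPLE = """\
ping_rtt_ms{ip="172.217.5.110",ip_version="4",target="google.com",type="best"} 1.6140429973602295
ping_rtt_ms{ip="172.217.5.110",ip_version="4",target="google.com",type="mean"} 1.756668210029602
ping_rtt_ms{ip="172.217.5.110",ip_version="4",target="google.com",type="std_dev"} 0.08481151610612869
ping_rtt_ms{ip="172.217.5.110",ip_version="4",target="google.com",type="worst"} 2.3685851097106934
"""

# name{labels} value  (comment lines starting with '#' do not match)
SAMPLE_LINE = re.compile(r"^(\w+)\{(.*)\}\s+(\S+)$")
# key="value" pairs; assumes values contain no escaped quotes
LABEL = re.compile(r'(\w+)="([^"]*)"')


def rtt_by_type(text, metric="ping_rtt_ms"):
    """Group samples of `metric` as {ip: {type: value_in_ms}}."""
    result = defaultdict(dict)
    for line in text.splitlines():
        match = SAMPLE_LINE.match(line)
        if not match or match.group(1) != metric:
            continue
        labels = dict(LABEL.findall(match.group(2)))
        result[labels["ip"]][labels["type"]] = float(match.group(3))
    return dict(result)


for ip, values in rtt_by_type(SAMPLE).items():
    print(ip, values)
```

A dashboard query would select one series the same way, by filtering on the `type` label.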

Prometheus output using ping_rtt_ms

[Screenshot: prom_ping_rtt_ms]

mari-arondeus commented 2 years ago

Aargh, how did I miss that! I think I checked that metric and got impatient before it completed its first ping. Anywho, thanks so much for taking the time to help me troubleshoot. Everything appears to be working great now.