digineo / go-ping

A simple ping library using ICMP echo requests.
MIT License
234 stars 32 forks source link

Monitor, not pinging hosts, unstable results #17

Open STiAT opened 3 years ago

STiAT commented 3 years ago

Hi,

I've used your example code to add 593 of our hosts to the monitor tool, since it basically should do what I need (I need a statistics every 60 seconds which I would want to log away).

The monitor reports the following: 2020/12/03 14:14:17 hanco: {PacketsSent:10 PacketsLost:10 Best:0 Worst:0 Median:NaN Mean:NaN StdDev:NaN}

Which is certainly the host and it can be pinged: image

It is certainly true that several of the hosts are currently "down" because behind Firewalls and I'm not testing in the same zone, but I'd expect it to report the loss at the "right" hosts ... and all 593 instead of just 255 ofc ;-).

Current code, which pretty much reflects the example with some little changes:

package slaping

import (
    "net"
    "os"
    "time"
    "log"
    "os/signal"
    "syscall"

    "github.com/digineo/go-ping"
    "github.com/digineo/go-ping/monitor"
)

var (
    pingInterval        = 5 * time.Second
    pingTimeout         = 4 * time.Second
    reportInterval      = 60 * time.Second
    size           uint = 56
    pinger         *ping.Pinger
    targets        []string
)

func (p *SLAPing) PingHosts() {
    // setup, we need to populate cache once before we start to actually have host and maint info
    CMDBHosts, err := p.CMDB.GetSLAHosts()

    if err != nil {
        log.Println(err)
    }

    // Bind to sockets
    if pi, err := ping.New("0.0.0.0", "::"); err != nil {
        log.Printf("Unable to bind: %s\nRunning as root?\n", err)
        os.Exit(2)
    } else {
        pinger = pi
    }
    pinger.SetPayloadSize(uint16(size))
    defer pinger.Close()

    // Create monitor
    monitor := monitor.New(pinger, pingInterval, pingTimeout)
    defer monitor.Stop()

    // Add targets
    targets := CMDBHosts
    validtargets := 0
    for i, target := range CMDBHosts {
        ipAddr, err := net.ResolveIPAddr("", target)
        if err != nil {
            // currently irrelevant
            continue
        }

        err = monitor.AddTarget(string([]byte{byte(i)}), *ipAddr)
        //err = monitor.AddTargetDelayed(string([]byte{byte(i)}), *ipAddr, 10*time.Millisecond*time.Duration(i))

        if err != nil {
            // currently irrelevant
            continue
        }
        validtargets = validtargets + 1
    }

    log.Printf("Total targets: %d, Valid targets: %d, Errors: %d\n", len(targets), validtargets, len(targets)-validtargets)

    // Start report routine
    ticker := time.NewTicker(reportInterval)
    defer ticker.Stop()
    go func() {
        for range ticker.C {
            for i, metrics := range monitor.ExportAndClear() {
                log.Printf("%s: %+v\n", targets[[]byte(i)[0]], *metrics)
            }
        }
    }()

    // Handle SIGINT and SIGTERM.
    // in other words we wait here until interrupted and let the ticker tick
    ch := make(chan os.Signal, 1)
    signal.Notify(ch, syscall.SIGINT, syscall.SIGTERM)
    log.Println("received", <-ch)
}

[edit] My initial thought was that it may be connected to the ICMP pings getting blocked by firewalls and threads are getting locked up due to the timeouts not properly working there, since it's an uncommon case and may not be that well tested. Now, due to the result "shuffle" I'm not sure but I have systems in the results which actually are blocked by firewalls, going through the slice sequentially that could explain why it's always the same number, but I didn't see any max threads number to get locked in the library, is there a limit where the channel / parallel work of the library blocking?

[edit2] I could test it now on a host which has the necessary firewall permissions to actually run this without getting blocked, the results are still the same. Something is wrong here, just can't make out what.

STiAT commented 3 years ago

Any idea on this issue?

corny commented 3 years ago

Not really, but please try the rewrite-monitor branch.

STiAT commented 3 years ago

Hey there, thanks for the answer.

Rewrote the code for now,, but it seems to "stall" overall. The issue is I can't find a way to set a timeout, and having 300/600 hosts behind a firewall it probably will wait forever.

In the monitor, AddTarget/AddTargetDelayed seeems to use p.timeout, but I don't see a way to set it, maybe I just overlooked that and it's there ( I really couldn't see it in that branch, if it's there sorry). But to test that I need a way to set the timeout.