amonapp / amonagent

Single binary agent for Linux
https://amon.cx
MIT License

Amon Agent Stops Sending Data #17

Closed jyksnw closed 7 years ago

jyksnw commented 7 years ago

I believe this might be similar to the issue/request brought up in #9.

We have 2 servers that randomly stop sending data (ironically, I restart them around the same time and they seem to stop sending data around the same time). When I check amonagent.log, I can see that no new entries are logged from around the time they stop sending data.

I am still in the process of trying to debug/troubleshoot what might be causing this. None of our other servers have this issue, which leads me to think that either a) some other process on these servers is interrupting amonagent or b) something about how these servers are configured (at the OS level) is causing the problem.

Some facts:

I am going to try updating them to the latest amonagent today to see if that helps resolve the issue.

I will gladly update this issue with any additional findings or potential patches if I am able to track this down.

martinrusev commented 7 years ago

@jyksnw Can you check the log file? I think the default level is set to INFO, which logs every request made plus a timestamp. It may be easier to debug if we know when the agent stopped sending data.

jyksnw commented 7 years ago

@martinrusev

Here are the lines of the log from one of the servers (I have obfuscated the URI and API key).

time="2017-03-21T17:44:05-04:00" level=info msg="Metrics collected (Interval:1m0s)\n"
time="2017-03-21T17:44:05-04:00" level=info msg="Sending data to https://server1.example.com/api/system/v2/?api_key=XXXXXXXXXX\n"
time="2017-03-22T06:25:11-04:00" level=info msg="Starting Amon Agent (Version: 0.7.2)\n"
time="2017-03-22T06:25:11-04:00" level=info msg="Agent Config: Interval:1m0s\n"
time="2017-03-22T06:25:19-04:00" level=info msg="Metrics collected (Interval:1m0s)\n"
time="2017-03-22T06:25:20-04:00" level=info msg="Sending data to https://server1.example.com/api/system/v2/?api_key=XXXXXXXXXX\n"

jyksnw commented 7 years ago

I think I see the issue. It appears to be a two part issue:

  1. After searching the logs, I found that named had logged a number of errors indicating that it couldn't resolve the hostname of our amon server.
  2. Go's http.Client defaults to a timeout of 0, which means no timeout. It appears that in the scenario above the call to SendData never returns, because the client is stuck trying to complete a connection to a hostname that it can't resolve or reach.

Luckily this is easy to fix by creating the http.Client with a specified timeout. I can add this in without any issue but wanted to know if the timeout should be a configuration option or a statically set value (say 10 seconds).

I am going to create a local build to test this theory out, but after reading through our logs and looking into the http.Client request handling I am highly confident this was the cause of this issue.

jyksnw commented 7 years ago

Sorry, I didn't look into how the transport was constructed before commenting. It looks like a 10 second timeout is already being set via the transport.

martinrusev commented 7 years ago

@jyksnw It could be a goroutine leak somewhere, although I do check for data races before releasing. One way to determine if that is the case is to monitor the memory usage.
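As a rough illustration of that kind of monitoring, the sketch below samples the goroutine count and heap usage from inside a Go process using the standard `runtime` package. The `snapshot` type and `takeSnapshot` helper are hypothetical names, and the "leak" is simulated; this is not how amonagent itself reports anything.

```go
package main

import (
	"fmt"
	"runtime"
)

// snapshot holds the two numbers that tend to grow during a goroutine leak.
type snapshot struct {
	Goroutines int
	HeapAlloc  uint64 // bytes of allocated heap objects
}

// takeSnapshot reads the current goroutine count and heap usage.
func takeSnapshot() snapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return snapshot{Goroutines: runtime.NumGoroutine(), HeapAlloc: m.HeapAlloc}
}

func main() {
	before := takeSnapshot()

	// Simulated leak: goroutines blocked forever on a channel nobody writes to.
	block := make(chan struct{})
	for i := 0; i < 100; i++ {
		go func() { <-block }()
	}

	after := takeSnapshot()
	fmt.Printf("goroutines: %d -> %d\n", before.Goroutines, after.Goroutines)
	fmt.Printf("heap bytes: %d -> %d\n", before.HeapAlloc, after.HeapAlloc)
}
```

A goroutine count that keeps climbing between samples over hours would point toward a leak; a flat count would suggest looking elsewhere.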

I think what makes this one difficult to catch is that parts of this bug seem to be hardware / distro related. I personally have 5 agents that have been running since last August.

jyksnw commented 7 years ago

We have 3 other servers running with similar hardware/distro configuration that we haven't seen any issues on.

CloudFlare has an excellent writeup and graph outlining Go's client connection sequence and where each of the various timeout settings comes into play.

Source - CloudFlare: The complete guide to Go net/http timeouts

So although a timeout is being set for ResponseHeaderTimeout, the request might not have reached that point and could still be stuck in an earlier phase. There is a suggested Transport structure setup in the write-up that could be implemented. I will create a build for just these two servers with the suggested Transport setup and see if the issue presents itself again.

martinrusev commented 7 years ago

@jyksnw Thanks for sharing the guide. Yes, this could be the issue - the amonagent does not have a cancel request policy, just a timeout.

jyksnw commented 7 years ago

I have a local branch that implements more fine-grained timeouts and adds a cancel request policy that currently cancels the request after a 10 second delay. I will test this against the two servers we have been having issues with to see whether it solves the problem and whether it introduces any other potential issues.

martinrusev commented 7 years ago

@jyksnw Cool. If it works, you can submit it as a pull request and I will merge it and push a new release of the agent with the fix.