datalust / seq-forwarder

Local collection and reliable forwarding of log data to Seq
Apache License 2.0

Logs Not Being Forwarded For Some Applications #37

Closed: danjohnso closed this issue 6 years ago

danjohnso commented 6 years ago

Sorry to put this out here, but my work email is not letting me unblock your support email account right now.

Server 1: Apps; these log to the forwarder on Server 2
Server 2: Forwarder host; also apps that log directly to the forwarder on this same server
----------------Firewall (port-specific rule allows Seq traffic from Server 2 to Server 3)--------------
Server 3: Seq; also some apps that log directly to Seq

We are still seeing random applications whose logs the forwarder refuses to send, even after changing out the API key. Some applications have no trouble. All of our applications have API keys assigned and are using them, but we still get:

{"@t":"2017-12-28T18:06:59.4992343Z","@m":"Seq Forwarder listening on \"http://server2:15341\"","@i":"bc80c4c1","ListenUri":"http://server2:15341"}
{"@t":"2017-12-28T18:06:59.5148703Z","@m":"Loading an API-key specific buffer in \"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\4a0f7de710e7428b95229416784c33a9\"","@i":"6a905877","Path":"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\4a0f7de710e7428b95229416784c33a9","SourceContext":"Seq.Forwarder.Multiplexing.ActiveLogBufferMap"}
{"@t":"2017-12-28T18:06:59.5304805Z","@m":"Log buffer open on \"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\4a0f7de710e7428b95229416784c33a9\"; 120 entries, next key will be 121","@i":"505684b8","BufferPath":"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\4a0f7de710e7428b95229416784c33a9","Entries":120,"NextId":121}
{"@t":"2017-12-28T18:06:59.5461117Z","@m":"Loading an API-key specific buffer in \"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\9119b25d08b147458566edd73c118187\"","@i":"6a905877","Path":"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\9119b25d08b147458566edd73c118187","SourceContext":"Seq.Forwarder.Multiplexing.ActiveLogBufferMap"}
{"@t":"2017-12-28T18:06:59.5461117Z","@m":"Log buffer open on \"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\9119b25d08b147458566edd73c118187\"; 59 entries, next key will be 60","@i":"505684b8","BufferPath":"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\9119b25d08b147458566edd73c118187","Entries":59,"NextId":60}
{"@t":"2017-12-28T18:06:59.5461117Z","@m":"Loading an API-key specific buffer in \"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\b128d10f88734e81aad0b89721779cf5\"","@i":"6a905877","Path":"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\b128d10f88734e81aad0b89721779cf5","SourceContext":"Seq.Forwarder.Multiplexing.ActiveLogBufferMap"}
{"@t":"2017-12-28T18:06:59.5461117Z","@m":"Log buffer open on \"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\b128d10f88734e81aad0b89721779cf5\"; 174 entries, next key will be 175","@i":"505684b8","BufferPath":"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\b128d10f88734e81aad0b89721779cf5","Entries":174,"NextId":175}
{"@t":"2017-12-28T18:06:59.5461117Z","@m":"Loading an API-key specific buffer in \"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\ead51287febf4670a41275723beccbeb\"","@i":"6a905877","Path":"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\ead51287febf4670a41275723beccbeb","SourceContext":"Seq.Forwarder.Multiplexing.ActiveLogBufferMap"}
{"@t":"2017-12-28T18:06:59.5617318Z","@m":"Log buffer open on \"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\ead51287febf4670a41275723beccbeb\"; 1 entries, next key will be 2","@i":"505684b8","BufferPath":"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\ead51287febf4670a41275723beccbeb","Entries":1,"NextId":2}
{"@t":"2017-12-28T18:06:59.5617318Z","@m":"Log shipper started, events will be dispatched to \"http://server3:45003\"","@i":"8bd6415c","ServerUrl":"http://server3:45003"}
{"@t":"2017-12-28T18:06:59.5617318Z","@m":"Log shipper started, events will be dispatched to \"http://server3:45003\"","@i":"8bd6415c","ServerUrl":"http://server3:45003"}
{"@t":"2017-12-28T18:06:59.5617318Z","@m":"Log shipper started, events will be dispatched to \"http://server3:45003\"","@i":"8bd6415c","ServerUrl":"http://server3:45003"}
{"@t":"2017-12-28T18:06:59.5617318Z","@m":"Log shipper started, events will be dispatched to \"http://server3:45003\"","@i":"8bd6415c","ServerUrl":"http://server3:45003"}
{"@t":"2017-12-28T18:07:03.9060772Z","@m":"Received failed HTTP shipping result Forbidden: \"{\\\"Error\\\":\\\"An API key is required for writing events.\\\"}\"","@i":"d3c0dc9f","@l":"Error","StatusCode":"Forbidden","Result":"{\"Error\":\"An API key is required for writing events.\"}"}
{"@t":"2017-12-28T18:07:03.9216993Z","@m":"Received failed HTTP shipping result Forbidden: \"{\\\"Error\\\":\\\"An API key is required for writing events.\\\"}\"","@i":"d3c0dc9f","@l":"Error","StatusCode":"Forbidden","Result":"{\"Error\":\"An API key is required for writing events.\"}"}
{"@t":"2017-12-28T18:07:05.9536434Z","@m":"Received failed HTTP shipping result Forbidden: \"{\\\"Error\\\":\\\"An API key is required for writing events.\\\"}\"","@i":"d3c0dc9f","@l":"Error","StatusCode":"Forbidden","Result":"{\"Error\":\"An API key is required for writing events.\"}"}
{"@t":"2017-12-28T18:07:05.9536434Z","@m":"Received failed HTTP shipping result Forbidden: \"{\\\"Error\\\":\\\"An API key is required for writing events.\\\"}\"","@i":"d3c0dc9f","@l":"Error","StatusCode":"Forbidden","Result":"{\"Error\":\"An API key is required for writing events.\"}"}

I have tried removing the forwarder and all of the files (buffers) from the machine and reinstalling, but got the same results. For kicks I tried assigning the forwarder its own API key, but that does not seem to help.

Also, I wanted to throw out there that Seq Forwarder Administration should read in the existing settings if the forwarder is already configured; it is annoying to have to retype the info if I open it up again.

nblumhardt commented 6 years ago

Hi Dan - thanks for writing up the details; sorry I dropped the ball on this.

I inspected the source to see whether I could find any clues why this might be happening, but didn't have any success. I'll take another look now and see if I can come up with any hypotheses.

nblumhardt commented 6 years ago

I've run through the code again and can't find any convincing leads. Ruling out the cases I believe are excluded by invariants in the forwarder code, I think there are two that could cause this:

  1. The events are reaching Seq with the wrong API key

If this is the case, you'll see messages in Seq's Diagnostics > Ingestion log in pairs, like:

2017-12-29 06:12:47 Payload from "::1" specified invalid API key "74BqyqgAHgXQxXwhJPn2"
2017-12-29 06:12:47 Rejecting payload from "::1" due to missing or invalid API key

I think this situation would most likely be caused by a bug in the forwarder, e.g. around encrypting, storing, loading, and decrypting API keys.

  2. The events are reaching Seq without an API key at all

This would produce single events in the Seq ingestion log, like:

2017-12-29 06:12:47 Rejecting payload from "::1" due to missing or invalid API key

This could also be a bug in the forwarder, but I have traced through the code and think we'd see an exception earlier in the process if that were the case. Is there a chance that the firewall (if it's not a simple firewall) could be stripping the X-Seq-ApiKey header from requests?
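One way to probe the header-stripping theory is to post a single event through the same network path with the header set explicitly and then check what the Seq ingestion log reports. Below is a minimal sketch using Python's standard library; the server URL and key are placeholders, and `/api/events/raw` is Seq's raw ingestion endpoint:

```python
import json
import urllib.request

# Build a minimal raw-event payload for Seq's ingestion endpoint.
body = json.dumps({"Events": [{
    "Timestamp": "2017-12-29T00:00:00Z",
    "MessageTemplate": "X-Seq-ApiKey header check",
}]}).encode("utf-8")

req = urllib.request.Request(
    "http://server3:45003/api/events/raw",  # same host/port the forwarder targets
    data=body,
    headers={
        "X-Seq-ApiKey": "YOUR-API-KEY",  # placeholder for a real key
        "Content-Type": "application/json",
    },
)

# Uncomment to actually send: a 201 means the key survived the path intact,
# while a 403 plus an ingestion-log entry shows what (if anything) arrived.
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

If the request succeeds from Server 2's shell but the forwarder's batches are rejected, the header itself is probably surviving and the problem lies elsewhere.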

Best regards, Nick

danjohnso commented 6 years ago

No worries, I was off for a while here and was kicking this down the road until the overlords figure out how an email filter works...

On the ingestion log, I do see:

2017-12-28 13:08:09 Rejecting payload from "server2" due to missing or invalid API key
2017-12-28 13:09:30 Payload from "server2" specified invalid API key "AtGHinrvVgBctn8CzgVg"
2017-12-28 13:09:30 Rejecting payload from "server2" due to missing or invalid API key
2017-12-28 13:09:30 Payload from "server2" specified invalid API key "Cze9HNt8LbMCUS3Sd0T"
2017-12-28 13:09:30 Rejecting payload from "server2" due to missing or invalid API key

Based on that, I would guess an encryption/decryption issue. I can't find those key values in Seq or in any of our application configurations. Since several applications are getting through, I doubt the firewall is stripping the header; I will check to be sure, though, when the firewall team gets back next year.

nblumhardt commented 6 years ago

Thanks for the reply. Would it be possible to list the full set of API key tokens you've generated, so we can compare them against the ones being rejected?

This would help track down e.g. any truncation issues; the second API key above is only 19 characters, which is one character short of what Seq will normally generate.

Thanks!

danjohnso commented 6 years ago

Here are all the keys generated:

pxPOVhJUJ02DlekHcvdI KLpSXVORSaEO6MNKjan7 vahkFzU4g3bQyi02sMD1 qlhjeyWPRvapqCnO9rwo HmRV9hIQTaR1Cs3Ocq7 uryg55W8gBqa6Pwd82 37kTOd0zS0mnXnxDKnBW X2RLyxNVI4HEDyvhBB fAtquOqNSagUwGI2tgV kcCJFEpum5DfZlHRD 0Zs6YhuZGwnASs1FDL 9KQVGcQp2D3BGpxiqyJW QcmDoodcJ9mCaYrAs eaeuNCtUadJzaVjkWaSz vzviDUx7JY7hF9XMpuu2 EIfm0ZuoEPx83JRWOH
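As a quick cross-check against the truncation theory (Seq normally generates 20-character tokens), the list above can be scanned for short entries. This is just an illustrative sketch over the keys as posted, not part of the forwarder:

```python
# API key tokens as posted above; Seq normally generates 20-character tokens,
# so any shorter entry here is worth double-checking at its source.
keys = (
    "pxPOVhJUJ02DlekHcvdI KLpSXVORSaEO6MNKjan7 vahkFzU4g3bQyi02sMD1 "
    "qlhjeyWPRvapqCnO9rwo HmRV9hIQTaR1Cs3Ocq7 uryg55W8gBqa6Pwd82 "
    "37kTOd0zS0mnXnxDKnBW X2RLyxNVI4HEDyvhBB fAtquOqNSagUwGI2tgV "
    "kcCJFEpum5DfZlHRD 0Zs6YhuZGwnASs1FDL 9KQVGcQp2D3BGpxiqyJW "
    "QcmDoodcJ9mCaYrAs eaeuNCtUadJzaVjkWaSz vzviDUx7JY7hF9XMpuu2 EIfm0ZuoEPx83JRWOH"
).split()

# Map each suspiciously short token to its length.
short = {k: len(k) for k in keys if len(k) != 20}
print(short)
```

Notably, the key Dan calls out as broken (fAtquOqNSagUwGI2tgV) is one of the 19-character entries this check flags.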

I know for sure this application key is not working: fAtquOqNSagUwGI2tgV. I'm not sure right now which other one is not logging; let me try to find out.

nblumhardt commented 6 years ago

Hi Dan - any luck tracking down the other key? It'd be great to know whether the problematic ones correlate in any way with the ones being reported in the ingestion log.

danjohnso commented 6 years ago

It's not entirely correlated. I also know this key is not sending, but nothing around the right time shows up in the ingestion log: qlhjeyWPRvapqCnO9rwo

We have added file loggers to everything as a secondary log, but I haven't been able to narrow down what that second key is in the ingestion log.

I'm not seeing any access-denied errors on the shipper today, just TaskCanceledExceptions:

{"@t":"2018-01-08T17:14:49.2359274Z","@m":"Responding with Created","@i":"becc26e7","@l":"Debug","StatusCode":"Created"}
{"@t":"2018-01-08T17:15:06.1580844Z","@m":"Creating a new API key-specific log buffer in \"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\f9627d91321f4aca8aaa2b6b50aaecec\"","@i":"c3201009","Path":"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\f9627d91321f4aca8aaa2b6b50aaecec","SourceContext":"Seq.Forwarder.Multiplexing.ActiveLogBufferMap"}
{"@t":"2018-01-08T17:15:06.1580844Z","@m":"Log buffer open on \"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\f9627d91321f4aca8aaa2b6b50aaecec\"; 0 entries, next key will be 1","@i":"505684b8","BufferPath":"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\f9627d91321f4aca8aaa2b6b50aaecec","Entries":0,"NextId":1}
{"@t":"2018-01-08T17:15:06.1580844Z","@m":"Log shipper started, events will be dispatched to \"server3\"","@i":"8bd6415c","ServerUrl":"server3"}
{"@t":"2018-01-08T17:15:06.1580844Z","@m":"Responding with Created","@i":"becc26e7","@l":"Debug","StatusCode":"Created"}
{"@t":"2018-01-08T17:15:06.3611850Z","@m":"Exception while sending a batch from the log shipper","@i":"3d7bf92e","@l":"Error","@x":"System.Threading.Tasks.TaskCanceledException: A task was canceled.\r\n   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Seq.Forwarder.Shipper.HttpLogShipper.<OnTickAsync>d__20.MoveNext()"}
{"@t":"2018-01-08T17:15:08.3299708Z","@m":"Responding with Created","@i":"becc26e7","@l":"Debug","StatusCode":"Created"}
{"@t":"2018-01-08T17:15:10.3456084Z","@m":"Responding with Created","@i":"becc26e7","@l":"Debug","StatusCode":"Created"}
{"@t":"2018-01-08T17:15:10.8925135Z","@m":"Responding with Created","@i":"becc26e7","@l":"Debug","StatusCode":"Created"}
{"@t":"2018-01-08T17:15:12.3612873Z","@m":"Responding with Created","@i":"becc26e7","@l":"Debug","StatusCode":"Created"}
{"@t":"2018-01-08T17:15:14.6425644Z","@m":"Exception while sending a batch from the log shipper","@i":"3d7bf92e","@l":"Error","@x":"System.Threading.Tasks.TaskCanceledException: A task was canceled.\r\n   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Seq.Forwarder.Shipper.HttpLogShipper.<OnTickAsync>d__20.MoveNext()"}
{"@t":"2018-01-08T17:15:23.5489544Z","@m":"Exception while sending a batch from the log shipper","@i":"3d7bf92e","@l":"Error","@x":"System.Threading.Tasks.TaskCanceledException: A task was canceled.\r\n   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Seq.Forwarder.Shipper.HttpLogShipper.<OnTickAsync>d__20.MoveNext()"}

nblumhardt commented 6 years ago

Interesting - the TaskCanceledException is likely the HTTP client timing out, which might happen for a few reasons. Capturing/viewing the raw HTTP traffic with Fiddler might suggest some possibilities - perhaps that's the next avenue to explore?

Best regards, Nick

danjohnso commented 6 years ago

Aha! Didn't really think about this, but we are hitting the system port limit. Haven't dug too deeply into HttpClient usage in the code here, but I'm sure you have seen this article before (https://aspnetmonsters.com/2016/08/2016-08-27-httpclientwrong/).

Running netstat shows thousands of ports stuck in FIN_WAIT_1:

TCP server2:65429 server3:45003 FIN_WAIT_1
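To put a number on the exhaustion, the netstat output can be tallied by state. A small illustrative parser (a hypothetical helper, assuming the four-column `netstat -an` row format shown above):

```python
from collections import Counter

def count_states(netstat_output: str) -> Counter:
    """Tally TCP connection states from `netstat -an`-style output."""
    states = Counter()
    for line in netstat_output.splitlines():
        parts = line.split()
        # Windows netstat rows look like: TCP <local> <remote> <STATE>
        if len(parts) == 4 and parts[0] == "TCP":
            states[parts[3]] += 1
    return states

sample = """TCP server2:65429 server3:45003 FIN_WAIT_1
TCP server2:65430 server3:45003 FIN_WAIT_1
TCP server2:50000 server3:45003 ESTABLISHED"""
print(count_states(sample))  # Counter({'FIN_WAIT_1': 2, 'ESTABLISHED': 1})
```

On a healthy host the counts skew toward ESTABLISHED and TIME_WAIT; thousands of FIN_WAIT_1 entries to one destination, as seen here, point at connections being opened faster than they can be torn down.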

All of the task cancellations must be due to port exhaustion, which would probably account for most of the issues we are seeing (minus the two keys that appear corrupted).

I verified it is not our applications using up the ports, as we communicate with a different port on server3. We use a simple wrapper around our HttpClient that controls the socket lifetime and timeout; I can take a look at the usage here in the forwarder and try to fix it.

danjohnso commented 6 years ago

So @nblumhardt, without understanding a lot of the architecture of the HTTP shipper, I started a PR to prevent port exhaustion. The one thing I don't want to touch without your input is how the HttpClient gets built. We usually run it as a singleton per base URL, but the current pattern doesn't look like it will make that very easy.
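For reference, the "singleton per base URL" shape described above can be sketched as a memoized factory. This is transposed to Python purely for illustration (the forwarder itself is .NET, so this is an analogy, not its actual code):

```python
import http.client
from functools import lru_cache
from urllib.parse import urlparse

@lru_cache(maxsize=None)
def connection_for(base_url: str) -> http.client.HTTPConnection:
    """Return one shared connection object per base URL.

    Reusing the same client per destination avoids churning a new socket
    (and eventually an ephemeral port) for every request.
    """
    parts = urlparse(base_url)
    return http.client.HTTPConnection(parts.hostname, parts.port)

# Repeated lookups for the same base URL return the same object.
a = connection_for("http://server3:45003")
b = connection_for("http://server3:45003")
assert a is b
```

In .NET the equivalent is a long-lived HttpClient per destination rather than one per request, which is the usage the linked article warns about.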

See #38

danjohnso commented 6 years ago

So I installed that special build and after the weekend I still see the same issue. I checked prod and we are not seeing this behavior there, so I am going to go back to the firewall guys and see if something is getting dropped that causes the connections to recycle so much.

danjohnso commented 6 years ago

Alright @nblumhardt, the firewall was fine, but apparently there was a new intrusion-detection appliance that wasn't playing along. Looks like the port issue is resolved, but the corrupted keys are still a problem. I'm still seeing rejected keys of:

Cze9HNt8LbMCUS3Sd0T AtGHinrvVgBctn8CzgVg

I finally confirmed which apps are not logging, and it lines up one to one, so hopefully this helps. Their configured keys are:

X2RLyxNVI4HEDyvhBB fAtquOqNSagUwGI2tgV

nblumhardt commented 6 years ago

@danjohnso that's great news! Good to narrow things down to a more manageable problem.

Looking at the rejected keys vs the configured ones, it strikes me that these probably aren't a corruption; if they were, one would expect at the least that some characters would fall outside the valid alphanumeric set used by Seq API keys.

My current best guess is that the apps which should be logging with the new keys are somehow using older keys which are no longer valid. Is there a chance the deployment of those apps hasn't updated their configured API keys correctly, or perhaps the processes haven't restarted and picked up the new keys?

We could also be being thrown off by events buffered in the forwarder under old, no-longer-valid API keys; these events will keep being retried until they succeed.

One thing we can try is:

  1. Grab one of the rejected keys from the ingestion log
  2. Create a new API key, and specify the token from (1.) explicitly
  3. Configure the key so that a property is attached, like InvalidKey = yes (make sure you press "Add" beside the property name before saving the key)
  4. Wait an hour or two, to make sure the events are flushed through
  5. Run a log search with the filter Has(InvalidKey) to inspect the events that were sent with the key

This will at least tell us where the rejected events are coming from, which could provide some new angles to investigate.

What do you think?

danjohnso commented 6 years ago

So this exercise has been good justification for buying a Seq license.

I followed the steps above and all of the buffers flushed successfully. I found out there was an old dev server still turned on that I didn't know about (it was supposedly decommissioned), so that took care of one key. The other key turned out to be a deployment/source-control issue; with that resolved, the application is logging correctly even after deployments. The final application turned out to be a config-file issue: the web.config value had a period delimiter instead of a colon in the apiKey setting.

I confirmed that we are holding steady at a single port in use by HttpClient; I'm not sure that build is necessary for the main code branch, though. Before the IPS issue was discovered, I assumed the buffers were getting recreated on recycle or service restart, but an HttpClient per buffer would still improve throughput for most scenarios.

Thanks for your help in troubleshooting this!

nblumhardt commented 6 years ago

Fantastic news, thanks for closing the loop, Dan!