Closed danjohnso closed 6 years ago
Hi Dan - thanks for writing up the details; sorry I dropped the ball on this.
I inspected the source to see whether I could find any clues why this might be happening, but didn't have any success. I'll take another look now and see if I can come up with any hypotheses.
I've run through the code again and can't find any convincing leads. I think there are two possible cases that could cause this (ruling out the ones that I think are excluded by invariants in the forwarder code):
The first case is that the forwarder is sending an incorrect API key value. If this is the case, you'll see messages in Seq's Diagnostics > Ingestion log in pairs, like:
2017-12-29 06:12:47 Payload from "::1" specified invalid API key "74BqyqgAHgXQxXwhJPn2"
2017-12-29 06:12:47 Rejecting payload from "::1" due to missing or invalid API key
I think this situation would most likely be caused by a bug in the forwarder, e.g. around encrypting, storing, loading, and decrypting API keys.
The second case is that the forwarder is sending no API key at all. This would produce single events in the Seq ingestion log, like:
2017-12-29 06:12:47 Rejecting payload from "::1" due to missing or invalid API key
This could also be a bug in the forwarder, but I have traced through the code and think we'd see an exception earlier in the process if that were the case. Is there a chance that the firewall (if it's not a simple firewall) could be stripping the X-Seq-ApiKey header from requests?
Best regards, Nick
No worries, I was off for a while here and was kicking this down the road until the overlords figure out how an email filter works...
On the ingestion log, I do see:
2017-12-28 13:08:09 Rejecting payload from "server2" due to missing or invalid API key
2017-12-28 13:09:30 Payload from "server2" specified invalid API key "AtGHinrvVgBctn8CzgVg"
2017-12-28 13:09:30 Rejecting payload from "server2" due to missing or invalid API key
2017-12-28 13:09:30 Payload from "server2" specified invalid API key "Cze9HNt8LbMCUS3Sd0T"
2017-12-28 13:09:30 Rejecting payload from "server2" due to missing or invalid API key
Based on that, I would guess an encryption/decryption issue; I can't find those key values in Seq or in any of our application configurations. Since several applications are getting through, I doubt the firewall is stripping the header, but I will check to be sure when the firewall team gets back next year.
Thanks for the reply. Would it be possible to list all of the API keys that have been generated? This would help track down e.g. any truncation issues; the second API key above is only 19 characters, which is one character short of what Seq will normally generate.
Thanks!
Here are all the keys generated:
pxPOVhJUJ02DlekHcvdI KLpSXVORSaEO6MNKjan7 vahkFzU4g3bQyi02sMD1 qlhjeyWPRvapqCnO9rwo HmRV9hIQTaR1Cs3Ocq7 uryg55W8gBqa6Pwd82 37kTOd0zS0mnXnxDKnBW X2RLyxNVI4HEDyvhBB fAtquOqNSagUwGI2tgV kcCJFEpum5DfZlHRD 0Zs6YhuZGwnASs1FDL 9KQVGcQp2D3BGpxiqyJW QcmDoodcJ9mCaYrAs eaeuNCtUadJzaVjkWaSz vzviDUx7JY7hF9XMpuu2 EIfm0ZuoEPx83JRWOH
I know for sure this application key is not working: fAtquOqNSagUwGI2tgV. Not sure right now which other one is not logging; let me try to find out.
Hi Dan- any luck tracking down which is the other key? It'd be great to know whether the problematic ones correlate with the ones being reported in the ingestion log in any way.
It's not entirely correlated. I also know this key is not sending, but nothing around the right time exists in the ingestion log: qlhjeyWPRvapqCnO9rwo
We have added file loggers to everything as a secondary log, but I haven't been able to narrow down what that second key is in the ingestion log.
I'm not seeing any access-denied errors today on the shipper, just TaskCanceledExceptions:
{"@t":"2018-01-08T17:14:49.2359274Z","@m":"Responding with Created","@i":"becc26e7","@l":"Debug","StatusCode":"Created"}
{"@t":"2018-01-08T17:15:06.1580844Z","@m":"Creating a new API key-specific log buffer in \"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\f9627d91321f4aca8aaa2b6b50aaecec\"","@i":"c3201009","Path":"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\f9627d91321f4aca8aaa2b6b50aaecec","SourceContext":"Seq.Forwarder.Multiplexing.ActiveLogBufferMap"}
{"@t":"2018-01-08T17:15:06.1580844Z","@m":"Log buffer open on \"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\f9627d91321f4aca8aaa2b6b50aaecec\"; 0 entries, next key will be 1","@i":"505684b8","BufferPath":"D:\\ProgramData\\Seq\\Forwarder\\Buffer\\f9627d91321f4aca8aaa2b6b50aaecec","Entries":0,"NextId":1}
{"@t":"2018-01-08T17:15:06.1580844Z","@m":"Log shipper started, events will be dispatched to \"server3\"","@i":"8bd6415c","ServerUrl":"server3"}
{"@t":"2018-01-08T17:15:06.1580844Z","@m":"Responding with Created","@i":"becc26e7","@l":"Debug","StatusCode":"Created"}
{"@t":"2018-01-08T17:15:06.3611850Z","@m":"Exception while sending a batch from the log shipper","@i":"3d7bf92e","@l":"Error","@x":"System.Threading.Tasks.TaskCanceledException: A task was canceled.\r\n at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n at Seq.Forwarder.Shipper.HttpLogShipper.<OnTickAsync>d__20.MoveNext()"}
{"@t":"2018-01-08T17:15:08.3299708Z","@m":"Responding with Created","@i":"becc26e7","@l":"Debug","StatusCode":"Created"}
{"@t":"2018-01-08T17:15:10.3456084Z","@m":"Responding with Created","@i":"becc26e7","@l":"Debug","StatusCode":"Created"}
{"@t":"2018-01-08T17:15:10.8925135Z","@m":"Responding with Created","@i":"becc26e7","@l":"Debug","StatusCode":"Created"}
{"@t":"2018-01-08T17:15:12.3612873Z","@m":"Responding with Created","@i":"becc26e7","@l":"Debug","StatusCode":"Created"}
{"@t":"2018-01-08T17:15:14.6425644Z","@m":"Exception while sending a batch from the log shipper","@i":"3d7bf92e","@l":"Error","@x":"System.Threading.Tasks.TaskCanceledException: A task was canceled.\r\n at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n at Seq.Forwarder.Shipper.HttpLogShipper.<OnTickAsync>d__20.MoveNext()"}
{"@t":"2018-01-08T17:15:23.5489544Z","@m":"Exception while sending a batch from the log shipper","@i":"3d7bf92e","@l":"Error","@x":"System.Threading.Tasks.TaskCanceledException: A task was canceled.\r\n at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n at Seq.Forwarder.Shipper.HttpLogShipper.<OnTickAsync>d__20.MoveNext()"}
Interesting - the TaskCanceledException is likely the HTTP client timing out, which might happen for a few reasons. Capturing/viewing the raw HTTP traffic with Fiddler might suggest some possibilities - perhaps that's the next avenue to explore?
Best regards, Nick
Aha! I didn't really think about this, but we are hitting the system port limit. I haven't dug too deeply into the HttpClient usage in the code here, but I'm sure you have seen this article before: https://aspnetmonsters.com/2016/08/2016-08-27-httpclientwrong/
Running netstat shows thousands of ports stuck in FIN_WAIT_1:
TCP server2:65429 server3:45003 FIN_WAIT_1
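To quantify this kind of exhaustion, the per-state totals can be tallied from netstat output. A minimal Python sketch, using a hypothetical `count_states` helper and a made-up sample of `netstat -an`-style lines (in practice you'd feed it the real command's output):

```python
import collections

# Hypothetical sample of `netstat -an`-style output on Windows:
# proto, local address, remote address, state.
SAMPLE = """\
TCP    server2:65429    server3:45003    FIN_WAIT_1
TCP    server2:65430    server3:45003    FIN_WAIT_1
TCP    server2:50123    server3:443      ESTABLISHED
"""

def count_states(netstat_output: str) -> dict:
    """Tally TCP connection states from netstat-style lines."""
    counts = collections.Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Keep only well-formed TCP rows; the state is the last column.
        if len(fields) == 4 and fields[0] == "TCP":
            counts[fields[3]] += 1
    return dict(counts)

print(count_states(SAMPLE))  # → {'FIN_WAIT_1': 2, 'ESTABLISHED': 1}
```

Thousands of entries in FIN_WAIT_1 or TIME_WAIT against the same destination, as described above, is the classic signature of opening a new socket per request.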
All of the task cancellations must be port exhaustion, which would probably account for most of the issues we are seeing (minus the 2 keys that appear corrupted).
I verified it is not our applications using up the ports, as we communicate to a different port on server3. We use a simple wrapper around our HttpClient that controls the socket lifetime and timeout; I can take a look at the usage here in the forwarder and try to fix it.
So @nblumhardt, without understanding a lot of the architecture of the HTTP shipper, I started a PR to prevent port exhaustion. The one thing I don't want to touch without your input is how the HttpClient gets built. We usually run it as a singleton per base URL, but the current pattern doesn't look like it will make that very easy.
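The "singleton per base URL" pattern Dan describes is language-agnostic: keep one reusable client per destination so every batch send reuses the same socket instead of opening a fresh one. A minimal Python analogue (the `get_client` helper and cache are hypothetical, not the forwarder's actual API):

```python
import threading
from http.client import HTTPConnection
from urllib.parse import urlsplit

# One cached client per destination host:port, guarded for thread safety.
_clients: dict[str, HTTPConnection] = {}
_lock = threading.Lock()

def get_client(base_url: str) -> HTTPConnection:
    """Return a single shared connection object per base URL.

    Creating a client per request leaks sockets into FIN_WAIT/TIME_WAIT;
    caching one per destination keeps the open-port count flat.
    """
    host = urlsplit(base_url).netloc
    with _lock:
        if host not in _clients:
            # HTTPConnection connects lazily, on the first request.
            _clients[host] = HTTPConnection(host)
        return _clients[host]
```

Calling `get_client("http://server3:45003")` twice returns the same object, so repeated sends to the same Seq server share one socket; in .NET the equivalent is sharing a single long-lived HttpClient (or handler) per base address rather than constructing one per batch.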
See #38
So I installed that special build and, after the weekend, still see the same issue. I checked prod and we are not seeing this behavior there, so I am going to go back to the firewall guys and see if something is getting dropped that causes this to recycle so much.
Alright @nblumhardt, the firewall was fine, but apparently there was a new intrusion-detection piece of hardware that wasn't playing along. Looks like the port issue is resolved, but the corrupted keys are still a problem. Still seeing rejected keys of:
Cze9HNt8LbMCUS3Sd0T AtGHinrvVgBctn8CzgVg
Finally confirmed which apps are not logging and it lines up as a 1 to 1, so hopefully this helps.
X2RLyxNVI4HEDyvhBB fAtquOqNSagUwGI2tgV
@danjohnso that's great news! Good to narrow things down to a more manageable problem.
Looking at the rejected keys vs the configured ones, it strikes me that these probably aren't a corruption: if they were, one would expect at least some characters falling outside the valid alphanumeric set used by Seq API keys.
My current best guess is that the apps which should be logging with the new keys are somehow using older keys which are no longer valid. Is there a chance the deployment of those apps hasn't updated their configured API keys correctly, or perhaps the processes haven't restarted and picked up the new keys?
We could also be thrown off by events buffered in the forwarder using old, no-longer-valid API keys; these events will keep getting retried until they succeed.
One thing we can try is:
1. In Seq, create the rejected values as API keys, each with an applied property InvalidKey = yes (make sure you press "Add" beside the property name before saving the key)
2. Search with Has(InvalidKey) to inspect the events that were sent with the key
This will at least tell us where the rejected events are coming from, which could provide some new angles to investigate.
What do you think?
So this exercise has been good justification for buying a Seq license.
Followed the steps above and all of the buffers flushed successfully. It turned out there was an old dev server still turned on that I didn't know about and was supposedly decommissioned... so that took care of one key. The other key turned out to be a deployment/source-control issue; got that resolved, so that application is logging correctly even after deployments. The final application turned out to be a config file issue: the web.config value had a period delimiter instead of a colon on the apiKey.
Confirmed that we are holding steady at a single port of usage by HttpClient; I'm not sure that build is necessary for the main code branch, though. Before the IPS issue was discovered, I assumed the buffers were getting recreated on recycle or service restart, but an HttpClient per buffer would improve throughput for most scenarios.
Thanks for your help in troubleshooting this!
Fantastic news, thanks for closing the loop, Dan!
Sorry to put this out here, but my work email is not letting me unblock your support email account right now.
Server 1: Apps, log to forwarder on Server 2
Server 2: Forwarder host, also apps that log directly to the forwarder on this same server
----------------Firewall (port-specific rule allows Seq traffic from Server 2 to Server 3)----------------
Server 3: Seq, also some apps that log directly to Seq
We are still seeing random applications whose logs the forwarder refuses to send, even after changing out the API key. Some applications have no trouble. All of our applications have API keys assigned and are using them, but we still get:
I have tried removing the forwarder and all of the files (buffers) from the machine and reinstalling, but got the same results. For kicks I tried assigning the forwarder its own API key, but that does not seem to help.
Also, I wanted to throw out there that Seq Forwarder Administration should read the existing settings if the forwarder is already configured; it is annoying to have to retype the info if I open it up again.