We have an application whose logs arrive in Seq for a while but then stop showing up. The other configured sink, the console, keeps working normally. This happens every time we run this app; other services with identical configuration don't have this problem. The SelfLog reports these kinds of errors around the time that the logging stops:
Looking at the file I indeed see malformed JSON:
I removed some sensitive information, but you can see that the first event is not a complete JSON object; it seems the first part is missing. We use the Durable sink, and the buffer file also contains these kinds of incomplete entries.
The ingestion logs also show this:
The invalid JSON issue also appears in some of the other services, but none of their logs stop showing up in Seq.
Any clue what may be going on here?
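(For completeness, the SelfLog output above is captured by enabling it at startup, roughly like this; a minimal sketch, and stderr is just the target we happen to use:)

```csharp
using System;
using Serilog.Debugging;

// Route Serilog's internal diagnostics (including sink errors) to stderr
// so that failures like the ones above are visible in container logs.
SelfLog.Enable(Console.Error);
```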
Hi Peter, thanks for getting in touch.
This looks like some kind of "bookmark file" mismatch or corruption issue; the kinds of things I'd look at are:
Digging into one of the log files from the service (in the durable set) to see what the JSON looks like might be useful, if you can get your hands on one.
Hope this helps, and let me know how you go, Nick
Hi Nick,
Thanks for replying so quickly. I forgot to mention, the service is running in a container, so contention with other processes seems unlikely. The filesystem is just the local container filesystem. The problem occurs every time a new container starts (meaning a clean file system).
If it is a "bookmark file" mismatch or corruption issue, it should be resolved by switching to the non-durable sink, correct? I tried that yesterday and the problem seemed to still occur. I will try again today to make sure.
I just tried disabling the Durable sink and the problem does not occur, so it is definitely related to the Durable sink. I will ask management if I can share one of the log files.
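For reference, the only difference between the two setups is the buffer setting; this is roughly the code equivalent of our appsettings configuration (a sketch; the paths and URLs are illustrative):

```csharp
using Serilog;

// Durable: supplying bufferBaseFilename makes the Seq sink buffer events
// to local files and ship them in the background.
var durable = new LoggerConfiguration()
    .WriteTo.Console()
    .WriteTo.Seq("https://seq.example.com",          // illustrative URL
        bufferBaseFilename: "./logs/seq-buffer")     // enables the durable sink
    .CreateLogger();

// Non-durable: omit bufferBaseFilename and events go directly to Seq.
var direct = new LoggerConfiguration()
    .WriteTo.Console()
    .WriteTo.Seq("https://seq.example.com")
    .CreateLogger();
```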
I have emailed you the relevant JSON files as they contain sensitive data. Let me know if you need anything else.
Thanks for the additional info, Peter.
"The filesystem is just the local container filesystem"
Sorry, I should have been clearer: in most cases this will be backed by some kind of storage volume; e.g. if the container is an Azure Container Instance it'll be Azure Files, on AWS it'll likely be EBS, and so on. What's the underlying technology stack?
I've checked the logs; the log file itself is fine. My suspicion is that this code:
doesn't handle out-of-order disk page flushes, which it ideally should; different filesystems/storage volumes will bring out these kinds of bugs, though.
Would be great to know more about the storage; I'll give the flush/sync issue some more thought. Cheers!
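To make the failure mode concrete: if pages are flushed out of order, a concurrent reader can observe later bytes of the buffer file before earlier ones, so a "line" in the file may not be a complete JSON event. A defensive reader skips torn records rather than forwarding them; very roughly (a sketch of the idea, not the sink's actual code):

```csharp
using System.Collections.Generic;
using System.IO;

static class BufferReader
{
    // Sketch only: forward records that look like a whole JSON object and
    // skip torn or partially-flushed lines instead of shipping bad JSON.
    public static IEnumerable<string> ReadCompleteEvents(string bufferFilePath)
    {
        // The writer may still be appending, so don't demand exclusive access.
        using (var stream = new FileStream(bufferFilePath, FileMode.Open,
            FileAccess.Read, FileShare.ReadWrite))
        using (var reader = new StreamReader(stream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Length > 1 && line[0] == '{' && line[line.Length - 1] == '}')
                    yield return line;
            }
        }
    }
}
```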
You're welcome, and sorry I wasn't clear about the file system. The container runs in Kubernetes on Azure (AKS). The log files are not placed on any explicitly mapped volume, so I would have thought Docker abstracts away any filesystem idiosyncrasies, right? In any case, the VMs that the nodes run on are of type "Standard_B4ms" and the OS disk type is "Premium SSD 100 GiB". Does that tell you anything?
Thanks for your reply.
If there's no persistent volume mapped then I'm not sure what the semantics will be on AKS. One way to rule this out as a possible cause would be to mount a ReadWriteOnce persistent volume based on Azure Disks (https://docs.microsoft.com/en-us/azure/aks/azure-disks-dynamic-pv).
If you have a chance to check out that option, it might provide a faster result than chasing possible flush/sync issues here, and might confirm whether or not the issue is storage-specific. There's a good chance that the bug is in this sink, but that it's benign on most of the common file systems - making it tricky to identify/address directly :-)
I think I've found the problem; it seems it was our own mistake.
You see, in this application we set up two IWebHost instances: the main one, plus a second that performs some data-seeding tasks at startup. That means .UseSerilog() is also called twice, and both read the same configuration from our appsettings. My guess is that having two HttpLogShipper instances reading from the same buffer files and writing to the same bookmark file is not a supported use case 🙂
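Roughly, our startup does this (an illustrative sketch, not our real code; the startup class names are made up):

```csharp
using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Hosting;
using Serilog;

public static class Program
{
    public static void Main(string[] args)
    {
        // Host 1: performs data seeding at startup, then is disposed.
        // UseSerilog reads the same appsettings, so it builds a durable Seq
        // sink pointed at the same buffer/bookmark files as the main host.
        using (var seedHost = WebHost.CreateDefaultBuilder(args)
                   .UseStartup<SeedingStartup>()   // hypothetical startup class
                   .UseSerilog((ctx, cfg) => cfg.ReadFrom.Configuration(ctx.Configuration))
                   .Build())
        {
            // ... run seeding tasks ...
        }

        // Host 2: the main application, with an identical Serilog setup,
        // so a second HttpLogShipper ends up sharing the same file set.
        WebHost.CreateDefaultBuilder(args)
            .UseStartup<Startup>()
            .UseSerilog((ctx, cfg) => cfg.ReadFrom.Configuration(ctx.Configuration))
            .Build()
            .Run();
    }
}
```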
🎉 that's great! Glad you were able to track it down.
The bookmark file should be used exclusively by one process; as you'll see in BookmarkFile, it's opened with FileShare.Read, and so is normally robust against this kind of issue.
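The open itself looks roughly like this (a sketch, not the exact source; the path is illustrative):

```csharp
using System.IO;

// FileShare.Read lets other processes *read* the bookmark while it's held,
// but a second attempt to open it for writing throws an IOException rather
// than silently interleaving writes.
var bookmark = new FileStream(
    "seq-buffer.bookmark",
    FileMode.OpenOrCreate,
    FileAccess.ReadWrite,
    FileShare.Read);
```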
I think we can probably put that down to the idiosyncrasies of the deployment environment and close this here.
Thanks for the follow-up 👍