We have an application whose logs arrive in Seq for a while but then stop showing up. The other configured sink, the console, keeps working normally. This happens every time we run this app; other services with identical configuration don't have this problem. The SelfLog reports these kinds of errors around the time that the logging stops:
Looking at the file I indeed see malformed JSON:
I removed some sensitive information, but you can see that the first event is not a complete JSON object; it seems the first part is missing. We use the Durable sink, and the buffer file also contains these kinds of incomplete entries.
The ingestion logs also show this:
The invalid JSON issue also appears in some of the other services, but none of their logs stop showing up in Seq.
Any clue what may be going on here?
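(For completeness, the SelfLog output above is captured by enabling it at startup, roughly like this; a minimal sketch, and stderr is just the target we happen to use:)

```csharp
using System;
using Serilog.Debugging;

// Route Serilog's internal diagnostics (including sink errors) to stderr
// so that failures like the ones above are visible in container logs.
SelfLog.Enable(Console.Error);
```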
Hi Peter, thanks for getting in touch.
This looks like some kind of "bookmark file" mismatch or corruption issue; the kinds of things I'd look at are:
Digging into one of the log files from the service (in the durable set) to see what the JSON looks like might be useful, if you can get your hands on one.
Hope this helps, and let me know how you go, Nick
Hi Nick,
Thanks for replying so quickly. I forgot to mention, the service is running in a container, so contention with other processes seems unlikely. The filesystem is just the local container filesystem. The problem occurs every time a new container starts (meaning a clean file system).
If it is a "bookmark file" mismatch or corruption issue, it should be resolved by switching to the non-durable sink, correct? I tried that yesterday and the problem seemed to still occur. I will try again today to make sure.
I just tried disabling the Durable sink and the problem does not occur, so it is definitely related to the Durable sink. I will ask management if I can share one of the log files.
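For reference, the only difference between the two setups is the buffer setting; this is roughly the code equivalent of our appsettings configuration (a sketch; the paths and URLs are illustrative):

```csharp
using Serilog;

// Durable: supplying bufferBaseFilename makes the Seq sink buffer events
// to local files and ship them in the background.
var durable = new LoggerConfiguration()
    .WriteTo.Console()
    .WriteTo.Seq("https://seq.example.com",          // illustrative URL
        bufferBaseFilename: "./logs/seq-buffer")     // enables the durable sink
    .CreateLogger();

// Non-durable: omit bufferBaseFilename and events go directly to Seq.
var direct = new LoggerConfiguration()
    .WriteTo.Console()
    .WriteTo.Seq("https://seq.example.com")
    .CreateLogger();
```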
I have emailed you the relevant JSON files as they contain sensitive data. Let me know if you need anything else.
Thanks for the additional info, Peter.
"The filesystem is just the local container filesystem"
Sorry, I should have been clearer: in most cases this will be backed by some kind of storage volume; e.g. if the container is an Azure Container Instance it'll be Azure Files, on AWS it'll likely be EBS, and so on. What's the underlying technology stack?
I've checked the logs; the log file itself is fine. My suspicion is that this code:
doesn't handle out-of-order disk page flushes, which it ideally should; different filesystems/storage volumes will bring out these kinds of bugs, though.
Would be great to know more about the storage; I'll give the flush/sync issue some more thought. Cheers!
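To make the failure mode concrete: if pages are flushed out of order, a concurrent reader can observe later bytes of the buffer file before earlier ones, so a "line" in the file may not be a complete JSON event. A defensive reader skips torn records rather than forwarding them; very roughly (a sketch of the idea, not the sink's actual code):

```csharp
using System.Collections.Generic;
using System.IO;

static class BufferReader
{
    // Sketch only: forward records that look like a whole JSON object and
    // skip torn or partially-flushed lines instead of shipping bad JSON.
    public static IEnumerable<string> ReadCompleteEvents(string bufferFilePath)
    {
        // The writer may still be appending, so don't demand exclusive access.
        using (var stream = new FileStream(bufferFilePath, FileMode.Open,
            FileAccess.Read, FileShare.ReadWrite))
        using (var reader = new StreamReader(stream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Length > 1 && line[0] == '{' && line[line.Length - 1] == '}')
                    yield return line;
            }
        }
    }
}
```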
You're welcome, and sorry I wasn't clear about the file system. The container runs in Kubernetes on Azure (AKS). The log files are not placed on any explicitly mapped volume, so I would have thought Docker abstracts away any filesystem idiosyncrasies, right? In any case, the VMs that the nodes run on are of type "Standard_B4ms" and the OS disk type is "Premium SSD 100 GiB". Does that tell you anything?
Thanks for your reply.
If there's no persistent volume mapped then I'm not sure what the semantics will be on AKS. One way to rule this out as a possible cause would be to mount a ReadWriteOnce persistent volume based on Azure Disks (https://docs.microsoft.com/en-us/azure/aks/azure-disks-dynamic-pv).
If you have a chance to check out that option, it might provide a faster result than chasing possible flush/sync issues here, and might confirm whether or not the issue is storage-specific. There's a good chance that the bug is in this sink, but that it's benign on most of the common file systems - making it tricky to identify/address directly :-)
I think I've found the problem; it seems it was our own mistake.
You see, in this application we set up two IWebHost instances: the main one, plus a second that performs some data-seeding tasks at startup. That means .UseSerilog() is also called twice, and both read the same configuration from our appsettings. My guess is that having two HttpLogShipper instances reading from the same buffer files and writing to the same bookmark file is not a supported use case 🙂
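Roughly, our startup does this (an illustrative sketch, not our real code; the startup class names are made up):

```csharp
using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Hosting;
using Serilog;

public static class Program
{
    public static void Main(string[] args)
    {
        // Host 1: performs data seeding at startup, then is disposed.
        // UseSerilog reads the same appsettings, so it builds a durable Seq
        // sink pointed at the same buffer/bookmark files as the main host.
        using (var seedHost = WebHost.CreateDefaultBuilder(args)
                   .UseStartup<SeedingStartup>()   // hypothetical startup class
                   .UseSerilog((ctx, cfg) => cfg.ReadFrom.Configuration(ctx.Configuration))
                   .Build())
        {
            // ... run seeding tasks ...
        }

        // Host 2: the main application, with an identical Serilog setup,
        // so a second HttpLogShipper ends up sharing the same file set.
        WebHost.CreateDefaultBuilder(args)
            .UseStartup<Startup>()
            .UseSerilog((ctx, cfg) => cfg.ReadFrom.Configuration(ctx.Configuration))
            .Build()
            .Run();
    }
}
```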
🎉 that's great! Glad you were able to track it down.
The bookmark file should be used exclusively by one process; as you'll see in BookmarkFile, it's opened with FileShare.Read, and so is normally robust against this kind of issue.
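The open itself looks roughly like this (a sketch, not the exact source; the path is illustrative):

```csharp
using System.IO;

// FileShare.Read lets other processes *read* the bookmark while it's held,
// but a second attempt to open it for writing throws an IOException rather
// than silently interleaving writes.
var bookmark = new FileStream(
    "seq-buffer.bookmark",
    FileMode.OpenOrCreate,
    FileAccess.ReadWrite,
    FileShare.Read);
```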
I think we can probably put that down to the idiosyncrasies of the deployment environment and close this here.
Thanks for the follow-up 👍