datalust / serilog-sinks-seq

A Serilog sink that writes events to the Seq structured log server
https://datalust.co/seq

Corrupted json generated under load (?) #207

Closed MaceWindu closed 7 months ago

MaceWindu commented 7 months ago

We have an issue where the generated payload JSON is rejected by Seq because two characters are missing from it. Basically it looks the same as issue #28.

It can happen xx times a day or not happen for weeks. Since it usually happens when the server is struggling under high load, I think it could be a race in the code, but I haven't been able to find anything suspicious in the library code so far.

The library version is 6.0.0.

Seq logs:

[2024-02-17T03:04:06.3843507+00:00 DBG] Rejecting CLEF payload from <client_ip> due to invalid JSON, item could not be parsed; 

Application logs:

Logger crashed with 2024-02-17T07:03:52.7761603Z HTTP shipping failed with BadRequest; dumping payload to <path-to.json>
nblumhardt commented 7 months ago

Thanks for the heads-up; are you also using the bufferBaseFilename option as in the other ticket? Could you please share some additional details of your Serilog.Sinks.Seq setup - options used, operating system, .NET version? Thanks!

MaceWindu commented 7 months ago
var levelSwitch = new LoggingLevelSwitch()
{
    MinimumLevel = LogEventLevel.Verbose
};

var loggerConfig = new LoggerConfiguration().MinimumLevel.ControlledBy(levelSwitch);
// + bunch of property enrichers

var logger = loggerConfig
    .WriteTo.Seq(
        seqUrl,
        //apiKey: settings.SeqApiKey,
        controlLevelSwitch: levelSwitch,
        bufferBaseFilename: buffer)
    .CreateLogger();

// Microsoft.Extensions.Logging factory
var factory = LoggerFactory.Create(builder => builder.AddSerilog(logger, dispose: true));

I'm currently trying to reproduce the issue in a sample console app, without success so far.

But I was able to reproduce another issue we sometimes have: when I use large messages (still smaller than the default limit of ~261k), the provider often tries to send batches larger than 10 MB (up to ~100 MB), which fail with a RequestEntityTooLarge error from the server. The errors go away once the message sizes are reduced below some point (a possible way to cap batch sizes is sketched after the repro below).

RequestEntityTooLarge error repro. Here, with 2_000 array elements (arrays are used to get a high density of JSON syntax in the message, to help detect the original issue), I observe payloads of ~14 MB while the maximum message size is ~23 KB. With bigger arrays the payload size increases.

var log1 = factory.CreateLogger("Test1");
var log2 = factory.CreateLogger("Test 2");
var log3 = factory.CreateLogger("Tes3");

var size = 2_000;
var data1 = new int[size];
var data2 = new bool[size];
var data3 = new TimeSpan[size];
for (var i = 0; i < data1.Length; i++)
    data1[i] = i;
for (var i = 0; i < data2.Length; i++)
    data2[i] = i % 2 == 0;
for (var i = 0; i < data3.Length; i++)
    data3[i] = TimeSpan.FromMinutes(i);

var size1 = Encoding.UTF8.GetBytes(JsonSerializer.Serialize(data1)).Length;
var size2 = Encoding.UTF8.GetBytes(JsonSerializer.Serialize(data2)).Length;
// ~22Kb
var size3 = Encoding.UTF8.GetBytes(JsonSerializer.Serialize(data3)).Length;

for (var i = 0; i < 50; i++)
{
    Task.Run(() =>
    {
        var cnt = 100_000;
        while (cnt-- > 0)
        {
            log1.LogError("Test:{data1}", data1);
            log2.LogWarning("Test: {data2}", data2);
            log3.LogInformation("T est: {data3}", data3);
        }
    });
}

await Task.Delay(TimeSpan.FromHours(1));
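
A possible mitigation for the oversized requests, as a rough sketch only: cap how many events go into a single HTTP request and how large an individual event may be. This assumes the batchPostingLimit and eventBodyLimitBytes parameters exposed by WriteTo.Seq in Serilog.Sinks.Seq 6.x (the defaults are reportedly around 1000 events per batch and ~262 KB per event, so batches of ~14-23 KB events in the 14-23 MB range would be plausible); the numbers below are illustrative, not recommendations.

using Serilog;

var cappedLogger = new LoggerConfiguration()
    .WriteTo.Seq(
        seqUrl,                        // same server URL as in the configuration shared earlier
        batchPostingLimit: 100,        // fewer events per HTTP request => smaller payloads
        eventBodyLimitBytes: 65536)    // drop individual events larger than ~64 KB
    .CreateLogger();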
nblumhardt commented 7 months ago

Thanks for the follow-up. Unless you're working with occasionally-connected machines, I'd strongly recommend using direct delivery instead of local buffering.

Because "durable" mode uses a simple local text file instead of (e.g.) a database engine, it doesn't have the degree of robustness required for low-maintenance/high-throughput logging.

Without the buffer file, the sink still internally uses a memory buffer to smooth out occasional periods of network or server unavailability (up to around ten minutes), so it's rare to lose logs just because of say, a server restart.

If you're concerned about having a fail-safe copy of the data anyway, writing the logs to a rolling set of newline-delimited JSON files locally should provide that.
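
For example, a minimal sketch of that setup (assuming the Serilog.Sinks.File and Serilog.Formatting.Compact packages; the file path and retention settings are illustrative):

using Serilog;
using Serilog.Formatting.Compact;

var logger = new LoggerConfiguration()
    // Direct delivery: no bufferBaseFilename, so the sink ships from its in-memory queue.
    .WriteTo.Seq(seqUrl)
    // Fail-safe local copy as rolling newline-delimited JSON (CLEF) files.
    .WriteTo.File(
        new CompactJsonFormatter(),
        "logs/app-.clef",                        // hypothetical path; rolls to e.g. app-20240217.clef
        rollingInterval: RollingInterval.Day,
        retainedFileCountLimit: 31)              // keep roughly a month of local files
    .CreateLogger();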

Let me know if this helps, or if you would like to dig into this scenario further.

MaceWindu commented 7 months ago

That's clear. We prefer to keep buffering on even with such errors, as they are really rare. E.g. yesterday we discovered that one of our processes was collecting logs to files but, for some reason, had not been sending them to Seq for days (I can't say why, as I haven't looked into it yet; it could be our issue), so buffering helped us preserve those logs and send them to Seq after an application restart.

Let's close it for now. I will look into it later (your words gave me the idea to test application/Seq restart scenarios), and if that produces any results I will reopen it. Still, it would be nice to have a fix for the second issue with oversized requests, as it is easy to reproduce.