elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Filebeat's journald input might return an empty message that breaks multiline parser #41331

Open belimawr opened 4 days ago

belimawr commented 4 days ago

When using the journald input to read from a file, there is a race condition that can happen when journalctl exits. Whenever the journald input starts and there is a chance of having messages from previous boots (depending on the filtering arguments), we first start the journalctl process without the --follow flag, read all messages until it exits successfully, then restart it with the --follow flag.
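A minimal standalone sketch of that two-phase startup, using plain os/exec and the real journalctl flags (--output=json, --follow). This only illustrates the sequence described above; it is not the actual input implementation:

package main

import (
    "bufio"
    "fmt"
    "os/exec"
)

// readJournal runs journalctl with the given extra arguments and prints
// every line it emits, returning when the process exits or reading fails.
func readJournal(extraArgs ...string) error {
    args := append([]string{"--output=json"}, extraArgs...)
    cmd := exec.Command("journalctl", args...)

    stdout, err := cmd.StdoutPipe()
    if err != nil {
        return err
    }
    if err := cmd.Start(); err != nil {
        return err
    }

    scanner := bufio.NewScanner(stdout)
    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
    // If the pipe is closed while we are still reading (the race described
    // below), the error surfaces here instead of a clean EOF.
    if err := scanner.Err(); err != nil {
        return err
    }
    return cmd.Wait()
}

func main() {
    // First run: no --follow, drain all existing messages until journalctl exits.
    if err := readJournal(); err != nil {
        fmt.Println("first run ended with:", err)
    }
    // Second run: restart with --follow to keep tailing new entries.
    if err := readJournal("--follow"); err != nil {
        fmt.Println("follow run ended with:", err)
    }
}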

During this first execution, there is a race condition where reading from journalctl's stdout can return the following error:

read |0: file already closed

That's an fs.PathError:

    err := fs.PathError{
        Op:   "read",
        Path: "|0",
        Err:  errors.New("file already closed"),
    }

This happens here: https://github.com/elastic/beats/blob/7be47da326413fc9ae6e96a0c4b467ba481b6210/filebeat/input/journald/pkg/journalctl/journalctl.go#L95
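For reference, this class of error can be reproduced outside of Beats: reading from an exec stdout pipe after Wait has closed it yields the same *fs.PathError. This is a standalone illustration of how the error arises, not the Beats code path:

package main

import (
    "errors"
    "fmt"
    "io/fs"
    "os/exec"
)

func main() {
    cmd := exec.Command("echo", "hello")
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        panic(err)
    }
    if err := cmd.Start(); err != nil {
        panic(err)
    }
    // Wait closes the pipe; a read that loses the race and happens
    // afterwards fails with "file already closed".
    if err := cmd.Wait(); err != nil {
        panic(err)
    }

    buf := make([]byte, 64)
    _, readErr := stdout.Read(buf)

    var pathErr *fs.PathError
    if errors.As(readErr, &pathErr) {
        // Prints: op=read path=|0 err=file already closed
        fmt.Printf("op=%s path=%s err=%v\n", pathErr.Op, pathErr.Path, pathErr.Err)
    }
}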

When this happens we just restart journalctl and keep reading; no data is lost.

However, this causes problems when the multiline parser is used. The error makes the reader return a recoverable error and an empty message; the multiline parser cannot handle this situation and returns early, leading to a wrong aggregation.
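As an illustration of the failure mode, here is a toy model of a count-based aggregation that flushes early on any error from its source. It is not the actual multiline parser code, only a sketch of the behaviour described above:

package main

import (
    "errors"
    "fmt"
    "strings"
)

// aggregate groups messages into blocks of countLines, but flushes the
// current block as soon as the source reports an error, mirroring the
// early return described above.
func aggregate(msgs []string, errs []error, countLines int) []string {
    var events []string
    var block []string
    for i, msg := range msgs {
        if errs[i] != nil {
            // Early return/flush: a partial event escapes downstream.
            if len(block) > 0 {
                events = append(events, strings.Join(block, "\n"))
                block = nil
            }
            continue
        }
        block = append(block, msg)
        if len(block) == countLines {
            events = append(events, strings.Join(block, "\n"))
            block = nil
        }
    }
    return events
}

func main() {
    msgs := []string{"1st line", "2nd line", "", "3rd line"}
    errs := []error{nil, nil, errors.New("read |0: file already closed"), nil}
    // With count_lines=3 the desired output is a single event
    // "1st line\n2nd line\n3rd line", but the injected error causes a
    // partial event to be emitted instead.
    for _, e := range aggregate(msgs, errs, 3) {
        fmt.Printf("%q\n", e)
    }
}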

This can be seen by running the following test from filebeat/input/journald multiple times until it fails (usually the easiest way is to run the test with the -count flag: go test -count=100 -run=TestInputParsers).

// TestInputParsers ensures journald input support parsers,
// it only tests a single parser, but that is enough to ensure
// we're correctly using the parsers
func TestInputParsers(t *testing.T) {
    inputParsersExpected := []string{"1st line\n2nd line\n3rd line", "4th line\n5th line\n6th line"}
    env := newInputTestingEnvironment(t)

    inp := env.mustCreateInput(mapstr.M{
        "paths":                 []string{path.Join("testdata", "input-multiline-parser.journal")},
        "include_matches.match": []string{"_SYSTEMD_USER_UNIT=log-service.service"},
        "parsers": []mapstr.M{
            {
                "multiline": mapstr.M{
                    "type":        "count",
                    "count_lines": 3,
                },
            },
        },
    })

    ctx, cancelInput := context.WithCancel(context.Background())
    env.startInput(ctx, inp)
    env.waitUntilEventCount(len(inputParsersExpected))

    for idx, event := range env.pipeline.clients[0].GetEvents() {
        if got, expected := event.Fields["message"], inputParsersExpected[idx]; got != expected {
            t.Errorf("expecting event message %q, got %q", expected, got)
        }
    }

    cancelInput()
}

A similar early return might happen if multiline is used and journalctl exits unexpectedly. This case still needs confirmation.

elasticmachine commented 4 days ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

cmacknz commented 1 day ago

What are the user visible symptoms of this?

When this happens we just restart journalctl and keep reading, no data is lost.

This is good; it sounds like it shouldn't block GA of the input, but if it has a symptom that would lead to support cases, I would want it fixed before GA regardless.

belimawr commented 1 day ago

What are the user visible symptoms of this?

If the multiline parser is used, messages are sent before all of its conditions are met. In the example we use the count method, which means every 3 lines of the input generate a single event. In that case, we get one event that contains a single line from the input.

So far I have only seen that happening when reading from journal files.

One case that has just crossed my mind: when https://github.com/elastic/beats/pull/41244 is merged, we will have journalctl being restarted at least once during normal operation. That could cause the same observable behaviour. At the moment the only issue I can think of is the multiline parser. I need to test it to be 100% sure that's the case.

but if this has a symptom that would lead to support cases I would want it to be fixed before GA regardless.

I believe it could generate support cases if people are using journald + multiline. I added a note to https://github.com/elastic/beats/pull/41244 to at least check whether this happens or not.