elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.
Other
123 stars 132 forks source link

Elastic Agent Policy is broken when JSON contains numeric string names #4200

Open aleksmaus opened 7 months ago

aleksmaus commented 7 months ago

Tested with the latest Agent 8.13. This is first uncovered with osquery when the users tried to use numbers as query id https://github.com/elastic/kibana/issues/175421

The incoming policy is correct. Narrowed it down to to c, err := config.NewConfigFrom(action.Policy) in PolicyChangeHandler https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/actions/handlers/handler_action_policy_change.go#L101

Repro code, for example, with JSON extracted from the policy

func TestNewConfigFrom(t *testing.T) {
    var m map[string]interface{}
    err := json.Unmarshal([]byte(testPolicy), &m)
    if err != nil {
        t.Fatal(err)
    }

    cfg, err := NewConfigFrom(m)
    if err != nil {
        t.Fatal(err)
    }
    m, err = cfg.ToMapStr()
    if err != nil {
        t.Fatal(err)
    }
    s, err := json.Marshal(m)
    _ = err
    fmt.Printf("RESULT:%s\n", string(s))
}

const testPolicy = `
{
    "inputs": [
        {
            "osquery": {
                "packs": {
                    "1233": {
                        "queries": {
                            "test": {
                                "interval": 60,
                                "query": "select * from uptime",
                                "removed": false,
                                "snapshot": true,
                                "timeout": 60
                            }
                        }
                    },
                    "test": {
                        "queries": {
                            "12312": {
                                "interval": 60,
                                "query": "select * from uptime",
                                "timeout": 60
                            }
                        },
                        "shard": 100
                    }
                }
            }
        }
    ]
}
`

Result (truncated) looks like this:

{
    "inputs": [
        {
            "osquery": {
                "packs": {
                    "0": null,
                    "1": null,
                    "10": null,
                    "100": null,
                    "1000": null,
                    "1001": null,
                    ...........................
                    "1225": null,
                    "1226": null,
                    "1227": null,
                    "1228": null,
                    "1229": null,
                    "123": null,
                    "1230": null,
                    "1231": null,
                    "1232": null,
                    "1233": {
                        "queries": {
                            "test": {
                                "interval": 60,
                                "query": "select * from uptime",
                                "removed": false,
                                "snapshot": true,
                                "timeout": 60
                            }
                        }
                    },
                    "124": null,
                    "125": null,
                    "994": null,
                    .....
                    "995": null,
                    "996": null,
                    "997": null,
                    "998": null,
                    "999": null,
                    "test": {
                        "queries": [
                            null,
                            null,
                            null,
                            null,
                            null,
                            .....................
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            {
                                "interval": 60,
                                "query": "select * from uptime",
                                "timeout": 60
                            }
                        ],
                        "shard": 100
                    }
                }
            }
        }
    ]
}

Full result is attached, since it doesn't fit into description result.json

aleksmaus commented 7 months ago

Assigning this to myself for now. Will dig some more today.

aleksmaus commented 7 months ago

Changing "numeric" strings to "text" in the input config eliminates the issue:

Input:

{
    "inputs": [
        {
            "osquery": {
                "packs": {
                    "a1233": {
                        "queries": {
                            "test": {
                                "interval": 60,
                                "query": "select * from uptime",
                                "removed": false,
                                "snapshot": true,
                                "timeout": 60
                            }
                        }
                    },
                    "test": {
                        "queries": {
                            "a12312": {
                                "interval": 60,
                                "query": "select * from uptime",
                                "timeout": 60
                            }
                        },
                        "shard": 100
                    }
                }
            }
        }
    ]
}

Output:

{
    "inputs": [
        {
            "osquery": {
                "packs": {
                    "a1233": {
                        "queries": {
                            "test": {
                                "interval": 60,
                                "query": "select * from uptime",
                                "removed": false,
                                "snapshot": true,
                                "timeout": 60
                            }
                        }
                    },
                    "test": {
                        "queries": {
                            "a12312": {
                                "interval": 60,
                                "query": "select * from uptime",
                                "timeout": 60
                            }
                        },
                        "shard": 100
                    }
                }
            }
        }
    ]
}
tomsonpl commented 7 months ago

Thank you @aleksmaus 🙇

aleksmaus commented 7 months ago

Abbreviated repro case with go-ucfg

func TestConfigBug(t *testing.T) {
    var m map[string]interface{}
    _ = json.Unmarshal([]byte(testConfig), &m)

    cfg, _ := ucfg.NewFrom(m)

    c := newConfigFrom(cfg)

    m, _ = c.ToMapStr()
    s, _ := json.Marshal(m)
    fmt.Printf("RESULT:%s\n", string(s))
}

const testConfig = `{"a":{"12":{}}}`

Output (the length of the array is 13 == 12+1):

RESULT:{"a":[null,null,null,null,null,null,null,null,null,null,null,null,null]}

Not super familiar with go-ucfg, but the field name is parsed into number here https://github.com/elastic/go-ucfg/blob/main/path.go#L79

and here is where the array idx+1 is created https://github.com/elastic/go-ucfg/blob/main/ucfg.go#L295

aleksmaus commented 7 months ago

Another unrelated caveat

func TestConfigBug(t *testing.T) {
    var m map[string]interface{}
    _ = json.Unmarshal([]byte(testConfig), &m)

    cfg, _ := ucfg.NewFrom(m)

    c := newConfigFrom(cfg)

    m, _ = c.ToMapStr()
    s, _ := json.Marshal(m)
    fmt.Printf("RESULT:%s\n", string(s))
}

const testConfig = `{"a":{"b12":{}}}`

Results in the output:

RESULT:{"a":{"b12":null}}

Which is not quite correct, since the it was "b12":{} (empty object) in the input document.

aleksmaus commented 7 months ago

the worse case

INPUT:

{"a":{"9223372036854775807":{}}}

RESULT:

panic: runtime error: makeslice: len out of range [recovered]
    panic: runtime error: makeslice: len out of range

INPUT:

{"a":{"9223372036854":{}}}

RESULT:

runtime: out of memory: cannot allocate 147573956411392-byte block (3833856 in use)
fatal error: out of memory
aleksmaus commented 7 months ago

Made some changed in go-ucfg lib in order to address the issues above, PR is open for review.

aleksmaus commented 7 months ago

Testing the change further with the agent and beats. Another place where it breaks is on the beats configuration parsing side: after the agent picked up the "workaround" and sends the correct configuration to the beats, the beats configuration parsing code breaks until go-ucfg and elastic-agent-libs changes are picked up.

Beats generates the config here: https://github.com/elastic/beats/blob/main/x-pack/libbeat/management/generate.go#L156

using elastic-agent-libs conf.NewConfigFrom https://github.com/elastic/elastic-agent-libs/blob/473983911d7c78e57bf30af91f66f5bca50d0a59/config/config.go#L81 where the config options are hardcoded https://github.com/elastic/elastic-agent-libs/blob/473983911d7c78e57bf30af91f66f5bca50d0a59/config/config.go#L44

So it looks like the final scope of the update, if we start with go-ucfg, would include:

  1. go-ucfg
  2. elastic-agent
  3. elastic-agent-libs (config)
  4. beats (would need recompile with the latest go-ucfg and elastic-agent-libs)
aleksmaus commented 7 months ago

After applying the change to elastic-agent-libs to call the newly added ucfg.EnableNumKeys(true) workaround API on go-ucfg, and recompiling the agent and beats with these changes, confirmed OS query is working with numeric query ID.

The incoming configuration is correct:

Screenshot 2024-02-08 at 2 20 11 PM

The results from the scheduled query are posted as expected:

Screenshot 2024-02-08 at 2 19 55 PM

Another possibly faster fix for osquery specific issue with much smaller change surface it to restrict users from assigning the numeric strings to the query ids. @tomsonpl opened a draft PR here https://github.com/elastic/kibana/pull/176507

Still, any place that allows the numeric keys to be inserted into the policy will introduce this kind of issue. Another possible place for example is the oquery advanced configuration section that allows users to specify the free form JSON configuration for osquery.

Another possible approach is to rewrite how incoming map[string]interface{} is parsed into Config so we don't have to work around/fix the current ucfg implementation and possibly break some other users who count on the current behavior, or maybe ditch the ucfg Config altogether. Need some feedback from the agent/beats team here on all the possible consequences of making this change.

@cmacknz @blakerouse @andrewkroh

cmacknz commented 6 months ago

ucfg is still used widely enough in our own code (both beats and agent) that fixing the problem there feels correct. Getting rid of uses of go-ucfg in one place doesn't help the other uses of it.

I can't imagine anyone is depending on this currently completely broken behavior.

aleksmaus commented 6 months ago

PR against go-ucfg is open https://github.com/elastic/go-ucfg/pull/198. I don't have permissions to set reviewers or any labels there though. I noticed @fearful-symmetry also had another PR opened a bit earlier that fixes/"works around" another oddity in that same library. The suggestion was to bring this up during the next agent meeting.

aleksmaus commented 6 months ago

The PR is merged. Need to update the Agent/beats/libs to pick up the version with this change.