elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

[RFC] Use dot notation for field references. #8772

Open jordansissel opened 6 years ago

jordansissel commented 6 years ago

My original implementation of field references in Logstash came in Logstash 1.2.0. This added [nested][syntax] for field references. Elasticsearch always used dots.

Background

My reason for choosing [brackets] was influenced by my experience with Graphite. Graphite uses dots as a tag separator for a metric, for example production.network.foo.out_bytes. At the time (this may be different now), Graphite only allowed a single dimension for a given tag (a label like "foo") and not a named value like "host=foo". This was problematic any time you tried to use a network address in a tag. For example, when tracking bandwidth per host with production.network.192.168.1.1.out_bytes, you suddenly have a tag named "192", which makes no sense. So we (users) would end up doing weird things like turning dots-in-labels into underscores, production.network.192_168_1_1.out_bytes, but this had its own user-experience hazards.
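The hazard can be sketched in a few lines of Python. This is only an illustration of splitting a dotted name on its separator; the split_metric helper is hypothetical and is not Graphite code:

```python
def split_metric(name: str) -> list:
    """Split a dotted metric name into its segments, as a dot-separated scheme would."""
    return name.split(".")


# The IP address is meant to be one label, but every dot is a separator.
segments = split_metric("production.network.192.168.1.1.out_bytes")
print(segments)
# ['production', 'network', '192', '168', '1', '1', 'out_bytes']

# The underscore workaround keeps the address in a single segment.
escaped = "production.network." + "192.168.1.1".replace(".", "_") + ".out_bytes"
print(split_metric(escaped))
# ['production', 'network', '192_168_1_1', 'out_bytes']
```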

This Graphite experience gave me pause when it came time to implement nested fields in Logstash. I chose to avoid giving dots a special meaning and used [brackets] instead. Brackets were chosen to align with the hash fetch syntax in many modern languages, for example event["foo"]["bar"] in Ruby, Python, etc.

Today

Having consistent field references across products would be nice.

Additionally, users frequently try to use dotted fields in Logstash and are confused when it doesn't work.

Proposal

Support dotted fields for field references in Logstash. Deprecate [bracket] notation.

These would become equivalent:
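As an illustration only, assuming the intent is that a dotted reference such as foo.bar resolves to the same path as [foo][bar], the equivalence could be sketched in Python. The parse_reference helper below is hypothetical and is not Logstash's actual parser:

```python
import re

# One bracketed segment, e.g. "[foo]"; nested brackets are not handled here.
_BRACKET_SEGMENT = re.compile(r"\[([^\[\]]+)\]")


def parse_reference(ref: str) -> list:
    """Turn a field reference into a list of path segments (hypothetical helper)."""
    if ref.startswith("["):
        segments = _BRACKET_SEGMENT.findall(ref)
        # Reject input such as "[foo]bar" that is not purely bracketed segments.
        if "".join("[" + s + "]" for s in segments) != ref:
            raise ValueError("malformed bracket reference: " + ref)
        return segments
    # Under the proposal, dots would separate path segments.
    return ref.split(".")


print(parse_reference("[foo][bar]"))  # ['foo', 'bar']
print(parse_reference("foo.bar"))     # ['foo', 'bar']
```

Note that this sketch gives dots a special meaning everywhere, which is exactly the trade-off discussed in the background above.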

Concerns

Not to remove [brackets]

I don't think we should necessarily remove bracket notation because doing so would break 4 years of blog posts, configs, examples, etc. Further, removing it may be unnecessarily damaging to users who are already comfortable with the syntax, and we have no strong argument (yet?) for why it should be removed and why they should be burdened with correcting their configs after we break them.

What about arrays?

Our bracket notation supports [foo][0] with 0 as an array offset if [foo] is an array. Dot notation in Elasticsearch has no such notion, to my knowledge.

In Elasticsearch, the following document has two values for foo.bar:

{
  "foo": { 
    "bar": [ "hello", "world" ]
  }
}

But you could access individual ones in Logstash with [foo][bar][0] and [foo][bar][1]. This is important for users as they use filters on complex objects.
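A minimal Python sketch of that lookup, assuming numeric segments act as list offsets when the current value is an array. The get_path helper is hypothetical and only mimics the behaviour described above, not Logstash's Event API:

```python
def get_path(event, segments):
    """Walk a nested structure; a segment indexes a list when the current value is one."""
    current = event
    for segment in segments:
        if isinstance(current, list):
            current = current[int(segment)]  # array offset, e.g. "0"
        else:
            current = current[segment]       # plain key lookup
    return current


event = {"foo": {"bar": ["hello", "world"]}}

print(get_path(event, ["foo", "bar"]))       # ['hello', 'world']
print(get_path(event, ["foo", "bar", "0"]))  # hello
print(get_path(event, ["foo", "bar", "1"]))  # world
```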

original-brownbear commented 6 years ago

@jordansissel we could simply support both for non array access if this actually helps users have a better experience? In terms of code it would be trivial to support both, not sure if there are many downsides to having both? Performance wise it's irrelevant since we're caching the key -> field reference relation anyhow.

jordansissel commented 6 years ago

@original-brownbear I'm OK supporting both, and I would personally choose having both because we have no strong reason to remove [bracket] field references.

Next steps:

1) Implement dotted field reference support
2) Document the new feature and explain why a user would use one vs the other.

branchnetconsulting commented 5 years ago

I recently burned hours trying to figure out why my Logstash config that used a mix of [] and . notation for a field name always failed to work. I had to use . notation for the field in the grok line since [] is not allowed there, but when I later tried to check whether the grok-extracted field exists, which requires [] field name notation, Logstash never saw the field. I am probably an intermediate user of Logstash at this point, and the current non-equivalence of a.b and [a][b] is confusing and counterintuitive to me. I imagine this has snared plenty of other Logstash users.

All that to say: I heartily support your proposal!

Thanks, Kevin Branch

yaauie commented 5 years ago

I had to use . notation for the field in the grok line since [] is not allowed there

🤔 the square-bracket field-reference implementation is acceptable in named grok captures:

input {
  generator {
    count => 1
    message => "2019-03-20T22:17:38Z INFO: it works"
  }
}
filter {
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:[log][timestamp]} %{LOGLEVEL:[log][level]}: %{GREEDYDATA:[log][data]}"
    }
  }
}
output {
  stdout { codec => rubydebug }
}

Produces:

{
      "sequence" => 0,
           "log" => {
            "level" => "INFO",
             "data" => "it works",
        "timestamp" => "2019-03-20T22:17:38Z"
    },
       "message" => "2019-03-20T22:17:38Z INFO: it works",
    "@timestamp" => 2019-03-20T22:19:42.551Z,
          "host" => "castrovel.local",
      "@version" => "1"
}

That said, I'm in favor of adding support for dot notation to the Event API, since I believe it will reduce confusion overall. Since I recently did work to formally define the square-bracket syntax, along with parser changes to handle ambiguous inputs, I'll add this to my back-burner.

branchnetconsulting commented 5 years ago

Ah, perhaps more specifically my issue was when using plain regex field extraction in grok like:

grok { match => [ "full_log", "\A%{SYSLOGTIMESTAMP} %{HOSTNAME} %{LOGLEVEL} %{PROG}[%{NUMBER}]: (?[0-9]+:[0-9]): [^:]+:%{WORD}:(?[0-9a-f]{8}): %{GREEDYDATA:apm.message}" ] }

Logstash fails if I try to replace

(?[0-9]+:[0-9])

with

(?<[apm][code]>[0-9]+:[0-9])

At least that was the case with the Logstash 6.4.0 instance I was using at the time.

Anyway, thanks for your response! Kevin Branch


krizb8 commented 5 years ago

But syntax: (?<MYFIELD:[apm][code]>[0-9]+:[0-9]) seems to work.

branchnetconsulting commented 5 years ago

Interesting, in your example is MYFIELD literal or am I supposed to substitute something for it? The name of my field is already spelled out with [apm][code] so I'm not sure what to make of MYFIELD.

Thanks! Kevin


krizb8 commented 5 years ago

Hi, I'm not quite sure. I take it as something like a temporary variable, so you can choose the name. I found this solution here: https://github.com/elastic/ecs/issues/39

Regards, Bohumil



branchnetconsulting commented 5 years ago

Thanks Bohumil! I will want to try that out. I appreciate that you shared the link to the posting where you found it, too.

Kevin


Supermathie commented 3 years ago

The fact that this behaviour differs substantially between the kibana grok debugger and logstash is very confusing and greatly devalues the grok debugger, unfortunately: https://discuss.elastic.co/t/interpolation-in-string-should-be-happening-but-isnt/266275/2?u=supermathie

maltewhiite commented 2 years ago

what is the difference between foo.bar and [foo][bar]?

maederm commented 2 years ago

@maltewhiite

what is the difference between foo.bar and [foo][bar]?

foo.bar in Logstash is just a single field: {"foo.bar": "xyz"}
[foo][bar] is a nested field: {"foo": { "bar": "xyz" }}

AFAIK for Elasticsearch search purposes it doesn't make a difference. But if you look at the _source field, it will have different content.

yaauie commented 2 years ago

what is the difference between foo.bar and [foo][bar]?

Elasticsearch will interpret either as a nested structure, but inside Logstash the field foo.bar is a flat literal key, while [foo][bar] actually is a nested structure.

Using the dot notation, the following is perfectly valid in Logstash because they are just flat keys, but Elasticsearch will be unable to unflatten the structure into something meaningful:

{
  "somefield": "string value",
  "somefield.subfield": "another string value"
}

Using the nested structure allows Logstash to avoid sending this type of invalid structure to Elasticsearch.
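To see why the flat keys above cannot be unflattened, here is a rough Python sketch. The unflatten helper is hypothetical; it only illustrates the conflict and does not reproduce Elasticsearch's actual mapping logic:

```python
def unflatten(flat: dict) -> dict:
    """Expand dotted keys into nested dicts, failing when a key is both a value and an object."""
    nested = {}
    for key, value in flat.items():
        parts = key.split(".")
        node = nested
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                raise ValueError(key + " conflicts: " + part + " already holds a plain value")
            node = child
        leaf = parts[-1]
        if isinstance(node.get(leaf), dict):
            raise ValueError(key + " conflicts: " + leaf + " already holds an object")
        node[leaf] = value
    return nested


# A lone dotted key expands cleanly.
print(unflatten({"foo.bar": "xyz"}))  # {'foo': {'bar': 'xyz'}}

# "somefield" cannot be both a string and an object with a "subfield" member.
try:
    unflatten({
        "somefield": "string value",
        "somefield.subfield": "another string value",
    })
except ValueError as error:
    print("cannot unflatten:", error)
```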