Discuss Append fields - Githubissues

guyboertje commented 6 years ago

One use of append fields is to treat a section of text that has delimiters in it, that are the same as later delimiters, as a single unit of text.

An example: imagine you have a first name and last name followed by a telephone number, e.g. preamble Jake Landis 816 555 6675 postamble

Obviously, one can give individual fields for each group and then use a later filter to join them back together with a custom separator: %{first} %{last} %{nac} %{sac} %{snum}

One can use append fields to recreate the original text in logical groups: %{name} %{+name} %{phone} %{+phone} %{+phone}

Clearly, for fidelity, when recreating the original text the delimiters must be added back.
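To make the "delimiters are added back" behaviour concrete, here is a minimal Python sketch of a dissect-like matcher. It is an illustration only, not the Logstash implementation: the function name `dissect` and the regex-based matching are assumptions made for this example, and it supports only plain `%{key}` and `%{+key}` fields.

```python
import re

def dissect(pattern, text):
    """Hypothetical dissect-like matcher (illustration only)."""
    # Splitting on %{...} yields literal delimiters at even indices
    # and key names at odd indices.
    parts = re.split(r"%\{(\+?\w+)\}", pattern)
    delimiters, keys = parts[0::2], parts[1::2]

    # Build a regex with one lazy capture group per key.
    regex = re.escape(delimiters[0])
    for delim in delimiters[1:]:
        regex += "(.*?)" + re.escape(delim)
    match = re.fullmatch(regex, text)
    if match is None:
        return None

    result = {}
    for i, (key, value) in enumerate(zip(keys, match.groups())):
        if key.startswith("+"):
            name = key[1:]
            if name in result:
                # Append: add back the delimiter that preceded this key.
                result[name] += delimiters[i] + value
            else:
                result[name] = value
        else:
            result[key] = value
    return result

fields = dissect(
    "preamble %{name} %{+name} %{phone} %{+phone} %{+phone} postamble",
    "preamble Jake Landis 816 555 6675 postamble",
)
print(fields)
```

With this rule the name and phone number come back exactly as they appeared in the source text, spaces included.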

And then there are dates. I believe these to be a special case. I have always found it silly to isolate the date time elements and then join them together and have a date filter reparse it again.

I think we should consider a change to the Date filters in all three places where the user can specify the fields from which a date can be constructed.

For a text timestamp "2018-06-27T15:42:59+02:00", we can dissect it as %{Y}-%{M}-%{D}T%{h}:%{m}:%{s}+%{tz} or %{[ts][Y]}-%{[ts][M]}-%{[ts][D]}T%{[ts][h]}:%{[ts][m]}:%{[ts][s]}+%{[ts][tz]} as namespaced. Then have the Date filter, explicitly or not, use those fields to build a date.
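As a rough sketch of what "use those fields to build a date" could look like, the Python below constructs a timezone-aware datetime directly from the namespaced fields, without joining them back into a string and reparsing. The function name `date_from_fields` and the dict layout are assumptions for illustration; no existing Date filter option is implied.

```python
from datetime import datetime, timedelta, timezone

def date_from_fields(ts):
    """Build a datetime from already-dissected string fields (hypothetical helper)."""
    tz_hours, tz_minutes = (int(part) for part in ts["tz"].split(":"))
    # The dissect pattern consumes a literal "+", so this sketch only
    # handles positive UTC offsets.
    offset = timezone(timedelta(hours=tz_hours, minutes=tz_minutes))
    return datetime(
        int(ts["Y"]), int(ts["M"]), int(ts["D"]),
        int(ts["h"]), int(ts["m"]), int(ts["s"]),
        tzinfo=offset,
    )

# Fields as the namespaced pattern would produce them for
# "2018-06-27T15:42:59+02:00".
ts = {"Y": "2018", "M": "06", "D": "27", "h": "15", "m": "42", "s": "59", "tz": "02:00"}
event_time = date_from_fields(ts)
print(event_time.isoformat())
```

The format knowledge lives only in the dissect pattern; the date step just consumes named fields.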

What other use cases are there for append fields?

jakelandis commented 6 years ago

when recreating the original text the delimiters must be added back.

I don't think this is always the case. For example, maybe you want the name as Landis,Jake, or the phone number as 816-555-6675 or 8165556675 (then stored as a long for efficiency). This is why the user-specified separator was introduced.
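A small Python sketch of the user-specified separator idea: the appended pieces are ordered and then joined with whatever separator the user configures, rather than with the original delimiters. The helper name `join_appended` and the `(order, value)` representation are assumptions made for this example.

```python
def join_appended(pieces, separator):
    """Join (order, value) pieces with a user-chosen separator (hypothetical helper)."""
    ordered = sorted(pieces, key=lambda piece: piece[0])
    return separator.join(value for _, value in ordered)

# "Jake Landis" captured in source order, but with Landis ordered first.
name = join_appended([(2, "Jake"), (1, "Landis")], ",")

phone_pieces = [(1, "816"), (2, "555"), (3, "6675")]
dashed = join_appended(phone_pieces, "-")
compact = int(join_appended(phone_pieces, ""))  # storable as a number

print(name, dashed, compact)
```

None of these three outputs can be produced by re-inserting the original space delimiters, which is the motivation given above.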

And then there are dates. I believe these to be a special case

I agree that it is a bit of a special case, but would also argue that it is probably pretty rare. ISO8601 seems to be a pretty popular standard, which shouldn't need dissection and is supported by the date filter/processor.

I think we should consider a change to the Date filters in all three places where the user can specify the fields from which a date can be constructed.

I don't think it's a bad idea; I'm just concerned about the need (will it be used, or is ISO8601 plus one-offs sufficient?), and possibly the configuration clarity.

guyboertje commented 6 years ago

when recreating the original text the delimiters must be added back

The pertinent part is "recreating the original text".

Another example of this would be a CSV-like string: Landis, Jake, 816-555-6675. If the user wanted to view this as two fields, a name and a phone number, the append field recreates the name in its original form.

In my opinion, instead of a user specified separator, we should direct the user to join two normal fields using a separator of their choice using add_field and remove_field which they can specify in the same filter.

Dates: any date that can be extracted as is without needing append fields to rebuild it, is fine and not a concern. I was trying to illustrate that it seems wasteful to have the user supply date format tokenisation in dissect and then again in a date filter/processor.

jakelandis commented 6 years ago

The pertinent part is "recreating the original text".

I think there are still some options to achieve this with the user specified separator. Given the example:

Landis, Jake, 816-555-6675

This could be recreated using the two characters ", " (comma space) as the separator. However, there are cases like Landis:Jake, 816-555-6675, where the delimiters differ, which would be harder to reconstruct with the appends. I would consider that more of an edge case, and it would require manually joining the fields as you describe.

The motivation for having a user-defined separator is as described in my prior comment. Also, using the delimiter as the separator breaks my mental model of how dissect works. I think of the patterns as defining everything I don't want and the keys as everything in between. Preserving the delimiters, but only for the append, seems odd.

This contrived example (also an edge case not worth worrying about, but it illustrates my point) shows what happens when using the delimiter as a separator:

Input: foo,bar----baz lol
Pattern: %{+a/1},%{+a/3}----%{+a/2} %{b}
Result: foo----baz,bar

The above was tested in Logstash. The rule in use here isn't completely clear and, imo, would be difficult to explain in documentation.
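For readers trying to reason about that result, the Python sketch below implements one candidate rule that reproduces it: sort the appended pieces by their `/n` order, then prefix every piece after the first with the delimiter that preceded it in the pattern. This rule is a guess that happens to match the reported output; it has not been checked against the Logstash source, and the function name is hypothetical. The sketch also assumes every piece of an appended field uses the `%{+key/n}` form.

```python
import re

def dissect_with_delimiter_appends(pattern, text):
    """Hypothetical matcher: appends carry their preceding delimiter."""
    # Delimiters land at even indices, keys (with optional /n) at odd indices.
    parts = re.split(r"%\{(\+?\w+(?:/\d+)?)\}", pattern)
    delimiters, keys = parts[0::2], parts[1::2]

    regex = re.escape(delimiters[0])
    for delim in delimiters[1:]:
        regex += "(.*?)" + re.escape(delim)
    match = re.fullmatch(regex, text)
    if match is None:
        return None

    result, appends = {}, {}
    for i, (key, value) in enumerate(zip(keys, match.groups())):
        if key.startswith("+"):
            name, _, order = key[1:].partition("/")
            # Remember the order, the delimiter before this key, and the value.
            appends.setdefault(name, []).append((int(order or 0), delimiters[i], value))
        else:
            result[key] = value

    for name, pieces in appends.items():
        pieces.sort(key=lambda piece: piece[0])
        first_value = pieces[0][2]
        rest = "".join(delim + value for _, delim, value in pieces[1:])
        result[name] = first_value + rest
    return result

result = dissect_with_delimiter_appends(
    "%{+a/1},%{+a/3}----%{+a/2} %{b}",
    "foo,bar----baz lol",
)
print(result)
```

Under this rule `baz` drags its `----` along and `bar` drags its `,` along, which is why the delimiters appear swapped relative to the input.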

For the reasons in the above comment (not the contrived example here), I believe that the ability to specify a separator is a must for the specification.

Since the specification is the minimal requirements, implementations are free to add implementation-specific options (as long as they don't conflict with the spec). For example, Logstash can also do type conversion, which is a good feature, but not part of the spec. I feel that allowing the delimiter as the separator is also perfectly fine, but not a minimal requirement.

@ph - thoughts ?

ph commented 6 years ago

@jakelandis @guyboertje To be honest, when I created the code to handle append fields I was surprised by the behavior of the automatic separator. I thought it was a bit too magic, and that most uses could be replaced by a custom string interpolation to create the final appended field. But I understand the flexibility of being able to define this merge as part of the tokenizer string.

What if we could define the separator as part of the append field syntax instead of making it a global setting?

Something like this:

%{+a/1,}

In the above case I would expect , to prefix the value of a

jakelandis commented 6 years ago

I think the only things inside the key %{ } should be the key name and modifiers. Adding a value to insert into the value of the result as part of the key seems odd.

If we did include the append value inside the key, imo, it could get confusing. For example, how do you add ", " (comma space) inside the key? Whitespace inside the key is hard to read (looks like a typo) and hard to validate, so you would likely need to place the separator value inside a second thing, like quotes or square brackets. Without that second thing, it could be difficult to differentiate the key name from the appending string.

It also seems like an edge case where you would want different append characters across keys, meaning that I believe a global separator is sufficient.

I would argue that adding in something like quotes or square brackets would be required to express the separator inside the key, and this is possible. However, everything we add to the pattern starts to diminish the simplicity of dissect. In this case I think the external setting is preferred.

elastic / dissect-specification

Discuss Append fields #4