elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash
Other
14.21k stars 3.5k forks source link

tagging inconsistencies #6184

Open colinsurprenant opened 7 years ago

colinsurprenant commented 7 years ago

The tagging concept and the tags field usage would benefit some thoughts and a properly defined API for the future (6.0?).

Some problems have been identified in #6142 and are being solved in #6177.

There is one inconsistency issue I'd like to discuss now to see if this is something we should fix in a minor/bugfix release.

but

Should we also do deduplication in the config add_tag ? if so, are there chances this break current configurations?

Otherwise, the larger problem here is that the tags field has never been properly considered a "reserved" field with a "defined" API, like @timestamp for example. so it is totally possible for anyone to place arbitrary values in the tags field but on the other hand, a lot of plugins rely on the existence and "validity" of the tags field.

I'd like open up the discussion on the best way to address this moving forward. Is it to simply define a new @tags field with a clear API? this would require changes is configurations? code sweep across plugins would also be necessary.

talevy commented 7 years ago

I'd like to postpone any changes into the api around this (strict or lenient) to 6.0. or do you expect deduping to belong in 5.x even though it is not happening today? does anyone ever care there are two of the same tag in the list? usually people just check existence of at least one.

guyboertje commented 7 years ago

My 2 cents:

We should not de-duplicate tags, they are not a Set.

For example, if a user has add_tag options in all of the filters, by allowing repeated tags the user can see which "path" the event took through the config. So if the config filters are 'grok', 'date', 'date', 'mutate', 'mutate' and the user add_tag with the same strings - we should get 'grok','date','date','mutate','mutate' instead of 'grok','date','mutate'

On the flip side is there any utility in have 2 or more _grok_parse_failure tags (i.e. tags added in plugin code)?

Also when a user does remove_tag and there is dups of that tag - should the code only remove the last of the dups?

The docs for add_tag are silent about duplicating tags.

How about we decide (and document) that if a user specifies add_tag multiple times in the config with the same value then we honour that but a plugin dev will strive to not add duplicates (via event.tag) unless there is a need to do so?

andrewvc commented 7 years ago

I'm in favor of de-duping.

I think the find the path functionality is better served by the user specifying a specific tag per grok filter, otherwise there's no telling which one it came from.

I think that de-clutters things and is more in-line with what a tag is in most applications, that is to say a member of a set.

guyboertje commented 7 years ago

I just checked with Jordan as to whether there was any historical reason for duplicates. There are none - in fact he said tags were intended to behave as a Set.

I am OK with removing duplicates as there is no historical precedent for them.

As for:

Is it to simply define a new @tags field with a clear API

I have no input on this.

jsvd commented 7 years ago

Does that mean that an operation like event.set("tags", ["hey", "hey"]) should silently result in event.get("tags") == "hey" ? How far should we protect the event.tag("hey") when someone uses event.set ?

guyboertje commented 7 years ago

There is also a subtle problem, not currently coded for, if an input or codec builds a Hash before creating the event from it and that hash has a tags field with a Hash value. Boom if a filter tries to add a tag.