fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0
5.73k stars 1.56k forks source link

Ability to add more specific pod annotations #777

Closed yann-soubeyrand closed 5 years ago

yann-soubeyrand commented 5 years ago

Is your feature request related to a problem? Please describe. When making use of the Kubernetes filter plugin, one can use a pod's annotation to specify a parser which Fluent Bit could use to parse this pod's logs. One can also use a pod's annotation to exclude this pod's logs. However, a pod can be composed of several containers each of which can use different log formats. Moreover, the containers can log to their stdout and stderr streams which can also use different log formats.

Describe the solution you'd like To solve this problem, Fluent Bit could support more specific pod annotations. For example, a pod could specify the following annotations:

fluentbit.io/parser/<container_name>/<(stdout|stderr)>
fluentbit.io/exclude/<container_name>/<(stdout|stderr)>

The order of precedence for the pod's annotations could be (from the highest precedence to the lowest):

fluentbit.io/parser/<container_name>/<(stdout|stderr)>
fluentbit.io/parser/<container_name>
fluentbit.io/parser/<(stdout|stderr)>
fluentbit.io/parser

Describe alternatives you've considered As a workaround, I send raw logs to Fluentd and do static parsing there (static meaning without making use of pod's annotations).

Additional context What I'm trying to accomplish is parsing and filtering Kubernetes logs with Fluent Bit prior to sending it to Elasticsearch or Fluentd. See https://github.com/fluent/fluent-bit/issues/776 for more context.

donbowman commented 5 years ago

+1. This occurs when using istio. It injects a sidecar that uses JSON format logging, where the 'main' container does not.

Kubernetes allows a single slash in an annotation, so the above suggested format is not legal.

a qualified name must consist of alphanumeric characters, '-', '_' or '.', 
and must start and end with an alphanumeric character (e.g. 'MyName', 
or 'my.name',  or '123-abc', regex used for validation is 
'([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]') with an optional 
DNS subdomain prefix and '/' (e.g. 'example.com/MyName')

So it could use a dash if we dropped the stdout/stderr: fluentbit.io/parser-container

I'm not sure how you'd get the stdout/stderr since the log file might be merged.

donbowman commented 5 years ago

for comment and test, https://github.com/fluent/fluent-bit/pull/789

I made this bit fluentbit.io/parser[-container]: parsername.

yann-soubeyrand commented 5 years ago

In my opinion, being able to specify a different parser for stdout and stderr is a must: Apache access logs and error logs are fundamentally different for example. If we cannot use '/' as a delimiter, maybe we can use '_' as it is not allowed in container names if I remember correctly.

donbowman commented 5 years ago

can't use _, its part of the current parsing strategy.

i could make it super complex like:

fluentbit.io/parser: { container: { stdout_parser: x, stderr_parser: y }

or i can add another field on the end:

fluentbit.io/parser[-container][-stream]

i guess (?)

other suggestions?

it would be nicer if the parser had the logic (like grok) so that if (start regex) then (end-regex1 else end-regex2). because the reality is the entire stdout stream is not in the same format either (nor is stderr).

On Fri, 21 Sep 2018 at 16:23, Yann Soubeyrand notifications@github.com wrote:

In my opinion, being able to specify a different parser for stdout and stderr is a must: Apache access logs and error logs are fundamentally different for example. If we cannot use '/' as a delimiter, maybe we can use '_' as it is not allowed in container names if I remember correctly.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/fluent/fluent-bit/issues/777#issuecomment-423660293, or mute the thread https://github.com/notifications/unsubscribe-auth/AE5OkybMVHSHoJtQerRGIjvJOaB2NwEJks5udUrbgaJpZM4WwoIH .

donbowman commented 5 years ago

https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture/identifiers.md

shows the rules.

I guess we could do fluentbit.io/parser[-stdout|stderr][-container] @ the cost of an ambiguity if there is a container called stdout or stderr in a pod.

edsiper commented 5 years ago

it's bad that we are restricted in the format by k8s spec...

I just did a quick look but what about:

fluentbit.io/parser.<stream>.<container>

where

edsiper commented 5 years ago

@bigkraig @homer6 @StevenACoffman, would you please review this proposal and provide some feedback from an usability point of view ?

donbowman commented 5 years ago

I don't think * is allowed. If I try to use '*' I get this:

The Pod "user-db-7dcc9649dc-b72rs" is invalid: metadata.annotations: Invalid value: "foo.bar.*.x": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')

So I can use all instead of *.

So fluentbit.io/parser.<all|stdout|stderr>.container

edsiper commented 5 years ago

@donbowman good point, yeah, 'all' sounds good

bigkraig commented 5 years ago

I went a similar path recently in the ALB Ingress controller. Should the container name come before the stream? It should go from least to most specific and I think having the option of not providing a stream makes sense.

I don’t know if you would configure parsers on things other than containers in an annotation, but if so you may want to add the ‘container’ word as an identifier as well

bigkraig commented 5 years ago

fluentbit.io/parsers.containers.CONTAINERNAME

fluentbit.io/parsers.containers.CONTAINERNAME.stdout

Sorry for the lack of quotes. I’m on a phone on my way to a soccer game 😃

donbowman commented 5 years ago

i don't think it makes sense to have a log parser on anything other than the container-level or pod-level (what else would print logs? logs are inherently per container).

so it sounds like fluentbit.io/parsers[.CONTAINERNAME][.STREAM], with the restriction that if you want per-stream you must also specify per-container.

On Sat, 22 Sep 2018 at 13:37, Kraig Amador notifications@github.com wrote:

fluentbit.io/parsers.containers.CONTAINERNAME

fluentbit.io/parsers.containers.CONTAINERNAME.stdout

Sorry for the lack of quotes. I’m on a phone on my way to a soccer game 😃

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/fluent/fluent-bit/issues/777#issuecomment-423760759, or mute the thread https://github.com/notifications/unsubscribe-auth/AE5OkzWwQZCUtP2DYGk6IG4cOV5xQtyfks5udnVmgaJpZM4WwoIH .

donbowman commented 5 years ago

Getting the stream is not that simple. flb_filter_do() calls the filters, but the msgpack there has only 2 entries(time, log). It no longer has access to the raw docker log (which had stream: stdout in it).

I tried adding Decode_Field_As escaped stream to the docker parser, but we still only get the 2.

Still looking.

Globegitter commented 5 years ago

Just want to add that being able to differentiate between stdout/stderr is also very important for nginx logs as they differentiate there fundamentally and there is no way to control the error logs of nginx.

donbowman commented 5 years ago

OK I have committed this change to the PR https://github.com/fluent/fluent-bit/pull/789

this supports:

fluentbit.io/parser[.container][.stream]

as to the early comment about nginx, this will work for nginx in kubernetes.

donbowman commented 5 years ago

there's a side-affect of this. Elasticsearch does not like a mix - match of fluentbit.io/parser: foo fluentbit.io/parser.bob: foo

It ends up wanting, in the 2nd case, to parse that path.

curl -s 'elasticsearch.logging:9200/logstash-2018.09.24/_mapping/flb_type/field/kubernetes.annotations.fluent*' | python3 -m json.tool
{
    "logstash-2018.09.24": {
        "mappings": {
            "flb_type": {
                "kubernetes.annotations.fluentbit.io/parser.keyword": {
                    "full_name": "kubernetes.annotations.fluentbit.io/parser.keyword",
                    "mapping": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "kubernetes.annotations.fluentbit.io/parser": {
                    "full_name": "kubernetes.annotations.fluentbit.io/parser",
                    "mapping": {
                        "io/parser": {
                            "type": "text",
                            "fields": {
                                "keyword": {
                                    "type": "keyword",
                                    "ignore_above": 256
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
edsiper commented 5 years ago

I don't think that's an issue since labels are a better selector for such cases

On Tue, Sep 25, 2018, 07:05 Don Bowman notifications@github.com wrote:

there's a side-affect of this. Elasticsearch does not like a mix - match of fluentbit.io/parser: foo fluentbit.io/parser.bob: foo

It ends up wanting, in the 2nd case, to parse that path.

curl -s 'elasticsearch.logging:9200/logstash-2018.09.24/_mapping/flb_type/field/kubernetes.annotations.fluent*' | python3 -m json.tool { "logstash-2018.09.24": { "mappings": { "flb_type": { "kubernetes.annotations.fluentbit.io/parser.keyword": { "full_name": "kubernetes.annotations.fluentbit.io/parser.keyword", "mapping": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "kubernetes.annotations.fluentbit.io/parser": { "full_name": "kubernetes.annotations.fluentbit.io/parser", "mapping": { "io/parser": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } } } } } } } }

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/fluent/fluent-bit/issues/777#issuecomment-424334811, or mute the thread https://github.com/notifications/unsubscribe-auth/AAWkNi8bG8tTg5Qv7vkoPXYIccwrU_Teks5ueiovgaJpZM4WwoIH .

donbowman commented 5 years ago

its absolutely an issue. elastic stops working. it confuses it that a field is sometimes a scalar and sometimes an object.

you see the mapping: io/parser above?

So i think we cannot use a 'dot' as a delimiter in this way.

this is a more general issue, e.g. if anyone (other than us) has a field which is somtimes A, sometimes A.B, it will lock up. fluentbit keeps trying to log it, elastic keeps returning an error.

{"took":604,"errors":true,"items":[{"index":{"_index":"logstash-2018.09.25","_type":"flb_type","_id":"CbqPDmYBf8jg_1jQcB1A","status":400,"error":{"type":"mapper_parsing_exception","reason":"object mapping for [kubernetes.annotations.fluentbit.io/parser] tried to parse field [fluentbit.io/parser] as object, but found a concrete value"}}},{"index":{"_index":"logstash-2018.09.25","_type":"flb_type","_id":"CrqPDmYBf8jg_1jQcB1A","status":400,"error":{"type":"mapper_parsing_exception","reason":"object mapping for [kubernetes.annotations.fluentbit.io/parser] tried to parse field [fluentbit.io/parser] as object, but found a concrete value"}}},{"index":{"_index":"logstash-2018.09.25","_type":"flb_type","_id":"C7qPDmYBf8jg_1jQcB1A","status":400,"error":{"type":"mapper_parsing_exception","reason":"object mapping for [kubernetes.annotations.fluentbit.io/parser] tried to parse field [fluentbit.io/parser] as object, but found a concrete value"}}},{"index
donbowman commented 5 years ago

Apologies for the image, but i'm lazy :)

this shows the issue, the fact the annotation is sometimes a scalar, sometimes an object.

image

donbowman commented 5 years ago

A work-around is:

    [FILTER]
        Name record_modifier
        Match *
        Remove_key kubernetes.annotations.fluentbit.io/parser

so, what is the suggested solution?

One solution is:

fluentbit.io/stdout-parser[-containername] fluentbit.io/stderr-parser[-containername] fluentbit.io/parser: name

Other suggestions?

StevenACoffman commented 5 years ago

Did you mean to order them by specificity:

I have no strong opinion on delimiter choice, but is the convention to mimic urls and use slashes? For instance:

donbowman commented 5 years ago

only a single / is allowed in an annotation in Kubernetes.

the first character of a pod name, container name must be alpha.

there is an ambiguity w/ putting the stdout/stderr on the end, e.g. I can have a container called 'don.stdout', its valid.

so fluentbit.io/parser[.|-|_][container] is ok fluentbit.io/parser[.|-|_][container][.|-|_][stderr|stdout] is ambiguous.

Also, elasticsearch is underwhelmed by having a 0 or more dots.

Thus i was thinking to put the stream name as part of the tag (fluentbit.io/stderr-parser) instead.

comments?

a qualified name must consist of alphanumeric characters, '-', '_' or '.', 
and must start and end with an alphanumeric character (e.g. 'MyName', 
or 'my.name',  or '123-abc', regex used for validation is 
'([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]') with an optional 
DNS subdomain prefix and '/' (e.g. 'example.com/MyName')
StevenACoffman commented 5 years ago

Are you suggesting supporting a use case where container name is left empty, but a stream is specified? I had assumed that container name was required, but not that both stream and container were independently optional.

Not trying to be difficult, but my reading from Label, selector, and annotation conventions

For instance, prefer service-account.kubernetes.io/name over kubernetes.io/service-account.name They appear to be recommending:

  • [stream.][container.]fluentbit.io/parser

However, if you have a container named stdout.don then that seems problematically ambiguous.

Would it be unambiguous to have something like:

donbowman commented 5 years ago

to keep backwards compatibility, we must keep: fluentbit.io/parser: parsername

We want to add container-only, and container+stream only. I'm not sure stream-only is needed.

would folks be happy w/ the container or stream on the left? it incrases the cost of parsing a bit (since its unanchored).

istio uses a yaml expression on the right:

kubernetes.annotations.sidecar.istio.io/status: {\"version\":\"2e0c897425ef3bd2729ec5f9aead7c0566c10ab326454e8e9e2b451404aee9a5\",\"initContainers\":[\"istio-init\"],\"containers\":[\"istio-proxy\"],\"volumes\":[\"istio-envoy\",\"istio-certs\"],\"imagePullSecrets\":null}
StevenACoffman commented 5 years ago

By the way I agree fluentbit.io/stderr-parser[.|-|_][container] also has the advantage of being unambiguous. The ergonomics of identifier placement on the left or the right are probably about the same, so then if it's more efficient, then:

I appreciate the meticulous thought and care you have spent on this. Thank you for explaining your reasoning.

donbowman commented 5 years ago

ok i will make this change and try it.

yann-soubeyrand commented 5 years ago

I understand that specifying only stream and not container add parsing difficulty but in my opinion it is the principal use case. It seems to me that a lot of pods consist only of one container logging with different format on stdout and stderr (think of web servers and web applications which output access logs on stdout and error logs on stderr).

Concerning the characters allowed in container names, the dot seems to be forbidden (kubectl refuses it) but a notable exception is kube-proxy or kube-dns (I don't remember which one) on a Kops cluster (I don't even know how they can be created).

As a matter of taste, I prefer having the container and stream at the right of the annotation and in specificity order. '.' and '-' being problematic, I would vote for '_' as the delimiter:

fluentbit.io/parser_<container>_<stream>
fluentbit.io/parser_<container>
fluentbit.io/parser_<stream>
fluentbit.io/parser

Is there a problem with '_'?

edsiper commented 5 years ago

I would suggest that in filter_kubernetes add an option to enable/disable the inclusion of Fluent Bit annotations, e.g:

[FILTER]
    Name kubernetes
    Include_Annotation  Off

default is On, so when using elasticsearch output just make sure the configmaps we provide that the flag turned off.

donbowman commented 5 years ago

OK change is made and committed. https://github.com/fluent/fluent-bit/pull/789

donbowman commented 5 years ago

I understand that specifying only stream and not container add parsing difficulty but in my opinion it is the principal use case. It seems to me that a lot of pods consist only of one container logging with different format on stdout and stderr (think of web servers and web applications which output access logs on stdout and error logs on stderr).

What I have implemented allows: a) only parser (e.g. fluentbit.io/parser: XXX) b) only stream (e.g. fluentbit.io/stderr-parser: XXX) c) only parser per container (e.g. fluentbit.io/parser-CONTAINER: XXX) d) parser per stream per container (e.g. fluentbit.io/stdout-parser-CONTAINER: XXXX)

I cannot put the stream @ the end due to ambiguity. A container is guaranteed to not have a - as the first character. No such guarantee on the last character.

StevenACoffman commented 5 years ago

I think @yann-soubeyrand meant that if you pick _ as the delimiter, it would allow you to unambiguously reorder container and stream in order of specificity, since container names can only match [a-z0-9]([-a-z0-9]*[a-z0-9])?. Even if your container name is don-stdout, these would be unambiguous:

I'm not sure there's any problem with that, nor with the current:

Questions:

yann-soubeyrand commented 5 years ago

I think @yann-soubeyrand meant that if you pick _ as the delimiter, it would allow you to unambiguously reorder container and stream in order of specificity, since container names can only match [a-z0-9]([-a-z0-9]*[a-z0-9])?.

Yes, that's what I meant ;-)

Questions:

* Are there severe performance impacts of one or the other?

I don't know what's the performance impact of each solution.

* Is one very confusing for end users?

I don't think so, but I find that specificity order is more intuitive.

* Or is it just personal preference?

By the way, it's a matter of taste ;-)

xanonid commented 5 years ago

@donbowman A side note about the ElasticSearch problem with changing tags: The problems could be solved by using appropriate dynamic mapping settings, c.f. https://github.com/uken/fluent-plugin-elasticsearch/issues/452

donbowman commented 5 years ago

so before i make any more changes, let me ask 2 questions:

  1. is what i've done not acceptable? e.g. is it ok as-is?
  2. if i make this last suggested change and use _, are we now happy and won't complain any more?

its a pain to do since it affects the docs and the docs are in a separate repo for some reason so i have to do 2 changes to 2 reports and submit 2 calibrated pull requests.

edsiper commented 5 years ago

Hi Don,

I will do a deep.review this weekend, modifying docs is not an issue..

On Fri, Sep 28, 2018, 13:25 Don Bowman notifications@github.com wrote:

so before i make any more changes, let me ask 2 questions:

  1. is what i've done not acceptable? e.g. is it ok as-is?
  2. if i make this last suggested change and use _, are we now happy and won't complain any more?

its a pain to do since it affects the docs and the docs are in a separate repo for some reason so i have to do 2 changes to 2 reports and submit 2 calibrated pull requests.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/fluent/fluent-bit/issues/777#issuecomment-425508056, or mute the thread https://github.com/notifications/unsubscribe-auth/AAWkNp_dy6pmq_iCVLV2ENITRc1wUynsks5ufluFgaJpZM4WwoIH .

donbowman commented 5 years ago

is there any consensus on what i should do here? change? leave it?

edsiper commented 5 years ago

@donbowman the latest changes you did, which pattern of configuration is requiring ?

donbowman commented 5 years ago

As committed right now it is:

as options. The latter sets the proposed parser for both streams. Container is optional.

edsiper commented 5 years ago

[please do not make changes in the code to avoid extra work until we have consensus]

what about:

having all prefixed with parser avoid confusion in the future if we add more annotations.

comments ?

donbowman commented 5 years ago

does not work. A container could be called 'stderr' or 'stderr-foo'.

a pod and a container name can be any valid host name + a few chars. The rules are above. But effectively the first char cannot be anything other than alphanum, the rest can include other chars like -.

I believe the _ proposal mooted above also can work. Or a small run time hit and make the value be a json field (as in istio) and match within it.

edsiper commented 5 years ago

what about if we restrict that to make this work the container cannot be prefixed with stderr or stdout ?

donbowman commented 5 years ago

why? the current method works for all container names without restriction. I think the _ would as well.

Kubernetes has a bit of an opinion on this internal to it. Its log-file format:

shipping-notify-64db5dd76b-r9mb8_sock-shop_istio-init-d22b9937e84aca64ab211756e2f7e86c1150ebfb13f1f30a4cfe77abb91e1473.log

is effectively

__-.log since we don't need namespace or pod (its implicit in the meta data already), we can use the pattern of _ if you want to be consistent. its disallowed as a first-char in a container name.
edsiper commented 5 years ago

do you mean like this ?:

donbowman commented 5 years ago

or all _ should work too.

i don't think the 2nd last can be done. its again ambiguous. i think just: fluentbit.io/parser_stderr[_container] fluentbit.io/parser_stdout[_container] fluentbit.io/parser

if you want the 2nd last, i've currently got it set so that if stderr parser is not set it uses stdout, but we could create an 'all' stream:

fluentbit.io/parser_all_container.

i'll ask again, the current implementation, that's not ok? its a lot of semantic discussion over the order of where 'stdout' goes :)

donbowman commented 5 years ago

ok, from slack, proposal (final unless someone complains):

the first case is the current case, so no change. 2nd/3rd case the tag fluent-bit is picking up is "parser_std??", followed by optional container name. Last case is 'both streams' case

or, simplified:

There is no ambiguity since a container cannot start w/ _ nor - and since we are using parser[_stream], not parser[-stream]

comments please? I would like to finish this today.

yann-soubeyrand commented 5 years ago

Sorry to be insistent, but to me, it seems more logical to order annotation parts by specificity and to use _ everywhere to be consistent:

even if it forbids having a container named stderr or stdout (wouldn't it be weird anyway?).

donbowman commented 5 years ago

this would mean a container name starting w/ stderr/stdout would be treated incorrectly.

yann-soubeyrand commented 5 years ago

No, just two container names would be problematic: stdout and stderr. is forbidden in container names (to my knowledge). If a container name could contain , it would be problematic for the log filename nomenclature ;-)

donbowman commented 5 years ago

so why break this? the argument is that fluent-bit has 3 keys:

and then the value is separated by a -

why should we make a restriction on container names that is trivially worked around by using the -?

if that's the case, we can use what i currently have implemented, the stdout-parser, stderr-parser instead.

edsiper commented 5 years ago

@yann-soubeyrand

the thinking about the nomenclature is:

Fluent Bit annotation will have 4 components:

so config key metadata follows after a -, e.g: fluentbit.io/parser_stderr-mymetadata