aws / aws-for-fluent-bit

The source of the amazon/aws-for-fluent-bit container image
Apache License 2.0

EC2 IMDS errors upon launch #333

Open byrneo opened 2 years ago

byrneo commented 2 years ago
### Describe the question/issue

Noticing some errors appearing when Fluent Bit launches:

```
[error] [src/flb_network.c:224 errno=9] Bad file descriptor
```

```
[error] [http_client] broken connection to 169.254.169.254:80 ?
```

```
[error] [http_client] broken connection to 169.254.169.254:80 ?
AWS for Fluent Bit Container Image Version 2.23.3[2022/04/21 09:58:51] [ Error] epoll_ctl: Bad file descriptor, errno=9 at /tmp/fluent-bit-1.8.15/lib/monkey/mk_core/mk_event_epoll.c:136
```

These errors appear a few times upon startup but don't cause the pod to crash.

### Configuration

Fluent Bit Log Output

Fluent Bit v1.8.15
* Copyright (C) 2015-2021 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/04/21 09:58:50] [ info] [engine] started (pid=1)
[2022/04/21 09:58:50] [ info] [storage] version=1.1.6, initializing...
[2022/04/21 09:58:50] [ info] [storage] root path '/var/fluent-bit/state/flb-storage/'
[2022/04/21 09:58:50] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2022/04/21 09:58:50] [ info] [storage] backlog input plugin: storage_backlog.8
[2022/04/21 09:58:50] [ info] [cmetrics] version=0.2.2
[2022/04/21 09:58:50] [ info] [input:systemd:systemd.3] seek_cursor=s=7028adf2155a4b3ca09a2a342ca71203;i=ffa... OK
[2022/04/21 09:58:50] [ info] [input:storage_backlog:storage_backlog.8] queue memory limit: 4.8M
[2022/04/21 09:58:50] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2022/04/21 09:58:50] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2022/04/21 09:58:50] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2022/04/21 09:58:50] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2022/04/21 09:58:51] [ warn] [net] io_read #119 timeout after 1 seconds from: 169.254.169.254:80
[2022/04/21 09:58:51] [error] [src/flb_network.c:224 errno=9] Bad file descriptor
[2022/04/21 09:58:51] [error] [http_client] broken connection to 169.254.169.254:80 ?
AWS for Fluent Bit Container Image Version 2.23.3[2022/04/21 09:58:51] [  Error] epoll_ctl: Bad file descriptor, errno=9 at /tmp/fluent-bit-1.8.15/lib/monkey/mk_core/mk_event_epoll.c:136
[2022/04/21 09:58:51] [ info] [imds] to use IMDSv2, set --http-put-response-limit to 2
[2022/04/21 09:58:51] [ warn] [imds] falling back on IMDSv1
[2022/04/21 09:58:52] [ warn] [net] io_read #121 timeout after 1 seconds from: 169.254.169.254:80
[2022/04/21 09:58:52] [error] [src/flb_network.c:224 errno=9] Bad file descriptor
[2022/04/21 09:58:52] [error] [http_client] broken connection to 169.254.169.254:80 ?
[2022/04/21 09:58:52] [  Error] epoll_ctl: Bad file descriptor, errno=9 at /tmp/fluent-bit-1.8.15/lib/monkey/mk_core/mk_event_epoll.c:136
[2022/04/21 09:58:52] [ info] [imds] to use IMDSv2, set --http-put-response-limit to 2
[2022/04/21 09:58:52] [ warn] [imds] falling back on IMDSv1
[2022/04/21 09:58:53] [  Error] epoll_ctl: Bad file descriptor, errno=9 at /tmp/fluent-bit-1.8.15/lib/monkey/mk_core/mk_event_epoll.c:136
[2022/04/21 09:58:53] [ warn] [net] io_read #123 timeout after 1 seconds from: 169.254.169.254:80
[2022/04/21 09:58:53] [error] [src/flb_network.c:224 errno=9] Bad file descriptor
[2022/04/21 09:58:53] [error] [http_client] broken connection to 169.254.169.254:80 ?
[2022/04/21 09:58:53] [ info] [imds] to use IMDSv2, set --http-put-response-limit to 2
[2022/04/21 09:58:53] [ warn] [imds] falling back on IMDSv1

Fluent Bit Version Info

Fluent Bit v1.8.15

AWS for Fluent Bit Container Image Version 2.23.3

Cluster Details

Application Details

Steps to reproduce issue

Related Issues

PettitWesley commented 2 years ago

Does Fluent Bit function normally and successfully send logs after startup? Do these errors only occur on startup?

[2022/04/21 09:58:51] [error] [src/flb_network.c:224 errno=9] Bad file descriptor
[2022/04/21 09:58:51] [error] [http_client] broken connection to 169.254.169.254:80 ?

Both of these errors almost certainly share the same root cause: first the core network library logs the "Bad file descriptor" message, then the HTTP client logs that the connection is broken. 169.254.169.254 is the EC2 IMDS IP. Notice the lines after this about setting a hop limit.

What's happening here is that when each AWS plugin instance is initialized, each one must initialize its credential providers. So it will go through the standard chain of AWS credential sources, including EC2 IMDS, and look for creds. This will happen for each AWS output instance. Hence, you probably got one error message per output instance. For the EC2 provider, it tries IMDS version 2 first: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html

If this fails it falls back to IMDS v1 style requests, where the auth token is omitted.
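
The IMDSv2-then-IMDSv1 sequence described above can be sketched in Python as follows. This is a minimal illustration, not Fluent Bit's actual implementation (which is written in C): the function name `fetch_metadata`, the injectable `opener` parameter, and the TTL value are assumptions made for the example. The token endpoint and the two header names are the ones documented for IMDSv2.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254"
TOKEN_TTL_SECONDS = "21600"  # illustrative TTL, chosen for this sketch


def fetch_metadata(path, opener=urllib.request.urlopen, timeout=1.0):
    """Return (body, version_used) for an IMDS path, preferring IMDSv2.

    `opener` is a hypothetical injection point so the function can be
    exercised without a real IMDS endpoint.
    """
    headers = {}
    version = "v1"
    try:
        # IMDSv2: obtain a session token with a PUT request.
        token_request = urllib.request.Request(
            IMDS_BASE + "/latest/api/token",
            method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": TOKEN_TTL_SECONDS},
        )
        with opener(token_request, timeout=timeout) as response:
            token = response.read().decode()
        headers["X-aws-ec2-metadata-token"] = token
        version = "v2"
    except OSError:
        # URLError and socket timeouts are OSError subclasses. The token
        # request failed, so fall back to an IMDSv1-style request that
        # simply omits the token header.
        pass

    data_request = urllib.request.Request(IMDS_BASE + path, headers=headers)
    with opener(data_request, timeout=timeout) as response:
        return response.read().decode(), version
```

When the token request times out, as in the logs above, a call such as `fetch_metadata("/latest/meta-data/instance-id")` retries without the token and reports `"v1"`. If IMDSv1 is disabled on the instance, that second request fails too and the error propagates to the caller.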

So I think what is happening here is expected. We wish the errors here were more clear to prevent confusion. @matthewfala Did I miss anything and can you think of any ways to improve the error messaging here?

matthewfala commented 2 years ago

That's right, @PettitWesley. We're thinking that the issue is a combination of the following:

  1. hop limit is set to 1
  2. IMDSv1 is disabled (which is good)

AWS recommends using IMDSv2. To do that, you'll need to set the hop limit to 2 or greater so that requests from within the container network can reach the IMDS endpoint: https://github.com/aws/aws-for-fluent-bit/issues/259#issuecomment-970862321

If you don't want to go through the trouble of increasing the hop limit, you can also enable IMDSv1, in which case it should be detected and used by Fluent Bit.
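
The two conditions listed above can be checked against the `MetadataOptions` block that the EC2 describe-instances output returns for an instance. The sketch below is only an illustration: the function name `imds_findings` and the `extra_network_hops` parameter are hypothetical, and it assumes that a container network adds exactly one hop between the client and IMDS and that a missing hop limit means 1.

```python
def imds_findings(metadata_options, extra_network_hops=1):
    """List likely IMDS access problems for a client behind extra hops.

    `metadata_options` is a dict shaped like the MetadataOptions block
    of an EC2 instance description. `extra_network_hops` is an
    assumption: one extra hop for a container on its own network.
    """
    findings = []

    if metadata_options.get("HttpEndpoint") != "enabled":
        findings.append("IMDS endpoint is not enabled")

    # Assumed default of 1 when the field is absent.
    hop_limit = metadata_options.get("HttpPutResponseHopLimit", 1)
    v2_reachable = hop_limit >= 1 + extra_network_hops
    if not v2_reachable:
        findings.append(
            f"hop limit {hop_limit} is too low for IMDSv2 token responses "
            f"to cross {extra_network_hops} extra hop(s)"
        )

    # With tokens required, IMDSv1 is off and there is nothing to fall back on.
    if metadata_options.get("HttpTokens") == "required" and not v2_reachable:
        findings.append("IMDSv1 is disabled, so no fallback is available")

    return findings
```

An empty list means neither condition applies under these assumptions; it does not rule out other causes such as a network policy blocking the link-local address.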

byrneo commented 2 years ago

Sorry for the late response @PettitWesley @matthewfala. Yes: Fluent Bit did indeed appear to function normally despite the startup errors.

I've made a bunch of changes in my environment since creating this issue, one of which was to use IRSA with Fluent Bit (previously I had been using an IAM instance role/profile for the EC2 host). I can't be 100% certain that made the difference, but I no longer see the errors during startup.

vkadi commented 1 year ago

@PettitWesley @matthewfala I have been struggling with IMDS-related issues. I am using the latest image, 2.31.2:

[2023/02/22 22:03:10] [error] [net] connection #44 timeout after 10 seconds to: 169.254.169.254:80
[2023/02/22 22:03:10] [error] [filter:aws:aws.0] connection initialization error
[2023/02/22 22:03:10] [error] [filter:aws:aws.0] Could not retrieve ec2 metadata from IMDS
[0] dummy: [1677103380.297254617, {"message"=>"dummy"}]

This is what I have in the ConfigMap:


[INPUT]
    Name dummy
    Tag dummy

[FILTER]
    Name aws
    Match *
    imds_version v2
    az true
    ec2_instance_id true
    ec2_instance_type true
    private_ip true
    ami_id true
    account_id true
    hostname true
    vpc_id true

[OUTPUT]
    Name stdout
    Match *

I tried changing the hop limit to 2. Here is a snippet from the EC2 describe output:

    "MetadataOptions": {
        "State": "applied",
        "HttpTokens": "optional",  --> tried even with required
        "HttpPutResponseHopLimit": 2,
        "HttpEndpoint": "enabled",
        "HttpProtocolIpv6": "disabled",
        "InstanceMetadataTags": "disabled"
    }

I am trying to use this metadata plugin to enrich the logs with the instance_id specifically. Is there something I am missing? What needs to be set on the EC2 side to get https://docs.fluentbit.io/manual/pipeline/filters/aws-metadata to work?

PettitWesley commented 1 year ago

@vkadi that should work... what network setup are your containers running in? Can you try ssh/kubectl exec into the pod and see if you can reach IMDS via curl?

vkadi commented 1 year ago

@PettitWesley I am running this on an EKS cluster, and from pods I am not able to access the metadata:

bash-4.2# curl http://169.254.169.254/latest/meta-data/
curl: (28) Failed to connect to 169.254.169.254 port 80 after 129614 ms: Couldn't connect to server

PettitWesley commented 1 year ago

@vkadi then something about your network configuration is blocking access. I am not sure what. I know there are some CNI plugins that will block link local IP addresses from pods, which would block IMDS.
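
Where curl is not available in the container image, the same reachability check can be approximated with a plain TCP connect. This is a hedged sketch: the function name `can_connect` and its defaults are assumptions for illustration, and a successful connect only shows that the link-local address is routable from the pod, not that IMDS will return data.

```python
import socket


def can_connect(host="169.254.169.254", port=80, timeout=2.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False
```

If `can_connect()` returns `False` from inside the pod but `True` on the node itself, the pod network (for example the CNI plugin or a network policy) is the likely place to look.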

vkadi commented 1 year ago

@PettitWesley By enabling "hostNetwork: true" I was able to access IMDS from the Fluent Bit pod, as mentioned in this doc: https://docs.fluentbit.io/manual/pipeline/filters/kubernetes