Credentials are not retrieved from AWS IMDSv2 when running on EC2

SamuelDudley commented 3 years ago

Bug Report

Describe the bug

Fluent Bit does not retrieve credentials from the AWS Instance Metadata Service v2 (IMDSv2) when running on EC2. This causes plugins that require credentials (e.g. cloudwatch) to fail.

To Reproduce

Steps to reproduce the problem:

- Create an EC2 instance with metadata version 2 only selected in the Advanced Details section of the Configure Instance step. NB: I used Amazon Linux 2 AMI (HVM), SSD Volume Type - ami-09f765d333a8ebb4b (64-bit x86) in this example.
- As I will be using the cloudwatch output to demonstrate this issue, I assigned a very loose role to the instance.
- I created and assigned a fully open security group to rule that out as a potential issue.
- Install Fluent Bit as per https://docs.fluentbit.io/manual/installation/linux/amazon-linux

Apply the following configuration:


[SERVICE]
    flush        1
    daemon       Off
    log_level    info
    parsers_file parsers.conf
    plugins_file plugins.conf
    http_server  Off
    http_listen  0.0.0.0
    http_port    2020
    storage.metrics on

[INPUT]
    Name systemd
    Path /var/log/journal
    Buffer_Chunk_Size 32000
    Buffer_Max_Size 64000

[OUTPUT]
    Name cloudwatch_logs
    Match *
    region ap-southeast-2
    log_group_name testing
    log_stream_name bazz
    auto_create_group true



- Restart the service: `sudo service td-agent-bit restart`

### Expected behaviour
Expected fluent bit to obtain temporary credentials from the instance metadata service and forward the logs to cloudwatch.

### Observed behaviour
Fluent Bit fails to obtain credentials, so the CloudWatch stream is not created and no logs are sent.

### Your Environment
* Version used: Fluent bit v1.6
* Configuration: (see above)
* Environment name and version (e.g. Kubernetes? What version?): N/A
* Server type and version: AWS EC2 (t2.micro) IMDSv2 enabled and IMDSv1 disabled
* Operating System and version: Amazon Linux 2 (AMI: ami-09f765d333a8ebb4b)
* ~~Filters~~ and plugins: `cloudwatch` (output) `systemd` (input)

### Additional context
Firstly, thank you for this great bit of software 👍

In an AWS environment, disabling IMDSv1 is considered best security practice due to the security vulnerability that it creates. We would like to follow this recommendation but currently can't because of the issue described above.

I note that the [AWS Metadata filter](https://docs.fluentbit.io/manual/pipeline/filters/aws-metadata) has an option that lets a user select between IMDSv1 and v2. It appears that the code to retrieve the token and pass it in the metadata request header, as required by IMDSv2, [is already implemented in the codebase](https://github.com/fluent/fluent-bit/blob/master/plugins/filter_aws/aws.c#L114) but is _not_ used [for obtaining credentials](https://github.com/fluent/fluent-bit/blob/master/src/aws/flb_aws_credentials_ec2.c#L255).

_NB:_ The above configuration works fine and without issue when IMDSv1 is enabled on the EC2 instance.
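
For reference, the IMDSv2 flow described above (a PUT to obtain a session token, then the token passed as a header on the metadata request) can be sketched roughly as follows. This is a minimal Python illustration, not Fluent Bit's C implementation. The helper names are hypothetical, and the credentials path `/latest/meta-data/iam/security-credentials/<role>` is assumed to be the usual IMDS location for instance role credentials.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254"
TOKEN_TTL_HEADER = "X-aws-ec2-metadata-token-ttl-seconds"
TOKEN_HEADER = "X-aws-ec2-metadata-token"


def build_token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """Step 1: PUT /latest/api/token to open an IMDSv2 session."""
    return urllib.request.Request(
        f"{IMDS_BASE}/latest/api/token",
        method="PUT",
        headers={TOKEN_TTL_HEADER: str(ttl_seconds)},
    )


def build_credentials_request(token: str, role_name: str) -> urllib.request.Request:
    """Step 2: GET the role credentials, presenting the session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/latest/meta-data/iam/security-credentials/{role_name}",
        method="GET",
        headers={TOKEN_HEADER: token},
    )


def fetch_credentials_json(role_name: str, timeout: float = 2.0) -> str:
    """Run both steps against a live IMDS endpoint (only works on EC2)."""
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    request = build_credentials_request(token, role_name)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode()
```

With IMDSv1 disabled, a credentials request that skips step 1 and omits the token header is rejected, which matches the behaviour reported here.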

zandernelson commented 3 years ago

This exact same issue is affecting us with our Fluent Bit Kubernetes daemonset. We are using IMDSv2 on our EKS nodes and Fluent Bit is unable to communicate with our Elasticsearch cluster. As a result, we have to turn off the AWS_Auth parameter.

This should be a high priority as this is a security risk for many users.

LukaszRacon commented 3 years ago

Check if you are affected by the hop limit and, if so, increase it to 2: `aws ec2 modify-instance-metadata-options --instance-id i-00000000000 --http-put-response-hop-limit 2`

https://aws.amazon.com/about-aws/whats-new/2020/08/amazon-eks-supports-ec2-instance-metadata-service-v2/

IMDSv2 requires a PUT request to initiate a session to the instance metadata service and retrieve a token. By default, the response to PUT requests has a response hop limit (time to live) of 1 at the IP protocol level. However, this limit is incompatible with containerized applications on Kubernetes that run in a separate network namespace from the instance.
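
The effect of the hop limit can be illustrated with a deliberately simplified model. It assumes that a process on the instance itself is zero forwarding hops away from IMDS and that a container in its own network namespace adds one. The function name and the hop counts are illustrative assumptions, not AWS's implementation.

```python
def response_reaches_client(hop_limit: int, forwarding_hops: int) -> bool:
    """Simplified TTL model: each forwarding hop consumes one unit of the limit.

    The token response is delivered only if at least one unit is left
    when it arrives at the client.
    """
    return hop_limit - forwarding_hops >= 1


# Default limit of 1: fine on the host, dropped on the way into a container.
print(response_reaches_client(hop_limit=1, forwarding_hops=0))
print(response_reaches_client(hop_limit=1, forwarding_hops=1))
# Limit raised to 2: the container now receives the token response.
print(response_reaches_client(hop_limit=2, forwarding_hops=1))
```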

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

SamuelDudley commented 3 years ago

Still an issue, nothing to do with the hop limit. The code to handle IMDSv2 simply is not used for obtaining credentials.

SamuelDudley commented 3 years ago

Commenting to keep this issue alive as I can't edit / remove labels.

smithdebug commented 3 years ago

Hi, I am trying to run Fluent Bit on Windows Server 2016, and the CloudWatch plugins seem unable to authenticate using the instance profile.

agup006 commented 3 years ago

Can we try this with 1.7.x and see if it still reproduces?

github-actions[bot] commented 3 years ago

This issue was closed because it has been stalled for 5 days with no activity.

PettitWesley commented 3 years ago

Sorry folks. This is a feature gap which I had meant to address late last year, but I lost track of it among too many other higher-priority feature requests and bugs.

We will get someone to work on this soon.

shalevutnik commented 3 years ago

This issue is preventing using the S3 output plugin. The current workaround to use IMDSv1 is a security breach. Do you have an ETA to have it resolved?

PettitWesley commented 3 years ago

@shalevutnik Yes, I understand this is very important, but I am stretched very thin lately. Unfortunately I can't promise an exact ETA yet, but I have gotten someone from my team assigned to start work on this soon.

matthewfala commented 3 years ago

Hi :wave: I am currently working on adding IMDSv2 support to AWS Fluent Bit plugins. Thank you for your patience. I will update you on the progress of this feature.

matthewfala commented 3 years ago

Just an update on the progress of IMDSv2 support. The unit tests and code have been written, and the tests are passing. We're going through a couple code reviews. Feel free to take a look at the PR https://github.com/fluent/fluent-bit/pull/4086. Releasing the code may take ~2 weeks. Thank you for your patience and input on the importance of this feature.

aqubo commented 3 years ago

Thank you for your help. Will IMDSv2 support come in a new Fluent Bit image release, or can we upgrade from IMDSv1 to IMDSv2?

matthewfala commented 3 years ago

@aqubo, you're welcome, and thanks for checking back. We are expecting to include IMDSv2 support in the next Fluent Bit release, 1.8.8. Will keep you updated.

ypicard commented 3 years ago

I have the same error here.

SamuelDudley commented 3 years ago

> I have the same error here.

Hi, try using the version quoted in this comment: https://github.com/aws/aws-for-fluent-bit/issues/207#issuecomment-943694457

matthewfala commented 3 years ago

Thank you @SamuelDudley . IMDSv2 support is added in Fluent Bit version 1.8.8 and aws-for-fluent-bit v2.21.0. Please see the issue link Samuel copied: https://github.com/aws/aws-for-fluent-bit/issues/207#issuecomment-943694457

kdalporto commented 3 years ago

@PettitWesley Hi, I'm currently running into this same issue with 1.8.8. I've read through the various threads, but haven't had luck getting IMDS authentication to work. Does anything pop out with the below configuration that may be an issue?

{
    "State": "applied",
    "HttpTokens": "optional",
    "HttpPutResponseHopLimit": 2,
    "HttpEndpoint": "enabled",
    "HttpProtocolIpv6": "disabled"
}

^I've tried with 'HttpTokens: required' as well

[OUTPUT]
    Name s3
    Match *
    bucket xxxxxxxxx
    region us-gov-west-1
    use_put_object true
    total_file_size 1M
    upload_timeout 1m

[2021/11/10 17:02:10] [debug] [upstream] KA connection #180 to s3.us-gov-west-1.amazonaws.com:443 has been assigned (recycled)
[2021/11/10 17:02:10] [debug] [http_client] not using http_proxy for header
[2021/11/10 17:02:10] [debug] [aws_credentials] Requesting credentials from the env provider..
[2021/11/10 17:02:10] [debug] [aws_credentials] Retrieving credentials for AWS Profile default
[2021/11/10 17:02:10] [debug] [aws_credentials] Reading shared config file.
[2021/11/10 17:02:10] [debug] [aws_credentials] Shared config file /fluent-bit/.aws/config does not exist
[2021/11/10 17:02:10] [debug] [aws_credentials] Reading shared credentials file.
[2021/11/10 17:02:10] [error] [aws_credentials] Shared credentials file /fluent-bit/.aws/credentials does not exist
[2021/11/10 17:02:10] [error] [aws_credentials] Failed to retrieve credentials for AWS Profile default
[2021/11/10 17:02:10] [debug] [aws_credentials] Requesting credentials from the EC2 provider..
[2021/11/10 17:02:10] [debug] [aws_credentials] requesting credentials from EC2 IMDS
[2021/11/10 17:02:10] [debug] [upstream] KA connection #178 to 169.254.169.254:80 has been assigned (recycled)
[2021/11/10 17:02:10] [debug] [http_client] not using http_proxy for header
[2021/11/10 17:02:20] [debug] [aws_client] (null): http_do=0, HTTP Status: 503
[2021/11/10 17:02:20] [debug] [upstream] KA connection #178 to 169.254.169.254:80 is now available
[2021/11/10 17:02:20] [ warn] [imds] unable to evaluate IMDS version
[2021/11/10 17:02:20] [ warn] [aws_credentials] No cached credentials are available and a credential refresh is already in progress. The current co-routine will retry.
[2021/11/10 17:02:20] [error] [signv4] Provider returned no credentials, service=s3
[2021/11/10 17:02:20] [error] [aws_client] could not sign request
[2021/11/10 17:02:20] [debug] [upstream] KA connection #180 to s3.us-gov-west-1.amazonaws.com:443 is now available
[2021/11/10 17:02:20] [error] [output:s3:s3.2] PutObject request failed
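
When debug logging is on, the interesting lines are easy to lose among the connection chatter. A small helper along these lines can pull out just the warnings and errors. This is a sketch in Python that assumes the `[timestamp] [level] [component] message` layout seen in the log above; the names `LogRecord`, `parse_line` and `problems` are made up for the example.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Matches lines such as:
# [2021/11/10 17:02:20] [ warn] [imds] unable to evaluate IMDS version
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+"
    r"\[\s*(?P<level>\w+)\]\s+"
    r"\[(?P<component>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)


class LogRecord(NamedTuple):
    timestamp: str
    level: str
    component: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Split one log line into its parts, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return LogRecord(**match.groupdict())


def problems(lines: Iterable[str]) -> List[LogRecord]:
    """Keep only the warn and error records, in their original order."""
    records = (parse_line(line) for line in lines)
    return [r for r in records if r is not None and r.level in {"warn", "error"}]
```

Run over the log above, the `aws_client` line reporting `HTTP Status: 503` is at debug level and would be filtered out, so it is worth checking the debug lines just before the first warning as well.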

ypicard commented 3 years ago

Are you using kube2iam ?

kdalporto commented 3 years ago

> Are you using kube2iam ?

No, for the time being I'm trying to use the IAM role that is attached to the instance it's deployed on. I recall seeing your post in another thread. Have you gotten kube2iam to work with the 1.8.8 image?

ypicard commented 3 years ago

Yes. I had to fiddle around with the available versions and ended up with the following config to manually choose the deployed docker image:

repositories:
  ...
  - name: kube2iam
    url: https://jtblin.github.io/kube2iam/

releases:
  - name: kube2iam
    namespace: kube-system
    chart: kube2iam/kube2iam
    version: 2.6.0
    values:
      - image:
          tag: 0.10.11
        ...

PettitWesley commented 3 years ago

@matthewfala Can you help here

matthewfala commented 3 years ago

Hi @kdalporto. This is not the hop limit issue any more: you have the hop limit correctly set to 2, and it looks from your error logs like Fluent Bit is not having that problem. It seems like IMDS may be unreachable. Is it possible for you to try to curl 169.254.169.254 on your instance?

curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"

This should return a token.

kdalporto commented 3 years ago

@matthewfala yes that returns a ~56 character token when running on the node instance where fluent-bit is running. I'm also able to manually upload objects to the destination bucket via the CLI. I currently have HttpTokens set to required.

matthewfala commented 3 years ago

That's strange. The error message `[imds] unable to evaluate IMDS version` should only come up if the following request does not complete.

The following curl should return a status code of 401, which indicates IMDSv2 availability:

curl -H "X-aws-ec2-metadata-token: INVALID" -v http://169.254.169.254/

It's not clear why this request is failing (not returning anything) when a 401 is expected.
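
The same probe can be run from a script, for example inside the Fluent Bit container where curl may not be available. The following is a rough Python sketch of the check described above, not the code Fluent Bit runs. The function names are hypothetical, and any status other than 401 is simply reported as "not confirmed" rather than interpreted.

```python
import urllib.error
import urllib.request
from typing import Optional

PROBE_URL = "http://169.254.169.254/"


def build_probe_request() -> urllib.request.Request:
    """Same request as the curl above: a GET carrying a deliberately invalid token."""
    return urllib.request.Request(
        PROBE_URL,
        method="GET",
        headers={"X-aws-ec2-metadata-token": "INVALID"},
    )


def probe_status(timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status of the probe, or None if nothing answered."""
    try:
        with urllib.request.urlopen(build_probe_request(), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is what we are after.
        return err.code
    except OSError:
        # Connection refused, unreachable network, timeout, and so on.
        return None


def imdsv2_confirmed(status: Optional[int]) -> bool:
    """A 401 for the invalid token is the signal that IMDSv2 is answering."""
    return status == 401
```

Calling `imdsv2_confirmed(probe_status())` on the node and again inside the pod shows whether both see the same answer.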

kdalporto commented 3 years ago

That curl does indeed lead to a 401:

* About to connect() to 169.254.169.254 port 80 (#0)
*   Trying 169.254.169.254...
* Connected to 169.254.169.254 (169.254.169.254) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 169.254.169.254
> Accept: */*
> X-aws-ec2-metadata-token: INVALID
>
< HTTP/1.1 401 Unauthorized
< Content-Length: 0
< Date: Wed, 10 Nov 2021 23:38:57 GMT
< Server: EC2ws
< Connection: close
< Content-Type: text/plain
<
* Closing connection 0

kdalporto commented 2 years ago

@matthewfala, I have a bit of an update. I've realized on two separate occasions that logs had been sent to S3, but I wasn't sure why. This morning I realized it had occurred again: after I deleted my Kubernetes deployment, the logs were sent to S3. This is consistent with the documentation snippet:

"If Fluent Bit is stopped suddenly it will try to send all data and complete all uploads before it shuts down."

At the moment, I don't understand why it seems to be able to send to S3 on shutdown, but fails during normal operations.

Update: I tried to reproduce the above scenario, however no logs were sent on shutdown this time.

matthewfala commented 2 years ago

I'm not sure what the issue could be. The process of obtaining credentials during shutdown is the same as the process of obtaining credentials during normal operations. That is, assuming the inputs (some of which have network activity) are not interfering with our requests. One thing that might be happening is that the input collectors are shut down while the output plugins are still sending out logs. If the input plugin that is interfering with our network requests is stopped, then that might explain why we are able to reach IMDS on shutdown but not during normal operations. What input plugins are you using? Anything that might require networking, such as Prometheus?

I have a custom image which adds IMDSv1 fallback support and also some extra debug statements for IMDS problems. If you want to test this out and send the resulting logs, they could help us figure out what the problem is: (if IMDSv2 fails IMDSv1 will be tried)

Here's the image repo and tag -

826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:1.8.8-imds-fallback-patch

kdalporto commented 2 years ago

Yes, Prometheus is running in our deployment. I'll try to utilize that image and grab the logs.

kdalporto commented 2 years ago

Circling back on this... The issue was that the overall Kubernetes deployment repo we use specifically blocks pods from accessing IMDS in the namespace fluent-bit is deployed in, while access is still available at the instance level. I've confirmed that running fluent-bit in its own separate namespace allows it to send logs to S3 with IMDS.
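
For anyone hitting the same symptoms, the comparison that exposed this (a 401 from curl on the node, but an HTTP 503 in the Fluent Bit logs from the pod) can be written down as a small decision helper. This is only a sketch in Python. The function name and the wording of the messages are hypothetical, and the inputs are the status codes of the invalid-token probe as seen from the node and from the pod, with `None` meaning no response.

```python
from typing import Optional


def diagnose(node_status: Optional[int], pod_status: Optional[int]) -> str:
    """Compare the invalid-token probe result on the node with the one in the pod."""
    if node_status != 401:
        return "IMDSv2 not confirmed on the node itself; check the instance metadata options"
    if pod_status == 401:
        return "IMDSv2 reachable from both node and pod"
    return "IMDS reachable from the node but not from the pod; check pod-level blocking"


# The combination seen in this thread: 401 on the node, 503 from the pod.
print(diagnose(node_status=401, pod_status=503))
```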

PettitWesley commented 2 years ago

@kdalporto Thanks for this post. I had forgotten about that. I believe it's recommended in EKS and ECS to block containers from accessing IMDS.

matthewfala commented 2 years ago

Awesome @kdalporto. I'm glad to hear that this is no longer an issue for you. Thank you for letting us know.

Ahlaee commented 2 years ago

Hi, I'm using Fluent Bit v1.8.15 / aws-for-fluent-bit 2.23.4 on AWS EKS and I'm still getting this in the logs

[2022/04/29 11:16:43] [error] [filter:aws:aws.3] Could not retrieve ec2 metadata from IMDS

I'm using IMDSv2 with the correct hop limit:

{
    "State": "applied",
    "HttpTokens": "required",
    "HttpPutResponseHopLimit": 2,
    "HttpEndpoint": "enabled",
    "HttpProtocolIpv6": "disabled",
    "InstanceMetadataTags": "disabled"
}

`curl -H "X-aws-ec2-metadata-token: INVALID" -v http://169.254.169.254/` is reporting 401.

`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"` returns a token.

Sending logs to Cloudwatch does work though (at least for now). So I'm not sure if this is an error message which refers to IMDSv1 while IMDSv2 is working fine.

PettitWesley commented 2 years ago

@Ahlaee Is there more log output than that?

CC @matthewfala

Ahlaee commented 2 years ago

@PettitWesley Everything else looks ok:

Fluent Bit v1.8.15

Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
https://fluentbit.io

[2022/04/29 10:33:43] [ info] [engine] started (pid=1)
[2022/04/29 10:33:43] [ info] [storage] version=1.1.6, initializing...
[2022/04/29 10:33:43] [ info] [storage] root path '/var/fluent-bit/state/flb-storage/'
[2022/04/29 10:33:43] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2022/04/29 10:33:43] [ info] [storage] backlog input plugin: storage_backlog.8
[2022/04/29 10:33:43] [ info] [cmetrics] version=0.2.2
[2022/04/29 10:33:43] [ info] [input:systemd:systemd.3] seek_cursor=s=bfc76bb2c6464c94b13827824290ea6a;i=14f... OK
[2022/04/29 10:33:43] [ info] [input:storage_backlog:storage_backlog.8] queue memory limit: 4.8M
[2022/04/29 10:33:43] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2022/04/29 10:33:43] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2022/04/29 10:33:43] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2022/04/29 10:33:43] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2022/04/29 10:33:43] [error] [filter:aws:aws.2] Could not retrieve ec2 metadata from IMDS on initialization
[2022/04/29 10:33:43] [error] [filter:aws:aws.3] Could not retrieve ec2 metadata from IMDS on initialization
[2022/04/29 10:33:43] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2022/04/29 10:33:43] [ info] [sp] stream processor started

After that it creates the Log Streams. And then it repeats indefinitely:

[2022/04/29 20:16:46] [error] [filter:aws:aws.3] Could not retrieve ec2 metadata from IMDS

Logs are forwarded to cloudwatch nonetheless.

PettitWesley commented 2 years ago

@Ahlaee Ah, this is the EC2 filter... and I think I might know the problem: you might have IMDS blocked for containers, which is a common best practice. Does your setup include any of this? https://aws.amazon.com/premiumsupport/knowledge-center/ecs-container-ec2-metadata/

Ahlaee commented 2 years ago

@PettitWesley No, our setup runs on EKS not ECS. I never configured anything related to networking modes when spinning up the cluster using the console. As far as I understand from the linked article, having IMDS blocked is an intentional setting that must be included in the user data of the Amazon EC2 instance. I didn't include anything related to this. It might be implicitly included by AWS in the cluster creation process.

PettitWesley commented 2 years ago

@Ahlaee Hmm you're right, this looks like the right link for EKS IMDS related things: https://github.com/aws/containers-roadmap/issues/1109

> After that it creates the Log Streams. And then it repeats indefinitely:
>
> [2022/04/29 20:16:46] [error] [filter:aws:aws.3] Could not retrieve ec2 metadata from IMDS
>
> Logs are forwarded to cloudwatch nonetheless.

Yeah, so the filter is failing while credential retrieval must be succeeding. Can you please share your full config?

Also, since you have IMDSv2 required (tokens required), you need to set imds_version in the AWS filter: https://docs.fluentbit.io/manual/pipeline/filters/aws-metadata

[FILTER]
    Name aws
    Match *
    imds_version v2

Ahlaee commented 2 years ago

I was following the AWS documentation when setting up fluent-bit for EKS: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-logs-FluentBit.html

Their fluent-bit.yaml which is linked under

https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/fluent-bit/fluent-bit.yaml

references an older image of the software that doesn't support IMDSv2, and also sets imds_version to v1 in the aws filter.

Setting the image version to 2.23.4 and the filter to imds_version v2 as you described above solved the issue for me. :)

Thank you!

cgill27 commented 2 years ago

I concur with @Ahlaee. Using EKS with the AWS-supplied docs for setting up fluent-bit to CloudWatch, setting the image to 2.23.4 and imds_version to v2 solved the issue for me as well.

mconigliaro commented 2 years ago

Just setting imds_version v2 fixed this for me. FWIW, it looks like the current stable version is 2.23.3.

babebort commented 2 years ago

For me, it seems that just changing imds_version to v2 helped.

whereisaaron commented 2 years ago

In Oct 2022 the container image version in this manifest was new enough for IMDSv2, but the configuration still contained 'imds_version v1' in two places. Updating 'v1' to 'v2' (in two places) was enough to fix that.

https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml
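
If the manifest is fetched and patched by a script, the two replacements can be done in one pass. The following is a rough Python sketch. The function name is hypothetical, and it assumes every `imds_version v1` line in the file belongs to an aws filter and should be switched.

```python
import re
from typing import Tuple

# One "imds_version v1" setting per line; indentation is preserved via group 1.
IMDS_V1_SETTING = re.compile(r"^([ \t]*imds_version[ \t]+)v1[ \t]*$", re.MULTILINE)


def upgrade_imds_version(config_text: str) -> Tuple[str, int]:
    """Return the patched text and the number of settings that were changed."""
    return IMDS_V1_SETTING.subn(r"\g<1>v2", config_text)


sample = (
    "[FILTER]\n"
    "    Name aws\n"
    "    Match application.*\n"
    "    imds_version v1\n"
    "\n"
    "[FILTER]\n"
    "    Name aws\n"
    "    Match dataplane.*\n"
    "    imds_version v1\n"
)
patched, changed = upgrade_imds_version(sample)
print(changed)
print(patched)
```

The `Match` values in the sample are placeholders. Checking that the returned count equals the number of filters expected (two for this manifest) guards against a silent partial update.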

geocomm-shenningsgard commented 1 year ago

FWIW, I just re-deployed fluent-bit public.ecr.aws/aws-observability/aws-for-fluent-bit@sha256:ff702d8e4a0a9c34d933ce41436e570eb340f56a08a2bc57b2d052350bfbc05d and started receiving the error [error] [filter:aws:aws.3] Could not retrieve ec2 metadata from IMDS. I changed the value for imds_version to v2 in both spots in the ConfigMap (and restarted the DaemonSet) and am still seeing the error.

PettitWesley commented 1 year ago

https://repost.aws/knowledge-center/ecs-container-ec2-metadata

Hop limit of 2 is required when using Docker/containers. https://awscli.amazonaws.com/v2/documentation/api/latest/reference/ec2/modify-instance-metadata-options.html
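
Pulling the findings of this thread together, a quick sanity check over the instance metadata options and the aws filter setting could look like the following. This is a sketch in Python and the function name is hypothetical. It takes the options as a dict in the shape shown in the JSON earlier in the thread, and encodes only the three conditions discussed here (endpoint enabled, hop limit of 2 for containers, `imds_version v2` when tokens are required).

```python
from typing import Any, Dict, List


def check_setup(
    metadata_options: Dict[str, Any],
    filter_imds_version: str,
    in_container: bool,
) -> List[str]:
    """Return a list of likely misconfigurations; an empty list means none found."""
    findings = []
    if metadata_options.get("HttpEndpoint") != "enabled":
        findings.append("IMDS HTTP endpoint is not enabled")
    # A missing value is treated as the default response hop limit of 1.
    hop_limit = int(metadata_options.get("HttpPutResponseHopLimit", 1))
    if in_container and hop_limit < 2:
        findings.append("hop limit is below 2 while running in a container")
    if metadata_options.get("HttpTokens") == "required" and filter_imds_version != "v2":
        findings.append("tokens are required but the aws filter is not set to imds_version v2")
    return findings


options = {
    "State": "applied",
    "HttpTokens": "required",
    "HttpPutResponseHopLimit": 2,
    "HttpEndpoint": "enabled",
}
print(check_setup(options, filter_imds_version="v1", in_container=True))
```

It does not detect pod-level blocking of IMDS, which has to be checked from inside the pod.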
