Closed · yantk-hk closed this issue 3 years ago
I'm seeing the exact same problem. I installed using the Helm chart from the Google stable repo; here are the manifests that end up in the cluster:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::my-acc:role/fluent-bit
    meta.helm.sh/release-name: fluent-bit
    meta.helm.sh/release-namespace: logging
  labels:
    app: fluent-bit
    app.kubernetes.io/managed-by: Helm
    chart: fluent-bit-2.10.1
    heritage: Helm
    release: fluent-bit
  name: fluent-bit
  namespace: logging
---
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  labels:
    app: fluent-bit
    controller-revision-hash: 7d55f48cd8
    pod-template-generation: "1"
    release: fluent-bit
  name: fluent-bit-2g698
  namespace: logging
spec:
  containers:
  - env:
    - name: AWS_DEFAULT_REGION
      value: eu-west-1
    - name: HOSTNAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::my-acc:role/fluent-bit
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    image: fluent/fluent-bit:1.6-debug
    imagePullPolicy: Always
    name: fluent-bit
    resources:
      limits:
        cpu: 100m
        memory: 128Mi
      requests:
        cpu: 100m
        memory: 128Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/log
      name: varlog
    - mountPath: /var/lib/docker/containers
      name: varlibdockercontainers
      readOnly: true
    - mountPath: /fluent-bit/etc/fluent-bit.conf
      name: config
      subPath: fluent-bit.conf
    - mountPath: /fluent-bit/etc/fluent-bit-service.conf
      name: config
      subPath: fluent-bit-service.conf
    - mountPath: /fluent-bit/etc/fluent-bit-input.conf
      name: config
      subPath: fluent-bit-input.conf
    - mountPath: /fluent-bit/etc/fluent-bit-filter.conf
      name: config
      subPath: fluent-bit-filter.conf
    - mountPath: /fluent-bit/etc/fluent-bit-output.conf
      name: config
      subPath: fluent-bit-output.conf
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: fluent-bit-token-65zvs
      readOnly: true
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: fluent-bit
  serviceAccountName: fluent-bit
  volumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token
  - hostPath:
      path: /var/log
      type: ""
    name: varlog
  - hostPath:
      path: /var/lib/docker/containers
      type: ""
    name: varlibdockercontainers
  - configMap:
      defaultMode: 420
      name: fluent-bit-config
    name: config
  - name: fluent-bit-token-65zvs
    secret:
      defaultMode: 420
      secretName: fluent-bit-token-65zvs
```
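For debugging a setup like the one above, it can help to confirm that the projected web-identity token actually carries the expected audience and service-account subject. This is a stdlib-only sketch (the `jwt_claims` helper name is mine, and the token payload here is fabricated for illustration) that decodes a JWT payload without verifying its signature:

```python
import base64
import json

def jwt_claims(token: str) -> dict:
    """Decode the unverified payload of a JWT, such as the projected web
    identity token mounted at
    /var/run/secrets/eks.amazonaws.com/serviceaccount/token.
    (Debugging aid only; not part of fluent-bit.)"""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))

# Fabricated token for illustration; a real token's header, claims, and
# signature will differ.
claims = {"aud": ["sts.amazonaws.com"],
          "sub": "system:serviceaccount:logging:fluent-bit"}
fake = "h." + base64.urlsafe_b64encode(
    json.dumps(claims).encode()).decode().rstrip("=") + ".s"
print(jwt_claims(fake)["aud"])  # ['sts.amazonaws.com']
```

The `aud` claim should match the `audience: sts.amazonaws.com` set on the projected `serviceAccountToken` volume, and `sub` should name the annotated service account.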
and the fluent-bit config:
```
[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_Tag_Prefix     kube.var.log.containers.
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log           On
    Merge_Log_Key       log_processed
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On

[OUTPUT]
    Name            es
    Match           *
    Host            my-es-domain.eu-west-1.es.amazonaws.com
    Port            443
    Logstash_Format On
    Retry_Limit     False
    Type            _doc
    Time_Key        @timestamp
    Replace_Dots    On
    Logstash_Prefix my-domain
    AWS_Auth        On
    AWS_Region      eu-west-1
    tls             On
```
and the fluent-bit logs:
Fluent Bit v1.6.1
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2020/10/21 13:51:00] [ info] [engine] started (pid=1)
[2020/10/21 13:51:00] [ info] [storage] version=1.0.6, initializing...
[2020/10/21 13:51:00] [ info] [storage] in-memory
[2020/10/21 13:51:00] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/10/21 13:51:00] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2020/10/21 13:51:00] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2020/10/21 13:51:00] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2020/10/21 13:51:00] [ info] [filter:kubernetes:kubernetes.0] API server connectivity OK
[2020/10/21 13:51:00] [ warn] net_tcp_fd_connect: getaddrinfo(host=''): Name or service not known
[2020/10/21 13:51:00] [error] [io] connection #41 failed to: :443
[2020/10/21 13:51:00] [ info] [sp] stream processor started
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=11550335 watch_fd=1 name=/var/log/containers/alertmanager-kube-prometheus-stack-alertmanager-0_monitoring_alertmanager-bbd78510252b994ae61670696334be15c0f531b091829a450137a319e88a4178.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=9452023 watch_fd=2 name=/var/log/containers/alertmanager-kube-prometheus-stack-alertmanager-0_monitoring_config-reloader-2c0eca22b0b789b324075ec787fea4e4c9cc4e05ec902702155dce64ac2315b5.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=2146617 watch_fd=3 name=/var/log/containers/aws-node-jnh68_kube-system_aws-node-f5d7f8524182bd4e58639665e66c4bef88c1d341147002b6277115a8287c0fd7.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=44106514 watch_fd=4 name=/var/log/containers/aws-node-termination-handler-r228s_kube-system_aws-node-termination-handler-0406067f9f3385355f8aabfb09cd2ae4f1f8050b40118e37dda7cfc424a748d8.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=6564993 watch_fd=5 name=/var/log/containers/cert-manager-6f657bd884-qzz8b_cert-manager_cert-manager-ca6753d0b6f36ff3d1296ecbd8dbb98e4a73f36c7a0fcbcbaa7813874fcb9e35.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=19931777 watch_fd=6 name=/var/log/containers/cert-manager-webhook-cdb5c8884-fm4ll_cert-manager_cert-manager-d7cf10a24e970c39f31f559ba5597d03cd9bd7cde26b855762a488a8e3e706e4.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=22025337 watch_fd=7 name=/var/log/containers/cluster-autoscaler-chart-aws-cluster-autoscaler-chart-56776q4lj_kube-system_aws-cluster-autoscaler-chart-b1176a841aef00ed3607c8b5b3d281174f2aa5882fd325f1bbf0a7ef299fbc6d.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=27283613 watch_fd=8 name=/var/log/containers/external-dns-6bcd486cbb-mfnc9_default_external-dns-cec5f935de2f6b33e58990fa2fcac283c802a06029ebef84955a78700fe0a8e8.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=25177292 watch_fd=9 name=/var/log/containers/ingress-nginx-controller-66dc9984d8-lvgbl_default_controller-3c2f1518500ad04125eb4ddae521736835d5954593ea0225a1914fa7e71cd68f.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=37753150 watch_fd=10 name=/var/log/containers/kube-prometheus-stack-kube-state-metrics-66789f8885-55p7v_monitoring_kube-state-metrics-c652ec07303dab30431dacb8af3ee662b262d4c6f0d7c74d2b5c0208630a6009.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=6563946 watch_fd=11 name=/var/log/containers/kube-prometheus-stack-operator-f4c99ffb7-7kcqg_monitoring_kube-prometheus-stack-517db77d62039dc269d655441517e58af09590490547eb874ae5ad4ba4d44fa5.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=39868468 watch_fd=12 name=/var/log/containers/kube-prometheus-stack-operator-f4c99ffb7-7kcqg_monitoring_tls-proxy-b3730cf541be60158c2c7f82f013a3b9a44582ae4f8fc5c43e52c5143630fc30.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=13651330 watch_fd=13 name=/var/log/containers/kube-prometheus-stack-prometheus-node-exporter-nthzc_monitoring_node-exporter-fef57c689fd7c0bd5771eeb0f4b6dcb626707e202afdf852aaedfd2416df9a0d.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=20973204 watch_fd=14 name=/var/log/containers/kube-proxy-hshn9_kube-system_kube-proxy-ceab9231e552b168ea86a73c00a1433b1be0e8696698acf0a65311690e0a51d8.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=30578779 watch_fd=15 name=/var/log/containers/node-problem-detector-6vxjl_kube-system_node-problem-detector-f57b63acf906730858ee06d2fead9cffe5dfabc2472ed4219ec9307a2423e1ea.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=17841555 watch_fd=16 name=/var/log/containers/fluent-bit-6bwdd_logging_fluent-bit-afed86fbe22f5309d7754a263daadcd8eca6677b57548849a822a66c6718904b.log
[2020/10/21 13:51:01] [error] [output:es:es.0] HTTP status=403 URI=/_bulk, response:
{"Message":"User: arn:aws:sts::my-acc:assumed-role/my-cluster20200904094215498700000007/i-0bc48f7cbce18e8e6 is not authorized to perform: es:ESHttpPost"}
The role mentioned in that last log statement is the instance / node profile, the same issue described by the OP.
I made some progress by following this advice to block access to the node role; now the fluent-bit logs read:
[2020/10/21 14:08:12] [error] [aws_credentials] Could not read shared credentials file /root/.aws/credentials
[2020/10/21 14:08:12] [error] [aws_credentials] Failed to retrieve credentials for AWS Profile default
[2020/10/21 14:08:12] [ warn] net_tcp_fd_connect: getaddrinfo(host=''): Name or service not known
[2020/10/21 14:08:12] [error] [io] connection #54 failed to: :443
[2020/10/21 14:08:12] [error] [aws_client] connection initialization error
[2020/10/21 14:08:12] [error] [aws_credentials] STS assume role request failed
[2020/10/21 14:08:12] [ warn] [aws_credentials] No cached credentials are available and a credential refresh is already in progress. The current co-routine will retry.
[2020/10/21 14:08:12] [ warn] [aws_credentials] No cached credentials are available and a credential refresh is already in progress. The current co-routine will retry.
[2020/10/21 14:08:12] [error] [signv4] Provider returned no credentials, service=es
[2020/10/21 14:08:12] [error] [output:es:es.0] could not sign request with sigv4
[2020/10/21 14:08:12] [ warn] [engine] failed to flush chunk '1-1603288814.10956238.flb', retry in 583 seconds: task_id=91, input=tail.0 > output=es.0
A noble stranger on the provider-aws Kubernetes Slack channel gave me a workaround that fixes this issue for both of us: specify AWS_STS_Endpoint in the OUTPUT config:
```
[OUTPUT]
    Name             es
    Match            *
    Host             my-es-domain.eu-west-1.es.amazonaws.com
    Port             443
    Logstash_Format  On
    Retry_Limit      False
    Type             _doc
    Time_Key         @timestamp
    Replace_Dots     On
    Logstash_Prefix  my-domain
    AWS_Auth         On
    AWS_Region       eu-west-1
    AWS_STS_Endpoint https://sts.eu-west-1.amazonaws.com
    tls              On
```
Notice the `AWS_STS_Endpoint` line; the region might be different for you.
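The workaround value is just the regional STS endpoint URL, and it has to agree with the `AWS_Region` setting. A tiny sketch (the helper name is mine) of how the value is constructed:

```python
def regional_sts_endpoint(region: str) -> str:
    """Build the AWS_STS_Endpoint value for a given region.
    The region should match AWS_Region in the [OUTPUT] section."""
    return f"https://sts.{region}.amazonaws.com"

print(regional_sts_endpoint("eu-west-1"))  # https://sts.eu-west-1.amazonaws.com
```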
Hi there! I am the 'noble stranger' mentioned above. 😅 Apologies for not filing the bug beforehand; I thought it was just something weird with the AWS account I was using.
Anyhow, I see that no one has posted debug logs for this yet, so I'll post this snippet from when I ran into the issue last week, since it's what led me down the STS endpoint config path:
[2020/10/15 00:50:39] [debug] [aws_credentials] Init called on the EKS provider
[2020/10/15 00:50:39] [debug] [aws_credentials] Calling STS..
[2020/10/15 00:50:39] [ warn] net_tcp_fd_connect: getaddrinfo(host=''): Name or service not known
[2020/10/15 00:50:39] [error] [io] connection #39 failed to: :443
[2020/10/15 00:50:39] [debug] [upstream] connection #39 failed to :443
[2020/10/15 00:50:39] [debug] [aws_client] connection initialization error
[2020/10/15 00:50:39] [debug] [aws_credentials] STS assume role request failed
It may also be worth noting that I am using the amazon/aws-for-fluent-bit image that uses Fluent Bit 1.6.
I think this is probably a bug... IAM Roles for SA calls STS... we made a change to the STS endpoint code to enable custom endpoints.
I bet there's a bug there...
I can confirm that this happens on ECS/Fargate with Firelens also.
Setting AWS_STS_Endpoint helps.
@hoegertn Are you specifying an IAM role with the aws_role_arn parameter?
I'm about to put up a PR to fix this... basically calling STS is broken (which happens if you use EKS IRSA or a custom role).
Yes, I am assuming a role that has ES permissions. As you mentioned the STS call is broken as it does not know the hostname to contact.
Yeah, basically it's because the config map sets `""` as the default for `aws_sts_endpoint` instead of `NULL`. This leads the code to incorrectly think that there is a custom STS endpoint, and then Fluent Bit tries to make a request to `""`.
https://github.com/fluent/fluent-bit/blob/master/plugins/out_es/es.c#L804
At least that's what I'm testing right now..
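The failure mode can be illustrated in miniature. This is a Python analogy of the C bug described above (the real code is in `plugins/out_es/es.c`; all names and the default host constant here are mine): an empty-string default passes a not-`NULL` check, so `""` gets treated as a custom endpoint and the connection goes to host `''`:

```python
DEFAULT_STS_HOST = "sts.amazonaws.com"  # illustrative default, not the real C constant

def sts_host_buggy(aws_sts_endpoint):
    # Buggy shape: "" is not None, so it slips through and is used
    # as the host -- matching getaddrinfo(host='') in the logs above.
    if aws_sts_endpoint is not None:
        return aws_sts_endpoint
    return DEFAULT_STS_HOST

def sts_host_fixed(aws_sts_endpoint):
    # Fixed shape: treat both None and "" as "no custom endpoint".
    if aws_sts_endpoint:
        return aws_sts_endpoint
    return DEFAULT_STS_HOST

print(repr(sts_host_buggy("")))  # '' -> name resolution fails
print(sts_host_fixed(""))        # sts.amazonaws.com
```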
This was fixed in 1.6.2
AWS for Fluent Bit has not been updated yet since we are still trying to fix https://github.com/fluent/fluent-bit/issues/2715
Is this really fixed?
@ypicard Yes. Please open a new issue if you are having credential issues: https://github.com/aws/aws-for-fluent-bit
Two years have passed, and the issue still exists. I installed 2.1.8 using Helm; fluent-bit is unable to use the role in the service account (IRSA) and defaults to the node role. I uninstalled and re-installed with AWS_STS_Endpoint in the [OUTPUT] es section, and that made no difference at all.
[2023/08/23 06:18:30] [error] [output:es:es.0] HTTP status=403 URI=/_bulk, response: {"Message":"User: arn:aws:sts::xxxx:assumed-role/xxxx/i-084a7914b30b44399 is not authorized to perform: es:ESHttpPost because no identity-based policy allows the es:ESHttpPost action"}
@tejarora something similar is happening for me: I'm getting `STS assume role request failed` because the pod looks for the token at a different file path. I've described it in the above link.
Bug Report
Describe the bug: Fluent Bit 1.6 ES plugin keeps sourcing credentials from the EC2 instance rather than from IAM Roles for Service Accounts on the Amazon EKS worker node.
To Reproduce
1. Create an Amazon Elasticsearch domain (version 7.7) with open access.
2. Create a service account in the EKS cluster with IAM Roles for Service Accounts and the corresponding AWS IAM policies (e.g. `es:*`).
3. Upgrade fluent bit from 1.5 to 1.6, keeping the existing configuration in the EKS ConfigMap.
The following error is shown in the fluent bit stdout/log in EKS. The messages keep appearing and show that the pod / fluent bit keeps sourcing AWS credentials from the underlying EKS worker node (EC2 instance) rather than from the annotated IAM Role for Service Accounts (IRSA).
[2020/10/16 09:52:24] [error] [output:es:es.3] HTTP status=403 URI=/_bulk, response: {"error":{"root_cause":[{"type":"security_exception","reason":"no permissions for [indices:data/write/bulk] and User [name=arn:aws:iam::XXX873347XXX:role/eksctl-cluster-1-nodegroup-ng-al1-NodeInstanceRole-7GZZR0O6HRQS, backend_roles=[arn:aws:iam::XXX873347XXX:role/eksctl-cluster-1-nodegroup-ng-al1-NodeInstanceRole-7GZZR0O6HRQS], requestedTenant=null]"}],"type":"security_exception","reason":"no permissions for [indices:data/write/bulk] and User [name=arn:aws:iam::XXX873347XXX:role/eksctl-cluster-1-nodegroup-ng-al1-NodeInstanceRole-7GZZR0O6HRQS, backend_roles=[arn:aws:iam::XXX873347XXX:role/eksctl-cluster-1-nodegroup-ng-al1-NodeInstanceRole-7GZZR0O6HRQS], requestedTenant=null]"},"status":403}
Expected behavior: AWS credentials are sourced via the EKS IAM Role for Service Accounts (IRSA), not from the underlying worker node.