fluent / helm-charts

Helm Charts for Fluentd and Fluent Bit
Apache License 2.0

Fluentbit pods logs get STS assume role request failed - could not sign request with sigv4 #480

Open bgarcial opened 3 months ago

bgarcial commented 3 months ago

Hi dear community,

I've installed the Fluent Bit Helm chart on Kubernetes (AWS EKS), and I am using IAM roles for service accounts (this way) to send logs to Amazon OpenSearch Service. When I tell the Fluent Bit deployment to use the service account that maps to my AWS role, it seems to look for a /var/run/secrets/eks.amazonaws.com/serviceaccount/aws-iam-token file to get the token,

but the default mounted path is /var/run/secrets/eks.amazonaws.com/serviceaccount/token.

So it cannot fetch the credentials to assume the role. Somehow, when the role and the service account are created, the injected environment variable is AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token, but the pod looks for /var/run/secrets/eks.amazonaws.com/serviceaccount/aws-iam-token.
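For reference, here is a sketch of what the EKS pod identity webhook typically adds to a pod that uses an IRSA-annotated service account. This is an assumption based on the usual EKS behaviour, not output copied from this cluster; the container name and the role ARN are placeholders. In this layout, `aws-iam-token` is the name of the projected volume, while the file inside the mount is `token`:

```yaml
# Sketch of the fields usually injected for IRSA (assumed, not taken from this cluster)
spec:
  containers:
    - name: fluent-bit                      # hypothetical container name
      env:
        - name: AWS_ROLE_ARN
          value: arn:aws:iam::xxx:role/example-role   # placeholder ARN
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
      volumeMounts:
        - name: aws-iam-token               # volume name, not a file name
          mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
          readOnly: true
  volumes:
    - name: aws-iam-token
      projected:
        sources:
          - serviceAccountToken:
              audience: sts.amazonaws.com
              expirationSeconds: 86400
              path: token                   # file created inside the mount path
```

If the pod spec looks like this, the two paths may not actually conflict, so it is worth comparing against the real pod description before changing any mount.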

As a result, I get this error in the Fluent Bit pod logs:

[2024/03/27 10:46:29] [error] [aws_credentials] STS assume role request failed
[2024/03/27 10:46:29] [ warn] [aws_credentials] No cached credentials are available and a credential refresh is already in progress. The current co-routine will retry.
[2024/03/27 10:46:29] [error] [signv4] Provider returned no credentials, service=es
[2024/03/27 10:46:29] [error] [output:opensearch:opensearch.0] could not sign request with sigv4
[2024/03/27 10:46:29] [ warn] [engine] chunk '1-1711536378.324856511.flb' cannot be retried: task_id=22, input=tail.0 > output=opensearch.0
[2024/03/27 10:46:29] [ info] [input] tail.0 resume (mem buf overlimit)

I understand this is a known issue, but from what I have read it is not clear how it can be solved:

Yeah, basically it's because the config map sets "" as the default for aws_sts_endpoint instead of NULL. This leads the code to incorrectly think that there is a custom STS endpoint, and then Fluent Bit tries to make a request to "". https://github.com/fluent/fluent-bit/blob/master/plugins/out_es/es.c#L804

But that issue, where the Fluent Bit code receives "" as the endpoint, is supposed to be fixed now (I am using v2.2.2). It also suggests setting the AWS_STS_Endpoint parameter as a workaround, but that did not work for me, nor for some other people.

Just for the record, this is my opensearch output plugin configuration:

    [OUTPUT]
        Name opensearch
        Match host.*
        Host vpc-xxxxr-eks-logs-test-qxxxxojgfi4d7fuoshm5e.eu-west-1.es.amazonaws.com
        Port 443
        AWS_Role_ARN arn:aws:iam::xxx:role/fluentbit-to-ope-xxxx-test-fluentbit-serviceaccount
        Logstash_Format On
        Logstash_Prefix node-logs
        Retry_Limit False
        AWS_Auth On
        AWS_Region eu-west-1
        tls On
        Trace_Output On
        Trace_Error On
        AWS_STS_Endpoint https://sts.eu-west-1.amazonaws.com

That IAM role (which is mapped from a Kubernetes service account) has this policy attached:

{
    "Statement": [
        {
            "Action": [
                "es:ESHttpPut",
                "es:ESHttpPost",
                "es:ESHttpGet",
                "es:ESHttpDelete"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": "es:*",
            "Effect": "Allow",
            "Resource": "arn:aws:es:eu-west-1:xxxxxx:domain/xxr-eks-logs-test"
        }
    ],
    "Version": "2012-10-17"
}

I think the problem is what I mentioned at the beginning: the pod looks for the /var/run/secrets/eks.amazonaws.com/serviceaccount/aws-iam-token file, but the injected environment variable is AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token, and that is why it does not find the token to assume the role.

How can I change the Fluent Bit configuration from the Helm chart via parameters? It seems the injected service account token path is the default managed by AWS EKS itself, but the pod deployed by the Helm chart looks for a slightly different path.

I would appreciate it if someone could point me in the right direction :slightly_smiling_face:

iamwep commented 3 months ago

I'm running into exactly the same issue. Did you find any workaround or solution yet?

bgarcial commented 3 months ago

@iamwep not yet. I have been reading several issues here and on the 'aws-for-fluent-bit' side, and there is no clarity about what could be happening. What I described here is what I think is happening from the perspective of the volume mount of the service account token (when working with IRSA), but here they say that this could also be a problem of too many requests to the 'sts' endpoint, with Amazon throttling those requests. I am really not sure about that, as I am testing this in a Kubernetes test environment where there is almost no outbound traffic.

Wyifei commented 2 months ago

I'm facing the same issue: I want to connect to AWS Kinesis in another account, and assume role doesn't work, with the error message below:

[2024/04/18 02:30:32] [error] [aws_credentials] STS assume role request failed
[2024/04/18 02:30:32] [ warn] [aws_credentials] No cached credentials are available and a credential refresh is already in progress. The current co-routine will retry.
[2024/04/18 02:30:32] [error] [signv4] Provider returned no credentials, service=kinesis
[2024/04/18 02:30:32] [error] [aws_client] could not sign request

helm chart:

    [OUTPUT]
        Name kinesis_streams
        Match *
        stream test
        region eu-central-1
        role_arn arn:aws:iam::1234567890:role/kiness

bgarcial commented 2 months ago

@Wyifei @iamwep I solved this issue. The IAM role for the service account should be provided only in the serviceAccount.annotations field of the Helm chart, and not in the OpenSearch Fluent Bit output plugin (via AWS_Role_ARN). I was providing it in both places, and that's why the STS request was failing. The output plugin doesn't require it: for collecting logs it relies on the IAM role of the node group that backs the EKS cluster, and the service account already has the desired permissions. Let me know if that could be your case too. :)
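A minimal sketch of that setup as Helm values, assuming the chart exposes `serviceAccount.annotations` and `config.outputs` (check the values.yaml of your chart version). The host and role ARN reuse the masked placeholders from my configuration above, and the remaining output options are trimmed for brevity:

```yaml
# values.yaml (sketch; ARN and host are masked placeholders from this thread)
serviceAccount:
  create: true
  annotations:
    # IRSA: the role is referenced here, and only here
    eks.amazonaws.com/role-arn: arn:aws:iam::xxx:role/fluentbit-to-ope-xxxx-test-fluentbit-serviceaccount

config:
  # Note: AWS_Role_ARN is intentionally absent from the output section
  outputs: |
    [OUTPUT]
        Name opensearch
        Match host.*
        Host vpc-xxxxr-eks-logs-test-qxxxxojgfi4d7fuoshm5e.eu-west-1.es.amazonaws.com
        Port 443
        Logstash_Format On
        Logstash_Prefix node-logs
        AWS_Auth On
        AWS_Region eu-west-1
        tls On
```

With the annotation in place, the credentials come from the web identity token mounted for the service account, so the plugin should not need to perform a second assume-role call.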

Wyifei commented 2 months ago

@bgarcial I solved the issue by upgrading the Helm chart from 0.21.1 to 0.46.1.