aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0
2.07k stars 608 forks source link

ECS agent should handle EFS volume umount after a task container is killed #2810

Closed Cappuccinuo closed 2 years ago

Cappuccinuo commented 3 years ago

Summary

aws/efs-utils#83

The EFS volume attached to a task is not umounted after the task is killed. Umount is client side behaviour, so once the task & container is exited, the volume attached should be umounted.

Description

After the task is running, everything works fine and volume is attached. After the task is killed, the volume is still mounted, the amazon-efs-watchdog cannot kill the stunnel since the volume is not umounted, since the iam credentials are fetched from AWS_CONTAINER_CREDENTIALS_RELATIVE_URI, the watchdog is complaining

2021-02-15 03:35:03,419 - ERROR - Cannot recreate self-signed certificate
2021-02-15 03:35:04,448 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/****

Expected Behavior

The volume resource is umounted after the task & container is killed.

Observed Behavior

The volume resource is still attached to the host instance after the task & container is killed.

You can see that there are two volume attached, the volume mounted on /var/lib/ecs/volumes/ecs-Github83-1-EFS-d8d9dfe0a1d9b4918a01 cannot retrieve credentials properly

[root@ip-172-31-64-80 efs]# df
Filesystem            1K-blocks    Used        Available Use% Mounted on
...
127.0.0.1:/    9007199254739968 4194304 9007199250545664   1% /var/lib/ecs/volumes/ecs-Github83-1-EFS-d8d9dfe0a1d9b4918a01
...
127.0.0.1:/    9007199254739968 4194304 9007199250545664   1% /var/lib/ecs/volumes/ecs-Github83-2-EFS-e083d4bdace2f4908b01
...

While the container 0c3c32e57096 is already killed

[root@ip-172-31-64-80 efs]# docker ps -a
CONTAINER ID        IMAGE                            COMMAND                  CREATED             STATUS                      PORTS               NAMES
38cb273faa83        nginx                            "/docker-entrypoint.…"   35 minutes ago      Up 35 minutes               80/tcp              ecs-Github83-2-nginx-f292ca9ae7d6f7c11300
0c3c32e57096        nginx                            "/docker-entrypoint.…"   39 minutes ago      Exited (0) 35 minutes ago                       ecs-Github83-1-nginx-b688b3edf6cdabcde901
a9b8f9d326ce        amazon/amazon-ecs-agent:latest   "/agent"                 46 minutes ago      Up 46 minutes (healthy)                         ecs-agent

After umount the file system on the instance, everything works fine

[root@ip-172-31-64-80 efs]# sudo umount /var/lib/ecs/volumes/ecs-Github83-1-EFS-d8d9dfe0a1d9b4918a01
[root@ip-172-31-64-80 efs]# df
Filesystem            1K-blocks    Used        Available Use% Mounted on
...
127.0.0.1:/    9007199254739968 4194304 9007199250545664   1% /var/lib/ecs/volumes/ecs-Github83-2-EFS-e083d4bdace2f4908b01
...
[root@ip-172-31-64-80 efs]# tail -f mount-watchdog.log
2021-02-15 03:36:45,589 - ERROR - Cannot recreate self-signed certificate
# Umount the file system
2021-02-15 03:36:46,591 - INFO - No mount found for "fs-66e7bce7.var.lib.ecs.volumes.ecs-Github83-1-EFS-d8d9dfe0a1d9b4918a01.20167"
2021-02-15 03:37:16,642 - INFO - Unmount grace period expired for fs-66e7bce7.var.lib.ecs.volumes.ecs-Github83-1-EFS-d8d9dfe0a1d9b4918a01.20167
2021-02-15 03:37:16,642 - INFO - Terminating running TLS tunnel - PID: 4564, group ID: 4564
2021-02-15 03:37:16,642 - INFO - TLS tunnel: 4564 is still running, will retry termination
2021-02-15 03:37:17,644 - INFO - Unmount grace period expired for fs-66e7bce7.var.lib.ecs.volumes.ecs-Github83-1-EFS-d8d9dfe0a1d9b4918a01.20167
2021-02-15 03:37:17,644 - INFO - TLS tunnel: 4564 is no longer running, cleaning up state

Environment Details

Task definition:

{
  ...
      "mountPoints": [
        {
          "readOnly": null,
          "containerPath": "/efs",
          "sourceVolume": "EFS"
        }
      ],
...
      "image": "nginx",
...
      "name": "nginx"
    }
  ],
...
  "volumes": [
    {
      ...
      "efsVolumeConfiguration": {
        "transitEncryptionPort": null,
        "fileSystemId": "fs-12345678",
        "authorizationConfig": {
          "iam": "ENABLED",
          "accessPointId": null
        },
        "transitEncryption": "ENABLED",
        "rootDirectory": "/"
      },
      "name": "EFS",
      "host": null,
      "dockerVolumeConfiguration": null
    }
  ]
}

Supporting Log Snippets

N/A, if needed I will collect. Otherwise I think the investigation should be around how ECS agent handle the resource clean up (EFS volume umount).

sparrc commented 3 years ago

hello, please run the ecs logs collector: https://github.com/aws/amazon-ecs-logs-collector, you can email the bundle to ecs-agent-external at amazon.com

Also to clarify, what do you mean by "the task container is killed"? are you stopping the task via the ECS API? or killing the container directly on the instance via docker kill?

Cappuccinuo commented 3 years ago

From the ECS console, I believe that is through ECS API right.

Before I collect more logs, can you confirm that EFS volume should be umounted after the task is stopped ?

sparrc commented 3 years ago

to follow up, it appears that they ecs agent does unmount, but not until the task is cleaned up. The waiting time between a task stopping and being cleaned up is controlled by the env var ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION. This env var defaults to 3 hours. Could you please try setting this env var so a lower value, such as ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=10m and see if you are able to reproduce?

Cappuccinuo commented 3 years ago

Hey @sparrc ,

Thanks for the suggestion, could you please confirm that my ENV setting is correct? (I am following https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-config.html)

$ cat /etc/ecs/ecs.config
ECS_CLUSTER=Github83
ECS_BACKEND_HOST=
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=1m

Do I need to restart the agent to make this effective? Seems after this config, nothing changed, after the task is stopped on the ECS console, the issue is still persist after 1min due to the file system is not umounted.

# The time when I trigger task stop
level=info time=2021-02-17T02:09:44Z msg="Managed task has reached stopped; waiting for container cleanup" taskARN="arn:aws:ecs:us-east-1:617566850839:task/Github83/8fee1df8feb7433dba149dd6dd702cf0"
# After 1m+
level=error time=2021-02-17T02:13:19Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go

# The file system is still mounted
$ date -u
Wed Feb 17 02:15:35 UTC 2021
$ sudo df
Filesystem            1K-blocks    Used        Available Use% Mounted on
...
127.0.0.1:/    9007199254739968 4194304 9007199250545664   1% /var/lib/ecs/volumes/ecs-Github83-2-EFS-8cf0e0a597df80c6a101
sparrc commented 3 years ago

yes you will need you restart the ecs agent to pickup config changes: sudo systemctl restart ecs

Cappuccinuo commented 3 years ago

Thanks, confirmed that after restart, the file system is umounted and this issue is not happening anymore.

level=info time=2021-02-17T04:55:26Z msg="Container change is redundant" taskARN="arn:aws:ecs:us-east-1:617566850839:task/Github83/8a8df65a0a204f7e82f7a45d36635efe" container="nginx" runtimeID="16b1aea51f66d47099369f65d96aea668e7cb96735b6b70e4f96185246d5b870" status="STOPPED" knownStatus="STOPPED"
level=info time=2021-02-17T04:55:26Z msg="Waiting for task event" taskARN="arn:aws:ecs:us-east-1:617566850839:task/Github83/8a8df65a0a204f7e82f7a45d36635efe"
level=error time=2021-02-17T04:56:09Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=warn time=2021-02-17T04:56:09Z msg="Unknown eventType: GetCredentialsInvalidRoleType" module=entry_types.go
...
level=info time=2021-02-17T04:56:26Z msg="Waiting for task event" taskARN="arn:aws:ecs:us-east-1:617566850839:task/Github83/8a8df65a0a204f7e82f7a45d36635efe"
level=info time=2021-02-17T04:56:26Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-east-1:617566850839:task/Github83/8a8df65a0a204f7e82f7a45d36635efe"
level=info time=2021-02-17T04:56:26Z msg="Task engine [arn:aws:ecs:us-east-1:617566850839:task/Github83/8a8df65a0a204f7e82f7a45d36635efe]: removing container: nginx" module=docker_task_engine.go
level=info time=2021-02-17T04:56:26Z msg="Removing Container Reference: nginx from Image State- sha256:298ec0e28760b8eb1aad79711dc29c19041c61d7cf342dd1f445e91f30500549" module=types.go
level=info time=2021-02-17T04:56:26Z msg="Task engine [arn:aws:ecs:us-east-1:617566850839:task/Github83/8a8df65a0a204f7e82f7a45d36635efe]: resource cgroup cleanup complete" module=docker_task_engine.go
level=info time=2021-02-17T04:56:26Z msg="Task engine [arn:aws:ecs:us-east-1:617566850839:task/Github83/8a8df65a0a204f7e82f7a45d36635efe]: resource EFS cleanup complete" module=docker_task_engine.go
level=info time=2021-02-17T04:56:26Z msg="Task engine [arn:aws:ecs:us-east-1:617566850839:task/Github83/8a8df65a0a204f7e82f7a45d36635efe]: finished removing task data, removing task from managed tasks" module=docker_task_engine.go

Are there any method else that we can config the agent instead of manually edit the config file? e.g. Can we config this in the task definition or cluster definition?

shubham2892 commented 3 years ago

You can set the environment variable using userdata for the EC2 instance, you can find more info here -- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/bootstrap_container_instance.html#bootstrap_container_agent

xian13 commented 3 years ago

Hi @Cappuccinuo @shubham2892 thanks for updating the issue

Is setting up ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION is the final solution to this issue? If yes then I think the only value that will work is set it to the minimum which is 1m And the error still appears printed to the log for that 1 minute window.

I see many errors printed /var/log/ecs/ecs-agent when many of my containers that use EFS exit at 23:37 and cleanup completely at 23:39

level=error time=2021-02-17T23:38:40Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-02-17T23:38:41Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-02-17T23:38:42Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-02-17T23:38:43Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-02-17T23:38:44Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-02-17T23:38:44Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-02-17T23:38:45Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-02-17T23:38:45Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-02-17T23:38:46Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-02-17T23:38:46Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
...
level=error time=2021-02-17T23:39:28Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-02-17T23:39:28Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-02-17T23:39:28Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-02-17T23:39:29Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-02-17T23:39:29Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-02-17T23:39:29Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-02-17T23:39:29Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-02-17T23:39:30Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-02-17T23:39:30Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-02-17T23:39:31Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go

and also at /var/log/amazon/efs/mount-watchdog.log

2021-02-17 23:38:40,291 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/a1ff833e-2956-4ff2-a0aa-483edc84428e
2021-02-17 23:38:40,291 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:38:41,389 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/a1ff833e-2956-4ff2-a0aa-483edc84428e
2021-02-17 23:38:41,390 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:38:42,438 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/a1ff833e-2956-4ff2-a0aa-483edc84428e
2021-02-17 23:38:42,438 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:38:43,489 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/a1ff833e-2956-4ff2-a0aa-483edc84428e
2021-02-17 23:38:43,490 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:38:44,570 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/17047aa2-041c-4068-bb6d-8e3dc19a03b8
2021-02-17 23:38:44,571 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:38:44,621 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/a1ff833e-2956-4ff2-a0aa-483edc84428e
2021-02-17 23:38:44,621 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:38:45,802 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/17047aa2-041c-4068-bb6d-8e3dc19a03b8
2021-02-17 23:38:45,802 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:38:45,876 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/a1ff833e-2956-4ff2-a0aa-483edc84428e
2021-02-17 23:38:45,876 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:38:46,921 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/17047aa2-041c-4068-bb6d-8e3dc19a03b8
2021-02-17 23:38:46,921 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:38:46,962 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/a1ff833e-2956-4ff2-a0aa-483edc84428e
2021-02-17 23:38:46,962 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:38:48,024 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/17047aa2-041c-4068-bb6d-8e3dc19a03b8
2021-02-17 23:38:48,025 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:38:48,098 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/a1ff833e-2956-4ff2-a0aa-483edc84428e
2021-02-17 23:38:48,098 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:38:49,145 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/17047aa2-041c-4068-bb6d-8e3dc19a03b8
2021-02-17 23:38:49,145 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:38:49,186 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/a1ff833e-2956-4ff2-a0aa-483edc84428e
2021-02-17 23:38:49,186 - ERROR - Cannot recreate self-signed certificate
...
021-02-17 23:39:26,069 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/a1ff833e-2956-4ff2-a0aa-483edc84428e
2021-02-17 23:39:26,069 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:39:27,113 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/9d799bff-03b6-4561-b7af-d8996700a6ab
2021-02-17 23:39:27,113 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:39:27,155 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/ea3c29c8-9cd2-4275-b1ad-beda5fcc047e
2021-02-17 23:39:27,156 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:39:27,210 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/17047aa2-041c-4068-bb6d-8e3dc19a03b8
2021-02-17 23:39:27,210 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:39:27,258 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/a1ff833e-2956-4ff2-a0aa-483edc84428e
2021-02-17 23:39:27,258 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:39:28,312 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/9d799bff-03b6-4561-b7af-d8996700a6ab
2021-02-17 23:39:28,312 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:39:28,381 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/ea3c29c8-9cd2-4275-b1ad-beda5fcc047e
2021-02-17 23:39:28,382 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:39:28,447 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/17047aa2-041c-4068-bb6d-8e3dc19a03b8
2021-02-17 23:39:28,448 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:39:28,496 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/a1ff833e-2956-4ff2-a0aa-483edc84428e
2021-02-17 23:39:28,497 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:39:29,584 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/9d799bff-03b6-4561-b7af-d8996700a6ab
2021-02-17 23:39:29,584 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:39:29,659 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/ea3c29c8-9cd2-4275-b1ad-beda5fcc047e
2021-02-17 23:39:29,659 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:39:29,711 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/17047aa2-041c-4068-bb6d-8e3dc19a03b8
2021-02-17 23:39:29,711 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:39:29,758 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/a1ff833e-2956-4ff2-a0aa-483edc84428e
2021-02-17 23:39:29,758 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:39:30,835 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/9d799bff-03b6-4561-b7af-d8996700a6ab
2021-02-17 23:39:30,835 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:39:30,902 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/ea3c29c8-9cd2-4275-b1ad-beda5fcc047e
2021-02-17 23:39:30,903 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:39:30,903 - INFO - No mount found for "fs-b27ee4f3.var.lib.ecs.volumes.ecs-www-worker-mail-914-JaknotWWWEFSVolume-eed1bcda838ab39e3b00.20346"
2021-02-17 23:39:30,903 - INFO - No mount found for "fs-b27ee4f3.var.lib.ecs.volumes.ecs-www-worker-mail-priority-818-JaknotWWWEFSVolume-d89faecba99aab910e00.20440"
2021-02-17 23:39:31,908 - INFO - No mount found for "fs-b27ee4f3.var.lib.ecs.volumes.ecs-www-worker-invoice-490-JaknotWWWEFSVolume-b2c290f5dce7d8883800.20330"
2021-02-17 23:39:31,960 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/ea3c29c8-9cd2-4275-b1ad-beda5fcc047e
2021-02-17 23:39:31,961 - ERROR - Cannot recreate self-signed certificate
2021-02-17 23:39:32,964 - INFO - No mount found for "fs-b27ee4f3.var.lib.ecs.volumes.ecs-www-worker-default-572-JaknotWWWEFSVolume-a294befdefaffccc5100.20122"
2021-02-17 23:40:01,045 - INFO - Unmount grace period expired for fs-b27ee4f3.var.lib.ecs.volumes.ecs-www-worker-mail-914-JaknotWWWEFSVolume-eed1bcda838ab39e3b00.20346
2021-02-17 23:40:01,045 - INFO - Terminating running TLS tunnel - PID: 1177, group ID: 1177
2021-02-17 23:40:01,045 - INFO - TLS tunnel: 1177 is still running, will retry termination
2021-02-17 23:40:01,047 - INFO - Unmount grace period expired for fs-b27ee4f3.var.lib.ecs.volumes.ecs-www-worker-mail-priority-818-JaknotWWWEFSVolume-d89faecba99aab910e00.20440
2021-02-17 23:40:01,047 - INFO - Terminating running TLS tunnel - PID: 791, group ID: 791
2021-02-17 23:40:01,050 - INFO - TLS tunnel: 791 is still running, will retry termination
2021-02-17 23:40:02,053 - INFO - Unmount grace period expired for fs-b27ee4f3.var.lib.ecs.volumes.ecs-www-worker-invoice-490-JaknotWWWEFSVolume-b2c290f5dce7d8883800.20330
2021-02-17 23:40:02,053 - INFO - Terminating running TLS tunnel - PID: 1843, group ID: 1843
2021-02-17 23:40:02,053 - INFO - TLS tunnel: 1843 is still running, will retry termination
2021-02-17 23:40:02,053 - INFO - Unmount grace period expired for fs-b27ee4f3.var.lib.ecs.volumes.ecs-www-worker-mail-914-JaknotWWWEFSVolume-eed1bcda838ab39e3b00.20346
2021-02-17 23:40:02,053 - INFO - TLS tunnel: 1177 is no longer running, cleaning up state
2021-02-17 23:40:02,054 - INFO - Unmount grace period expired for fs-b27ee4f3.var.lib.ecs.volumes.ecs-www-worker-mail-priority-818-JaknotWWWEFSVolume-d89faecba99aab910e00.20440
2021-02-17 23:40:02,054 - INFO - TLS tunnel: 791 is no longer running, cleaning up state
2021-02-17 23:40:03,057 - INFO - Unmount grace period expired for fs-b27ee4f3.var.lib.ecs.volumes.ecs-www-worker-invoice-490-JaknotWWWEFSVolume-b2c290f5dce7d8883800.20330
2021-02-17 23:40:03,057 - INFO - TLS tunnel: 1843 is no longer running, cleaning up state
2021-02-17 23:40:03,057 - INFO - Unmount grace period expired for fs-b27ee4f3.var.lib.ecs.volumes.ecs-www-worker-default-572-JaknotWWWEFSVolume-a294befdefaffccc5100.20122
2021-02-17 23:40:03,057 - INFO - Terminating running TLS tunnel - PID: 2064, group ID: 2064
2021-02-17 23:40:03,058 - INFO - TLS tunnel: 2064 is still running, will retry termination
2021-02-17 23:40:04,061 - INFO - Unmount grace period expired for fs-b27ee4f3.var.lib.ecs.volumes.ecs-www-worker-default-572-JaknotWWWEFSVolume-a294befdefaffccc5100.20122
2021-02-17 23:40:04,061 - INFO - TLS tunnel: 2064 is no longer running, cleaning up state
Cappuccinuo commented 3 years ago

@xian13 I think setting ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION value is the most straightforward way to solve the issue. You can keep the tls_cert_renewal_interval_min of efs-utils as default, and set ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION to a value that is small then 1h (1m will be the quickest time to clean up resources). As long as the watchdog does not try to retrieve the credentials from container, the error log will not be printed.

xian13 commented 3 years ago

@Cappuccinuo in the previous log I describe setting ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION value to 1m still make the error log printed If the value is greater than 1m, the printed error become significant larger

Lets say we put the value to 30m, there's a window of 30m error will be printed to the log And maybe more if the container that use EFS exit at the different time, then the error always printed to the log not just in 30m window

Shouldn't either ECS / EFS aware that if the volume is no longer use, then no need to get the credentials?

Cappuccinuo commented 3 years ago

@xian13 , can you share your efs-utils config file mount-watchdog part and confirm that the renewal interval min is larger than ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION? I set the time to 60min and there is no failed logs.

[mount-watchdog]
...

# Set client auth/access point certificate renewal rate. Minimum value is 1 minute.
tls_cert_renewal_interval_min = 60

If you set the renewal interval min > ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION, then this issue will not happen, since the watchdog only try to refresh the certificate after 60 min since mount.

On EFS side, we tell whether the volume is in use or not by if the customer is mounting the fs or not, if the fs is still mounted, then we treat the mount is still in use even though the container has been killed.

xian13 commented 3 years ago

Hi @Cappuccinuo

here is the config file

sh-4.2$ cat /etc/ecs/ecs.config
ECS_CLUSTER=Jaknot-ECS-JaknotECSSharedResources-14S6ZQBK63KMV-JaknotECSCluster-QGAEFBWVBSW7
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=1m
sh-4.2$ cat /etc/amazon/efs/efs-utils.conf
#
# Copyright 2017-2018 Amazon.com, Inc. and its affiliates. All Rights Reserved.
#
# Licensed under the MIT License. See the LICENSE accompanying this file
# for the specific language governing permissions and limitations under
# the License.
#

[DEFAULT]
logging_level = INFO
logging_max_bytes = 1048576
logging_file_count = 10
# mode for /var/run/efs and subdirectories in octal
state_file_dir_mode = 750

[mount]
dns_name_format = {fs_id}.efs.{region}.{dns_name_suffix}
dns_name_suffix = amazonaws.com
#The region of the file system when mounting from on-premises or cross region.
#region = us-east-1
stunnel_debug_enabled = false
#Uncomment the below option to save all stunnel logs for a file system to the same file
#stunnel_logs_file = /var/log/amazon/efs/{fs_id}.stunnel.log
stunnel_cafile = /etc/amazon/efs/efs-utils.crt

# Validate the certificate hostname on mount. This option is not supported by certain stunnel versions.
stunnel_check_cert_hostname = true

# Use OCSP to check certificate validity. This option is not supported by certain stunnel versions.
stunnel_check_cert_validity = false

# Define the port range that the TLS tunnel will choose from
port_range_lower_bound = 20049
port_range_upper_bound = 20449

[mount.cn-north-1]
dns_name_suffix = amazonaws.com.cn

[mount.cn-northwest-1]
dns_name_suffix = amazonaws.com.cn

[mount.us-iso-east-1]
dns_name_suffix = c2s.ic.gov

[mount.us-isob-east-1]
dns_name_suffix = sc2s.sgov.gov

[mount-watchdog]
enabled = true
poll_interval_sec = 1
unmount_grace_period_sec = 30

# Set client auth/access point certificate renewal rate. Minimum value is 1 minute.
tls_cert_renewal_interval_min = 60

[cloudwatch-log]
# enabled = true
log_group_name = /aws/efs/utils

# Possible values are : 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, and 3653
# Comment this config to prevent log deletion
retention_in_days = 14
[client-info]
source = ecs.ec2
sh-4.2$ cat /var/log/amazon/efs/mount-watchdog.log | grep "Failed to retrieve AWS security" | head -10
2021-02-20 22:39:01,850 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/5301480e-e0f7-410c-a99a-04c9a485b7b0
2021-02-20 22:39:02,961 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/7c3d8875-a970-48b5-a110-5c6f55e4a924
2021-02-20 22:39:03,086 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/2e31ea09-1ac4-41d2-9557-8e3a8fb8a742
2021-02-20 22:39:03,198 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/9817112c-82ee-4983-83bf-51d5b881b330
2021-02-20 22:39:03,332 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/691e40a0-377c-4a18-97d6-a023bea6d865
2021-02-20 22:39:03,454 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/729379d6-86f0-4525-ad51-1eaf94cf7fd3
2021-02-20 22:39:03,596 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/5301480e-e0f7-410c-a99a-04c9a485b7b0
2021-02-20 22:39:04,640 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/7c3d8875-a970-48b5-a110-5c6f55e4a924
2021-02-20 22:39:04,678 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/2e31ea09-1ac4-41d2-9557-8e3a8fb8a742
2021-02-20 22:39:04,731 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/9817112c-82ee-4983-83bf-51d5b881b330

I'm using the AWS ECS optimized AMI and not changing anything for the efs utils

Cappuccinuo commented 3 years ago

@xian13 Did you restart the ecs agent? And when the task is started and killed?

xian13 commented 3 years ago

@Cappuccinuo I insert ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION through EC2 User Data in my CloudFormation template and redeploy it

So technically it's a new instance and no need to restart it manually

Cappuccinuo commented 3 years ago

@xian13 , if the agent is start before the cloud-init, then you still need to restart the agent right? Can you confirm that the config is updated before ecs agent running?

Say at 00:00 you launch a new instance with the new user data -> at 00:05 the task is running and the file system is mounted -> at 00:10 the task is terminated -> at 01:10 the watchdog is printing the error log, this should be the clean up config is not effective.

xian13 commented 3 years ago

@Cappuccinuo Because I'm using AWS ECS Optimized Instance, isn't the flow which one cloud-ini and agent started should be already taken care of by AWS team?

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/bootstrap_container_instance.html What my understanding from this documentation is that if we want to config the agent should be done by using user data

Cappuccinuo commented 3 years ago

@xian13 , are you saying that after applying the ecs agent config, the issue is not mitigated at all? Or there are only errors during the 1m period? How many tasks are seeing such errors?

Based on your efs-utils config file, efs-utils watchdog only refresh the tls certificate after 60 min.

From my tests on a ECS agent that is restarted after the config, what I see from the log is

# Triggering task termination
2021-02-23 08:52:14,659 - DEBUG - Current local NFS mounts: [Mount(server='127.0.0.1:/', mountpoint='/var/lib/ecs/volumes/ecs-Github83-2-EFS-f081fa8a8af3bcdc2f00', type='nfs4', options='rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,port=20067,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1', freq='0', passno='0')]
2021-02-23 08:52:14,659 - DEBUG - Translating "fs-*.var.lib.ecs.volumes.ecs-Github83-2-EFS-f081fa8a8af3bcdc2f00.20067" into mount point and port "var.lib.ecs.volumes.ecs-Github83-2-EFS-f081fa8a8af3bcdc2f00.20067"
2021-02-23 08:52:14,659 - DEBUG - Current state files in "/var/run/efs": ['fs-*.var.lib.ecs.volumes.ecs-Github83-2-EFS-f081fa8a8af3bcdc2f00.20067']
2021-02-23 08:52:14,659 - DEBUG - TLS tunnel for fs-*.var.lib.ecs.volumes.ecs-Github83-2-EFS-f081fa8a8af3bcdc2f00.20067 is running
2021-02-23 08:52:15,660 - DEBUG - Current local NFS mounts: []
2021-02-23 08:52:15,661 - DEBUG - Translating "fs-*.var.lib.ecs.volumes.ecs-Github83-2-EFS-f081fa8a8af3bcdc2f00.20067" into mount point and port "var.lib.ecs.volumes.ecs-Github83-2-EFS-f081fa8a8af3bcdc2f00.20067"
2021-02-23 08:52:15,661 - DEBUG - Current state files in "/var/run/efs": ['fs-*.var.lib.ecs.volumes.ecs-Github83-2-EFS-f081fa8a8af3bcdc2f00.20067']
2021-02-23 08:52:15,661 - INFO - No mount found for "fs-*.var.lib.ecs.volumes.ecs-Github83-2-EFS-f081fa8a8af3bcdc2f00.20067"
2021-02-23 08:52:15,661 - DEBUG - Marking fs-*.var.lib.ecs.volumes.ecs-Github83-2-EFS-f081fa8a8af3bcdc2f00.20067 as unmounted at 1614070335

...

2021-02-23 08:52:45,715 - INFO - Unmount grace period expired for fs-*.var.lib.ecs.volumes.ecs-Github83-2-EFS-f081fa8a8af3bcdc2f00.20067
2021-02-23 08:52:45,715 - INFO - Terminating running TLS tunnel - PID: 16086, group ID: 16086
2021-02-23 08:52:45,716 - INFO - TLS tunnel: 16086 is still running, will retry termination
2021-02-23 08:52:46,717 - DEBUG - Current local NFS mounts: []
2021-02-23 08:52:46,717 - DEBUG - Translating "fs-*.var.lib.ecs.volumes.ecs-Github83-2-EFS-f081fa8a8af3bcdc2f00.20067" into mount point and port "var.lib.ecs.volumes.ecs-Github83-2-EFS-f081fa8a8af3bcdc2f00.20067"
2021-02-23 08:52:46,717 - DEBUG - Current state files in "/var/run/efs": ['fs-*.var.lib.ecs.volumes.ecs-Github83-2-EFS-f081fa8a8af3bcdc2f00.20067']
2021-02-23 08:52:46,717 - INFO - Unmount grace period expired for fs-*.var.lib.ecs.volumes.ecs-Github83-2-EFS-f081fa8a8af3bcdc2f00.20067
2021-02-23 08:52:46,717 - INFO - TLS tunnel: 16086 is no longer running, cleaning up state
2021-02-23 08:52:46,717 - DEBUG - Deleting /var/run/efs/stunnel-config.fs-*.var.lib.ecs.volumes.ecs-Github83-2-EFS-f081fa8a8af3bcdc2f00.20067
2021-02-23 08:52:46,718 - DEBUG - Deleted /var/run/efs/stunnel-config.fs-*.var.lib.ecs.volumes.ecs-Github83-2-EFS-f081fa8a8af3bcdc2f00.20067
2021-02-23 08:52:47,719 - DEBUG - Current local NFS mounts: []
2021-02-23 08:52:47,719 - DEBUG - Current state files in "/var/run/efs": []
sparrc commented 3 years ago

Since @xian13 said the errors occurred for about a minute, it does sound like the ECS cleanup wait duration is taking effect. I'm curious if the polling intervals configured in the watchdog have any effect on still seeing the errors. In particular Im curious about poll_interval_sec since you appear to be getting the errors about once every second til the task is cleaned up:

[mount-watchdog]
enabled = true
poll_interval_sec = 1
unmount_grace_period_sec = 30

# Set client auth/access point certificate renewal rate. Minimum value is 1 minute.
tls_cert_renewal_interval_min = 60
xian13 commented 3 years ago

Sorry for late reply

@Cappuccinuo Yes, I'm seeing the error in 1 minute window So for now, my assumption is that the error will still printed to the log for the amount of ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION value So I think always setting the cleanup wait duration to 1 minutes is not the best choice because it also makes there's no window to check the container before it is cleaned

Here is the example again: The error printed around 38:33 to 39:33 (1 minute window) in /var/log/amazon/efs/mount-watchdog.log

sh-4.2$ cat /var/log/amazon/efs/mount-watchdog.log | grep "Failed to retrieve AWS security" | head -n 102021-03-08 04:38:33,700 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/80cfe48a-b5c6-45c5-938a-f7bd3f0fd8d4
2021-03-08 04:38:34,712 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/d1504064-7cf1-4e40-9aeb-7c89c1aa5969
2021-03-08 04:38:34,720 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/80cfe48a-b5c6-45c5-938a-f7bd3f0fd8d42021-03-08 04:38:35,749 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/d1504064-7cf1-4e40-9aeb-7c89c1aa5969
2021-03-08 04:38:35,763 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/80cfe48a-b5c6-45c5-938a-f7bd3f0fd8d4
2021-03-08 04:38:36,774 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/d1504064-7cf1-4e40-9aeb-7c89c1aa59692021-03-08 04:38:36,782 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/80cfe48a-b5c6-45c5-938a-f7bd3f0fd8d4
2021-03-08 04:38:37,798 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/d1504064-7cf1-4e40-9aeb-7c89c1aa5969
2021-03-08 04:38:37,815 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/80cfe48a-b5c6-45c5-938a-f7bd3f0fd8d42021-03-08 04:38:38,829 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/d1504064-7cf1-4e40-9aeb-7c89c1aa5969
sh-4.2$ cat /var/log/amazon/efs/mount-watchdog.log | grep "Failed to retrieve AWS security" | tail -n 10
2021-03-08 06:39:29,364 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/2ccce2a7-994c-4758-9ba8-1de5aa338efd2021-03-08 06:39:29,374 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/20a674ac-66ff-4226-bacd-4b428d6deff1
2021-03-08 06:39:30,388 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/2ccce2a7-994c-4758-9ba8-1de5aa338efd
2021-03-08 06:39:30,399 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/20a674ac-66ff-4226-bacd-4b428d6deff12021-03-08 06:39:31,412 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/2ccce2a7-994c-4758-9ba8-1de5aa338efd
2021-03-08 06:39:31,420 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/20a674ac-66ff-4226-bacd-4b428d6deff1
2021-03-08 06:39:32,433 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/2ccce2a7-994c-4758-9ba8-1de5aa338efd2021-03-08 06:39:32,442 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/20a674ac-66ff-4226-bacd-4b428d6deff1
2021-03-08 06:39:33,457 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/2ccce2a7-994c-4758-9ba8-1de5aa338efd
2021-03-08 06:39:33,475 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/20a674ac-66ff-4226-bacd-4b428d6deff1

The same error also printed in /var/log/ecs/ecs-agent.log on the same time window

sh-4.2$ cat /var/log/ecs/ecs-agent.log.2021-03-08-04 | grep "Credentials not found" | head -n 10level=error time=2021-03-08T04:38:33Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-03-08T04:38:33Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-03-08T04:38:34Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.golevel=error time=2021-03-08T04:38:34Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-03-08T04:38:34Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-03-08T04:38:34Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.golevel=error time=2021-03-08T04:38:35Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-03-08T04:38:35Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-03-08T04:38:35Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.golevel=error time=2021-03-08T04:38:35Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
sh-4.2$ cat /var/log/ecs/ecs-agent.log.2021-03-08-04 | grep "Credentials not found" | tail -n 10
level=error time=2021-03-08T04:39:31Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.golevel=error time=2021-03-08T04:39:31Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-03-08T04:39:32Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-03-08T04:39:32Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.golevel=error time=2021-03-08T04:39:32Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-03-08T04:39:32Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-03-08T04:39:33Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.golevel=error time=2021-03-08T04:39:33Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-03-08T04:39:33Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-03-08T04:39:33Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go

@sparrc is your suggestion to change the value of poll_interval_sec in efs-utils?

mythri-garaga commented 3 years ago

@xian13 Can you please provide time of the task start and stop? As @Cappuccinuo mentioned earlier the certs renewal should not happen for 60 min in this case.

xian13 commented 3 years ago

Hi @mythri-garaga can you guide me how to get the task start and stop time from the given error log on /var/log/amazon/efs/mount-watchdog.log and /var/log/ecs/ecs-agent.log?

shubham2892 commented 3 years ago

@xian13 Can you share the logs at ecs-agent-external@amazon.com.

xian13 commented 3 years ago

Hi @shubham2892 , can you give me more specific which logs to be shared? is it both /var/log/ecs/ecs-agent.log and /var/log/amazon/efs/mount-watchdog.log only or any other logs?

Last request by @mythri-garaga is the start and stop time of the task, is the information available in those files?

sparrc commented 3 years ago

I dont think we need any more logs for this case at the moment, but I believe shubham2892@ was asking for you to run the ecs logs collector

@xian13 yes please try adjusting poll_interval_sec.

xian13 commented 3 years ago

@sparrc is it like increasing poll_interval_sec into 1 or 2 minutes to reduce the chance a log is written in the 1 minute window (ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION) between task killed and cleaned?

mythri-garaga commented 3 years ago

@sparrc is it like increasing poll_interval_sec into 1 or 2 minutes to reduce the chance a log is written in the 1 minute window (ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION) between task killed and cleaned?

Yes, so that we can observe logs when mount-watchdog does not request credentials during that time frame.

sparrc commented 3 years ago

@xian13 any updates?

xian13 commented 3 years ago

Hi @sparrc sorry for late response

Today I just update the poll_interval_sec = 60 inside /etc/amazon/efs/efs-utils.conf The instances just running for 1.5 hours now

I run this commands on the instances

cat /var/log/amazon/efs/mount-watchdog.log | grep "Failed to retrieve AWS security"
cat /var/log/ecs/ecs-agent.log | grep "Credentials not found"

Until now there is no error log printed, will update this again later after more hours

So for now this is the only solution?

As from my point of view, it try to removing the log by trading it with other consequences By making ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION into 1 minute and poll_interval_sec into 60s is that:

  1. The window to look at the log before clean up is only 1 minute? Isn't that too short?
  2. What is the consequences of increasing poll_interval_sec is that the watchdog will less frequent check the daemon and adding chance the system become unresponsive for 1 minute?
xian13 commented 3 years ago

Latest update

I still see some errors in efs/mounth-wacthdog.log but not in ecs/ecs-agent.log

sh-4.2$ cat /var/log/amazon/efs/mount-watchdog.log | grep "Failed to retrieve AWS security"
2021-05-18 07:38:42,298 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/ceefc713-abe8-4197-bce9-0779f1cf562b
sh-4.2$ cat /var/log/ecs/ecs-agent.log | grep "Credentials not found"
sh-4.2$ cat /var/log/amazon/efs/mount-watchdog.log | grep "Failed to retrieve AWS security"
2021-05-18 04:39:30,134 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/21f99236-a55f-4278-abd0-fa57824ee4e8
2021-05-18 04:39:30,154 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/5a848810-c20c-40c7-8e0b-235c6bfa969f
2021-05-18 04:39:30,174 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/b36981bf-a93f-4af3-bcfb-1feafa67f566
2021-05-18 04:39:30,195 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/f15ef8df-9f79-4d34-8183-c45fd1233391
2021-05-18 05:39:32,591 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/e75805a7-a772-4043-8558-e70ee4f1524c
2021-05-18 05:39:32,600 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/4efa052e-e02e-4e78-8d8e-d38855ac2ea8
2021-05-18 05:39:32,608 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/2f696186-7b27-4358-b0e4-21067ec609a9
2021-05-18 07:38:37,863 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/eb941425-2350-4ea1-926e-95b25b2f7b9d
sh-4.2$ cat /var/log/ecs/ecs-agent.log | grep "Credentials not found"
sh-4.2$ cat /var/log/amazon/efs/mount-watchdog.log | grep "Failed to retrieve AWS security"
2021-05-18 04:39:07,472 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/d041963a-ddf8-46ba-9b5e-4b23d2c92689
2021-05-18 05:39:10,300 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/76eea90a-1a10-4a50-97ac-d7edfd699ddc
2021-05-18 05:39:10,307 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/2fe6e01b-186d-4553-9b22-ad332deb355f
2021-05-18 06:39:13,361 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/07ee52c9-b556-4944-b96e-5deb20514274
2021-05-18 06:39:13,370 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/a1cd30e2-f3a9-4430-a8f5-7f900eac3dd5
2021-05-18 07:39:15,970 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/840cd58c-2a49-4b06-b8a2-161f9db0e02c
2021-05-18 07:39:15,988 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/729d7538-369f-400e-8e1e-fc973b40ad02
sh-4.2$ cat /var/log/ecs/ecs-agent.log | grep "Credentials not found"
xian13 commented 3 years ago

It turns out the error still printed on ecs-agent.log Previously the error not shown because ecs-agent.log is archived every hour

Here is the example

sh-4.2$ cat /var/log/amazon/efs/mount-watchdog.log | grep "Failed to retrieve AWS security"
2021-05-18 07:38:42,298 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/ceefc713-abe8-4197-bce9-0779f1cf562b
2021-05-18 10:38:50,118 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/fb43239b-f3f2-4e06-8263-b1384a4dc848
2021-05-18 11:38:53,006 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/63b8ec75-8749-4ce5-bec3-c8f9e52cdb29
2021-05-18 11:38:53,014 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/73070725-4678-414a-baf1-e64363c7edaa
2021-05-18 12:38:55,575 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/d648ffa0-0ec6-4894-8ad1-a9ccfe8b9805
2021-05-18 12:38:55,581 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/ad954f82-0f36-4bfd-8d67-d1fe2f265756
2021-05-18 12:38:55,588 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/583ba59a-8124-4862-bcb7-4cb8350da462
2021-05-18 13:38:58,278 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/0b0a8308-f9a9-4f3c-83c7-f89651c27e5d
2021-05-18 13:38:58,287 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/7883ee3e-ffc1-4c5c-ad2f-d7fd0fee61b3
2021-05-18 13:38:58,294 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/161d8a16-27b6-44c8-8247-8cff17543c98
2021-05-18 13:38:58,300 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/a8c0ab8d-294c-4115-8d1d-cd1f4b4a7d17
2021-05-18 14:39:01,244 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/23223832-c891-4422-b407-587cdef8de37
2021-05-18 14:39:01,298 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/b651a019-92ae-49bc-81e1-479b1f026671
2021-05-18 14:39:01,344 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/d48fda30-6497-4aba-875b-79e7b0ad5262
2021-05-18 15:39:04,177 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/d5556b82-552f-4b52-888e-e37d561ab86f
2021-05-18 15:39:04,183 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/df39b653-0477-420a-8e32-92fe11273b61
2021-05-18 15:39:04,190 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/c0aeea59-ade2-4e00-bbc9-3ce4aa8a02e6
2021-05-18 15:39:04,197 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/69ff6cfb-3d12-4312-af03-89508d093ba1
2021-05-18 15:39:04,204 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/723b0139-a525-4251-b75e-3ca15d5e83d1
2021-05-18 16:39:06,873 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/61d4cc26-caf1-45a0-a6c6-fd58a41764ba
2021-05-18 16:39:06,880 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/a5dc254e-4a0b-4cfd-ae59-6c08cb3a3f1b
2021-05-18 16:39:06,888 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/193d10bb-86ca-4928-a181-9d81833d5fd3
2021-05-18 16:39:06,895 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/26956789-39c9-4c21-8f1f-a33e5151d9c9
2021-05-18 17:39:09,546 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/18480f9c-2ca4-4be7-92e7-47070e78358e
2021-05-18 17:39:09,553 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/c4d5b6e4-b51d-4ca9-bb5f-c58aaba912bb
2021-05-18 17:39:09,559 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/2b13eb0c-4635-4236-87bf-ce804ca92d00
2021-05-18 17:39:09,566 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/754d1697-8df6-42a1-a4d7-4db354b70462
2021-05-18 18:39:12,348 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/d9ecde53-a762-4e05-a736-02721bcc60c1
2021-05-18 18:39:12,356 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/ba23c9d2-eac2-4a98-aef5-47e3ba8d9bc4
2021-05-18 18:39:12,363 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/1c11c32a-272b-4d30-a97f-c636123a2922
2021-05-18 18:39:12,370 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/1c234860-3e55-498f-86d1-d1f06210e105
2021-05-18 18:39:12,378 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/218a612d-a4d5-445d-84ee-077476d91d16
2021-05-18 18:39:12,385 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/3a01b365-246f-402e-9422-9c7df0d5d919
2021-05-18 19:39:15,027 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/02b8fb75-707a-4b83-b13b-b80d8d167d4f
2021-05-18 19:39:15,042 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/8c203cef-c16b-4156-b102-ed0281092da6
2021-05-18 20:39:17,863 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/522b788f-b4fa-4bd7-9319-58c490bf0afe
2021-05-18 20:39:17,869 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/9d8c8713-df44-4999-b1e7-2b7e3c8789cb
2021-05-18 21:39:21,011 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/97790514-2a6a-4acb-bc45-985684c351a0
2021-05-18 21:39:21,017 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/8c405277-852a-414f-b42b-1c14c39f8197
2021-05-18 21:39:21,025 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/798b069f-45ed-452b-a7d1-e7dba713901e
2021-05-18 21:39:21,031 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/a6b6a1c6-a66d-4632-828a-c993db563143
2021-05-18 22:39:23,729 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/77308e31-8580-49f2-9346-52bbe826be7b
2021-05-18 22:39:23,735 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/899bc67b-91ff-4fdf-a837-871c89447285
2021-05-18 22:39:23,743 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/d9e70065-458c-4f3e-9936-3898a9d09af2
2021-05-18 22:39:23,749 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/9e02f601-3de7-4f96-953a-e3694998c98e
2021-05-18 23:39:26,462 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/02622010-08cc-45c7-b051-09c1d6497e32
2021-05-18 23:39:26,470 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/eadf6375-07ae-4bd5-893e-da5a945c65bf
2021-05-18 23:39:26,477 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/f84b6d97-4931-43c9-a678-e3ee7859801b
2021-05-18 23:39:26,484 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/bc3bec21-a507-4b76-a087-1060fa33da9b
2021-05-18 23:39:26,492 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/bf24d4fb-094b-4be5-839a-f6c30d42cafe
2021-05-18 23:39:26,499 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/ef97e0a3-f787-4f0c-bb8d-719a58445c88
2021-05-19 00:39:29,252 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/b521a3e3-a7aa-413a-8758-9cd114b22cf9
2021-05-19 00:39:29,259 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/94254fac-f8b8-402e-bc46-bfaf3a162609
2021-05-19 00:39:29,267 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/4cc557f8-4502-46d2-96db-2bdd5e9becac
2021-05-19 00:39:29,274 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/7077eb2e-d9ff-494f-954d-2eba7ecfea7c
2021-05-19 01:39:31,955 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/6ba031e5-64ed-4047-a263-0072095d0a67
2021-05-19 01:39:31,961 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/524935aa-cc7e-4846-bf00-921f09be0a04
2021-05-19 01:39:31,968 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/a330342d-17f0-4b34-8218-bf6d96292eb7
2021-05-19 01:39:31,974 - ERROR - Failed to retrieve AWS security credentials using lookup method: ecs:/v2/credentials/34c3c76b-56f9-4b00-b92b-1a974df444ef
sh-4.2$ cat /var/log/ecs/ecs-agent.log* | grep "Credentials not found"
level=error time=2021-05-18T07:38:42Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T07:38:42Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T10:38:50Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T10:38:50Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T11:38:53Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T11:38:53Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T11:38:53Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T11:38:53Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T12:38:55Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T12:38:55Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T12:38:55Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T12:38:55Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T12:38:55Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T12:38:55Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T13:38:58Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T13:38:58Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T13:38:58Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T13:38:58Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T13:38:58Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T13:38:58Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T13:38:58Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T13:38:58Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T14:39:01Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T14:39:01Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T14:39:01Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T14:39:01Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T14:39:01Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T14:39:01Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T15:39:04Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T15:39:04Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T15:39:04Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T15:39:04Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T15:39:04Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T15:39:04Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T15:39:04Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T15:39:04Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T15:39:04Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T15:39:04Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T16:39:06Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T16:39:06Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T16:39:06Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T16:39:06Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T16:39:06Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T16:39:06Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T16:39:06Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T16:39:06Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T17:39:09Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T17:39:09Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T17:39:09Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T17:39:09Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T17:39:09Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T17:39:09Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T17:39:09Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T17:39:09Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T18:39:12Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T18:39:12Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T18:39:12Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T18:39:12Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T18:39:12Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T18:39:12Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T18:39:12Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T18:39:12Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T18:39:12Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T18:39:12Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T18:39:12Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T18:39:12Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T19:39:15Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T19:39:15Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T19:39:15Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T19:39:15Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T20:39:17Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T20:39:17Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T20:39:17Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T20:39:17Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T21:39:21Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T21:39:21Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T21:39:21Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T21:39:21Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T21:39:21Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T21:39:21Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T21:39:21Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T21:39:21Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T22:39:23Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T22:39:23Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T22:39:23Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T22:39:23Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T22:39:23Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T22:39:23Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T22:39:23Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T22:39:23Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T23:39:26Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T23:39:26Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T23:39:26Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T23:39:26Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T23:39:26Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T23:39:26Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T23:39:26Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T23:39:26Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T23:39:26Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T23:39:26Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-18T23:39:26Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-18T23:39:26Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-19T00:39:29Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-19T00:39:29Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-19T00:39:29Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-19T00:39:29Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-19T00:39:29Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-19T00:39:29Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-19T00:39:29Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-19T00:39:29Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-19T01:39:31Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-19T01:39:31Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-19T01:39:31Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-19T01:39:31Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-19T01:39:31Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-19T01:39:31Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go
level=error time=2021-05-19T01:39:31Z msg="Error processing credential request: CredentialsV2Request: Credentials not found" module=credentials_handler.go
level=error time=2021-05-19T01:39:31Z msg="HTTP response status code is '400', request type is: credentials, and response in JSON is {\"code\":\"InvalidIdInRequest\",\"message\":\"CredentialsV2Request: Credentials not found\",\"HTTPErrorCode\":400}" module=helpers.go

I will try to update poll_interval_sec into 120

xian13 commented 3 years ago

Even after updating the poll_interval_sec into 120 the error still printed to the log

sparrc commented 3 years ago

sorry for the misdirection, but according to comment from @Cappuccinuo here: https://github.com/aws/amazon-ecs-agent/issues/2810#issuecomment-781117228 the correct config variable to adjust is tls_cert_renewal_interval_min

So Im curious what exactly is the timeline of your issue? Is it possible that this is a timing issue where you are killing tasks right around the time that they decide they need to renew their tls cert? Could you also try increaing tls_cert_renewal_interval_min to some larger value?

xian13 commented 3 years ago

@sparrc The EFS is used for application deployed as multiple tasks that has 2 types:

Both of this types will stop and created new when there is new deployment, as with new deployment there's no timeline as this depends on the release

For web server there's an autoscaling based on traffic For long running queue consumer, it will restarted every hour specific time using cron, when a container deployed because of release at minute 25 and restarted at minute 30, then the timeline is only for 5 minutes

From here https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-config.html what I understand that set up ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION to 1 minutes mean that the container is removed 1 minutes after task stopped

The tls_cert_renewal_interval_min default is 60 minutes from https://github.com/aws/amazon-ecs-agent/issues/2810#issuecomment-781117228 the tls_cert_renewal_interval_min already bigger than ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION (which now set to 1 minute)

I still believe setting ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION to 1 minute is not the best solution I think the problem raise because EFS still try to renew the certificate when the task already killed

mythri-garaga commented 3 years ago

Hi @xian13

Correct me if I am wrong, so the task that runs the “long running queue consumer” application is being stopped here every 60 minutes due to new deployment?

That would confirm that task is being stopped around the time of tls cert being renewed.

xian13 commented 3 years ago

@mythri-garaga No, there are 2 conditions that the container exit

  1. Every new deployment, which we can't predict the time
  2. Hourly at specific time cronjob style, not 60 minutes after the container created

Like my explanation earlier, if container created at minute 25, and hourly restart at minute 30, then container is restarted with 5 minutes after it's created

mythri-garaga commented 3 years ago

To further investigate this issue, could you please run the logs collector - https://github.com/aws/amazon-ecs-logs-collector and share them at at ecs-agent-external@amazon.com? Thank you.

xian13 commented 3 years ago

Hi @mythri-garaga

I've send the log to ecs-agent-external@amazon.com with subject ECS Logs for Issue 2810 Beside ecs log collector I also submit the log file that contains the error, here is the files inside of the zip:

collect-i-019ac4222530e853a.tgz
ecs-agent.log.2021-06-30-01
ecs-agent.log.2021-06-30-03
ecs-agent.log.2021-06-30-05
ecs-agent.log.2021-06-30-09
ecs-agent.log.2021-06-30-10

Thanks

sharanyad commented 3 years ago

@Cappuccinuo are there implications of setting poll_interval_sec in the efs utils conf, that changes the rate at which credentials are retrieved?

Cappuccinuo commented 3 years ago

poll_interval_sec is used for watchdog to poll the mount status, and manage the stunnel running for the tls mounts, as well as checking and updating the credentials file. Increasing the value will post potential risk on disconnect issue, e.g. if a stunnel is crash for some reason for the mount, and the watchdog waits more than 1 sec to restart the stunnel, the client won't be able to connect to server during the period.

tls_cert_renewal_interval_min happens when the mount is more than the given number of time, and the credentials should be refreshed.

Changing poll_interval_sec, can stop the log emit, it will require a watchdog reboot to make it effective. Since the config value retrieval is before the while loop check. It is not recommended to make the value too small though.

If the mount is umounted properly after the container is killed, the issue won't be happening anymore. Is that possible to immediately force umount the file system once the task is killed?

angelcar commented 2 years ago

Following up on this. I submitted a PR to lower the minimum value for ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION from 1 minute to 1 second. This might mitigate the issue you're observing.

I will post an update when the change is released so you can test.

holstvoogd commented 2 years ago

Not sure if this is the same issue, if not I will make a new one, let me know.

ECS leak a bunch of stunnel processes & kthreads, possibly only for tasks that die and/or do not start. One of my instance has a load of 250 now because of all the leaking threads. A few example processes:

....
32358 ?        Ss     0:00 /usr/bin/stunnel /var/run/efs/stunnel-config.....
32471 ?        Ss     0:00 /usr/bin/stunnel /var/run/efs/stunnel-config.....
32565 ?        Ss     0:00 /usr/bin/stunnel /var/run/efs/stunnel-config.....
32703 ?        D      0:00 [127.0.0.1-manag]
32731 ?        D      0:00 [127.0.0.1-manag]

$ ps ax | grep stunnel | wc -l
248
$ ps ax | grep manag] | wc -l
244

As mentioned, this seems to happen when I define a service for which the image does not start or dies quickly. It might just be all tasks, but since the usually don't restart that often it is unnoticable.

yinyic commented 2 years ago

Hi @holstvoogd, would you confirm whether the EFS volumes associated with the stopped task were successfully unmounted? ECS Agent should output "Task engine [$your_task_arn]: resource EFS cleanup complete". Example output can be found in this comment

If the volume was unmounted, the actual process that reaps stunnel is EFS watch dog (part of efs-utils Github repo). Here's a similar issue that may be helpful.

holstvoogd commented 2 years ago

Hi @yinyic,

I can see they are being unmounted, but with a different message: "Removing volume ..." "Unmounted volume ... successfully."

On closer inspection, the stunnels are reaped properly aswell. I just end up with tons of '[127.0.0.1-manag]' processes cluttering up my servers & cause a very high load without actually doing anything.

prateekchaudhry commented 2 years ago

Hi @holstvoogd, it is hard to tell for sure from the given information what the issue is related to.

The processes are stuck in D status, likely blocked for IO. But it is unclear what these processes are and why they are stuck.

Could you run the https://github.com/aws/amazon-ecs-logs-collector and share the logs, steps to reproduce the issue and the task def that you are using? You can email the information to ecs-agent-external@amazon.com

angelcar commented 2 years ago

Closing due to inactivity. Please feel free to reopen if necessary.