theirix opened this issue 3 years ago
Can you please check the CloudFormation web UI when deploying with compose up, to see which resource triggered this "ClusterNotFoundException: Cluster not found" error?
I can't see any reason it would fail with compose up but not when applying the same template using the aws CLI (we actually just invoke the CloudFormation "apply" API).
Makes me wonder whether compose up runs the deployment in the expected eu-west-1 region - we should add this to the debug logs.
According to the CF event log, the first error is on the WebService resource - "Resource creation cancelled". It fails almost immediately. Again, deploying the converted template with the aws cloudformation CLI succeeds in about five minutes (WebService creation is the longest step).
WebService has an ARN in the correct region, 'arn:aws:ecs:eu-west-1:909329722030:service/monty-ecs-cluster/ecs-bug-1084-WebService-IrVAGUPSOd3X', nothing suspicious.
Full logs:
2021-01-05 13:51:03 UTC+0300 ecs-bug-1084 DELETE_COMPLETE -
2021-01-05 13:51:02 UTC+0300 CloudMap DELETE_COMPLETE -
2021-01-05 13:50:18 UTC+0300 WebTaskExecutionRole DELETE_COMPLETE -
2021-01-05 13:50:17 UTC+0300 WebTCP80TargetGroup DELETE_COMPLETE -
2021-01-05 13:50:17 UTC+0300 WebTaskExecutionRole DELETE_IN_PROGRESS -
2021-01-05 13:50:17 UTC+0300 WebTCP80TargetGroup DELETE_IN_PROGRESS -
2021-01-05 13:50:16 UTC+0300 CloudMap DELETE_IN_PROGRESS -
2021-01-05 13:50:16 UTC+0300 WebTaskDefinition DELETE_COMPLETE -
2021-01-05 13:50:16 UTC+0300 WebTCP80Listener DELETE_COMPLETE -
2021-01-05 13:50:15 UTC+0300 WebServiceDiscoveryEntry DELETE_COMPLETE -
2021-01-05 13:50:15 UTC+0300 DefaultNetwork DELETE_COMPLETE -
2021-01-05 13:50:14 UTC+0300 WebTCP80Listener DELETE_IN_PROGRESS -
2021-01-05 13:50:14 UTC+0300 WebTaskDefinition DELETE_IN_PROGRESS -
2021-01-05 13:50:14 UTC+0300 WebServiceDiscoveryEntry DELETE_IN_PROGRESS -
2021-01-05 13:50:14 UTC+0300 DefaultNetwork DELETE_IN_PROGRESS -
2021-01-05 13:50:14 UTC+0300 WebService DELETE_COMPLETE -
2021-01-05 13:48:37 UTC+0300 LogGroup DELETE_COMPLETE -
2021-01-05 13:48:36 UTC+0300 Default80Ingress DELETE_COMPLETE -
2021-01-05 13:48:36 UTC+0300 DefaultNetworkIngress DELETE_COMPLETE -
2021-01-05 13:48:36 UTC+0300 Default80Ingress DELETE_IN_PROGRESS -
2021-01-05 13:48:35 UTC+0300 WebService DELETE_IN_PROGRESS -
2021-01-05 13:48:35 UTC+0300 DefaultNetworkIngress DELETE_IN_PROGRESS -
2021-01-05 13:48:35 UTC+0300 LogGroup DELETE_IN_PROGRESS -
2021-01-05 13:48:20 UTC+0300 WebService CREATE_FAILED Resource creation cancelled
2021-01-05 13:48:19 UTC+0300 ecs-bug-1084 DELETE_IN_PROGRESS User Initiated
2021-01-05 13:48:19 UTC+0300 WebService CREATE_IN_PROGRESS Resource creation Initiated
2021-01-05 13:48:16 UTC+0300 WebService CREATE_IN_PROGRESS -
2021-01-05 13:48:15 UTC+0300 WebServiceDiscoveryEntry CREATE_COMPLETE -
2021-01-05 13:48:14 UTC+0300 WebServiceDiscoveryEntry CREATE_IN_PROGRESS Resource creation Initiated
2021-01-05 13:48:12 UTC+0300 WebServiceDiscoveryEntry CREATE_IN_PROGRESS -
2021-01-05 13:48:10 UTC+0300 CloudMap CREATE_COMPLETE -
2021-01-05 13:47:44 UTC+0300 WebTaskDefinition CREATE_COMPLETE -
2021-01-05 13:47:44 UTC+0300 WebTaskDefinition CREATE_IN_PROGRESS Resource creation Initiated
2021-01-05 13:47:42 UTC+0300 WebTaskDefinition CREATE_IN_PROGRESS -
2021-01-05 13:47:40 UTC+0300 WebTaskExecutionRole CREATE_COMPLETE -
2021-01-05 13:47:31 UTC+0300 DefaultNetworkIngress CREATE_COMPLETE -
2021-01-05 13:47:31 UTC+0300 Default80Ingress CREATE_COMPLETE -
2021-01-05 13:47:31 UTC+0300 DefaultNetworkIngress CREATE_IN_PROGRESS Resource creation Initiated
2021-01-05 13:47:31 UTC+0300 Default80Ingress CREATE_IN_PROGRESS Resource creation Initiated
2021-01-05 13:47:30 UTC+0300 DefaultNetworkIngress CREATE_IN_PROGRESS -
2021-01-05 13:47:30 UTC+0300 Default80Ingress CREATE_IN_PROGRESS -
2021-01-05 13:47:29 UTC+0300 DefaultNetwork CREATE_COMPLETE -
2021-01-05 13:47:28 UTC+0300 WebTCP80Listener CREATE_COMPLETE -
2021-01-05 13:47:28 UTC+0300 DefaultNetwork CREATE_IN_PROGRESS Resource creation Initiated
2021-01-05 13:47:28 UTC+0300 WebTCP80Listener CREATE_IN_PROGRESS Resource creation Initiated
2021-01-05 13:47:26 UTC+0300 WebTCP80Listener CREATE_IN_PROGRESS -
2021-01-05 13:47:26 UTC+0300 LogGroup CREATE_COMPLETE -
2021-01-05 13:47:25 UTC+0300 CloudMap CREATE_IN_PROGRESS Resource creation Initiated
2021-01-05 13:47:25 UTC+0300 LogGroup CREATE_IN_PROGRESS Resource creation Initiated
2021-01-05 13:47:24 UTC+0300 WebTaskExecutionRole CREATE_IN_PROGRESS Resource creation Initiated
2021-01-05 13:47:24 UTC+0300 WebTCP80TargetGroup CREATE_COMPLETE -
2021-01-05 13:47:24 UTC+0300 WebTaskExecutionRole CREATE_IN_PROGRESS -
2021-01-05 13:47:24 UTC+0300 WebTCP80TargetGroup CREATE_IN_PROGRESS Resource creation Initiated
2021-01-05 13:47:23 UTC+0300 DefaultNetwork CREATE_IN_PROGRESS -
2021-01-05 13:47:23 UTC+0300 LogGroup CREATE_IN_PROGRESS -
2021-01-05 13:47:23 UTC+0300 CloudMap CREATE_IN_PROGRESS -
2021-01-05 13:47:23 UTC+0300 WebTCP80TargetGroup CREATE_IN_PROGRESS -
2021-01-05 13:47:19 UTC+0300 ecs-bug-1084 CREATE_IN_PROGRESS User Initiated
Thanks for digging into this. Can you please confirm that the CloudFormation deployment run by compose up happened in the expected eu-west-1 region?
I can hardly imagine how aws cloudformation differs from Go code invoking the exact same API with the exact same template, but the devil is in the details...
Can you please also check in the CloudFormation "model" UI that the uploaded template has the expected cluster ARN set for the (failing) service?
Checked the UI. The template (Stack -> Template tab) contains the same CloudFormation definition as compose convert provides. When I run compose up, the calls go to eu-west-1 endpoints (I can see it via Little Snitch), and the compose output points to that region too. The Stack -> Resources view shows effective ARNs in the correct region.
Maybe the problem is with how the docker context is set up. I tried the following variants:
- region in ~/.aws/config and ~/.aws/credentials
- the Endpoints.ecs.Region value
- AWS_DEFAULT_REGION
None of the variants work. I think this bug can be related to #1056
@ndeloof
I've also run into this issue. I am also experiencing this weird discrepancy between CloudFormation and Docker compose. Taking the output from docker compose convert and running it directly in CloudFormation works fine.
The reason the stack fails to create using docker compose up is listed as "Resource creation cancelled"; it looks like docker compose up is cancelling the CloudFormation stack creation before it can finish.
To add to this, CloudFormation says in the events section: DELETE_IN_PROGRESS - User Initiated.
Then, after it has finished tearing down the stack, docker compose spits out that very unhelpful ClusterNotFoundException: Cluster not found error.
The cluster is definitely there and this is not an issue with the generated CloudFormation template. Is docker compose making a background request to check whether the cluster exists while the CloudFormation stack is in the middle of being created (and this request subsequently fails for some reason)?
I've tried running this with --debug enabled as well, but nothing seems to show up during the stack creation phase.
AND to throw in some even more random stuff: when you run docker compose up -d, it runs fine. I have no idea what difference -d is making in this scenario.
Hope this helps.
TL;DR: the CloudFormation template is fine; docker compose is explicitly telling CloudFormation to cancel the stack midway through.
compose up just invokes the CloudFormation API to apply the converted CloudFormation template; it does not query the AWS API after this step. I wonder whether the deployment fails due to some timeout (we don't set one)?
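For reference, here is a minimal sketch (not the actual compose-cli code; the stack name and template body are illustrative placeholders) of what that single "apply template" call amounts to with aws-sdk-go:

```go
// Sketch only: submit the converted template as a CloudFormation stack and return.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudformation"
)

func main() {
	// Region and credentials are resolved from the shared AWS config/profile.
	sess := session.Must(session.NewSessionWithOptions(session.Options{
		SharedConfigState: session.SharedConfigEnable,
	}))
	cf := cloudformation.New(sess)

	templateBody := "..." // output of `docker compose convert`, elided here

	out, err := cf.CreateStack(&cloudformation.CreateStackInput{
		StackName:    aws.String("ecs-bug-1084"), // illustrative name
		TemplateBody: aws.String(templateBody),
		Capabilities: []*string{aws.String(cloudformation.CapabilityCapabilityIam)},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("created stack:", aws.StringValue(out.StackId))
}
```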
Also, "ClusterNotFoundException" is an AWS API error (as demonstrated by the Java-style of this error), so it seems there's a race condition happening on AWS-side applying CloudFormation template, where some resource require the cluster to be up but this one is not yet set.
Can you please check the AWS web console and collect the list of Cloudformation events, so we know the first event to happend that would explain this deployment failure?
Here is my output from the AWS CLI; this is the most detailed log I can get. You can see that at 2021-03-01T13:26:13.662000+00:00 there is a User Initiated stack delete, and there is nothing else suspicious before then - and definitely no mention of this ClusterNotFoundException - which makes me think it is coming from a different API call happening somewhere I can't see.
But why does running this as a daemon work? What difference does -d make here?
Running with -d, we don't watch CloudFormation events to report deployment progress. I wonder whether an error while collecting those events triggers the context being canceled, and as a result the initial CloudFormation "apply template" API call is also canceled.
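To illustrate that hypothesis - this is just a sketch of the cancellation pattern, not the actual compose-cli code - an error from a progress-watcher goroutine that shares an errgroup context would abort the sibling operation:

```go
// Sketch only: a failing watcher cancels the shared context, aborting the
// goroutine that tracks the CloudFormation stack creation.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

func main() {
	g, ctx := errgroup.WithContext(context.Background())

	// Stand-in for waiting on CloudFormation stack creation.
	g.Go(func() error {
		select {
		case <-time.After(5 * time.Minute):
			return nil // the stack would normally complete here
		case <-ctx.Done():
			return ctx.Err() // cancelled because the watcher goroutine failed
		}
	})

	// Stand-in for the progress watcher hitting an ECS error while the stack
	// is still being created.
	g.Go(func() error {
		return errors.New("ClusterNotFoundException: Cluster not found")
	})

	// Prints the ECS error; the stack wait above was aborted via ctx.
	fmt.Println(g.Wait())
}
```

If something like this happens, it would also match the "User Initiated" stack delete seen in the events above.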
@ndeloof
I've also run into this issue.
[+] Running 8/10
[+] Running 10/10
CreateInProgress User Initiated 58.3s
 ⠿ dev-cluster                     DeleteComplete  166.0s
 ⠿ CloudMap                        DeleteComplete  161.0s
 ⠿ BackendTCP80TargetGroup         DeleteComplete  116.0s
 ⠿ LogGroup                        DeleteComplete  117.0s
 ⠿ Default80Ingress                DeleteComplete   59.0s
 ⠿ BackendTaskExecutionRole        DeleteComplete  117.0s
 ⠿ BackendTCP80Listener            DeleteComplete  113.0s
 ⠿ BackendTaskDefinition           DeleteComplete   96.0s
 ⠿ BackendServiceDiscoveryEntry    DeleteComplete   67.0s
 ⠿ BackendService                  DeleteComplete   61.0s
ClusterNotFoundException: Cluster not found.
Taking the output from docker compose convert and running it directly in CloudFormation works fine.
Also running into this now; it would be nice if this could be solved.
Running version 1.0.12 of the Compose CLI.
Use case: I want to run the ECS apps on FARGATE_SPOT to save money.
Also running into this issue:
Using -d works. Also, docker compose convert + applying via CloudFormation works too.
I also ran into this issue.
I have provided the existing ECS cluster, VPC, and load balancer as inputs in the docker-compose file.
I ran the docker compose up command, docker compose started creating the CloudFormation stack, and then the stack creation failed. Upon checking the CloudTrail API calls, I noticed that the DescribeServices API call made by the docker CLI failed with a ClientException and the error 'Cluster not found.'
In the DescribeServices API call, I observed that the docker CLI passed an empty cluster in the request parameters. Please find the snippet below.
➢ "requestParameters": {
"cluster": "",
"services": [
"arn:aws:ecs:us-east-2:<AccountID>:service/<cluster-name>/<service-name>"
]
}
It would be great if the docker CLI could pass the cluster name as well in the DescribeServices API call.
However, I was able to deploy the docker compose project to ECS in the us-east-1 region successfully. It might be that ECS is picking up the cluster name in us-east-1. The issue could be on the ECS side as well.
@SatyaHarish9 could you please use docker compose convert to get the CloudFormation template generated from your compose model and confirm whether the Service resource is created with an empty cluster attribute?
internal note: relevant code here https://github.com/docker/compose-cli/blob/main/ecs/cloudformation.go#L244
Hello @ndeloof,
I have used the docker compose convert command to get the CloudFormation template. In the template, I can see that the Cluster property of the AWS::ECS::Service resource has the cluster ARN as its value.
However, after the docker CLI initiated the CloudFormation stack creation, it tried to describe the service using the DescribeServices API, and there it did not pass the cluster name in the request parameters.
DescribeServices - Request Parameters: https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_DescribeServices.html#API_DescribeServices_RequestParameters
According to that document, ECS assumes the default cluster if the cluster is not passed in the request parameters: "This parameter is required if the service or services described were launched in any cluster other than the default cluster."
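For illustration, here is what the call would look like with the Cluster parameter filled in (a minimal aws-sdk-go sketch, not the actual compose-cli code; the cluster name and service ARN are made up). When Cluster is left empty, ECS falls back to the "default" cluster and returns ClusterNotFoundException for services that live in a pre-created cluster:

```go
// Sketch only: DescribeServices with the pre-created cluster set explicitly.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

func main() {
	sess := session.Must(session.NewSessionWithOptions(session.Options{
		SharedConfigState: session.SharedConfigEnable,
	}))
	svc := ecs.New(sess)

	out, err := svc.DescribeServices(&ecs.DescribeServicesInput{
		// Must be the pre-created cluster name or ARN; "" means "default".
		Cluster: aws.String("monty-ecs-cluster"),
		Services: []*string{
			aws.String("arn:aws:ecs:us-east-2:123456789012:service/monty-ecs-cluster/web"),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, s := range out.Services {
		fmt.Println(aws.StringValue(s.ServiceName), aws.StringValue(s.Status))
	}
}
```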
@ndeloof
I am also running into this. What is interesting: after docker compose convert, the metadata has the right ARN for the cluster that I set in docker-compose. However, docker compose up ends with:
ClusterNotFoundException: Cluster not found.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Reproduced the original bug with the following changes:
- -D (debug output) does not exist anymore
- --context has moved to the global docker options, so the third step is docker --context=myecs compose up
The error is now:
InvalidParameterException: Invalid identifier: Identifier is for cluster monty-ecs-cluster. Your cluster is default
Passing the detached flag -d helps to avoid the problem.
However, it is not always appropriate to use detached mode - you then have to check the cluster and service status yourself via API calls to ECS, which defeats the point of using the cloud-abstracting docker compose approach in the first place.
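For completeness, here is a minimal sketch (not compose-cli code; the cluster and service names are illustrative) of the kind of manual status check you end up writing when running detached, using the aws-sdk-go waiter that polls DescribeServices until the service settles:

```go
// Sketch only: wait for an ECS service in a pre-created cluster to become stable.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

func main() {
	sess := session.Must(session.NewSessionWithOptions(session.Options{
		SharedConfigState: session.SharedConfigEnable,
	}))
	svc := ecs.New(sess)

	input := &ecs.DescribeServicesInput{
		Cluster:  aws.String("monty-ecs-cluster"), // pre-created cluster
		Services: []*string{aws.String("web")},    // service created by the stack
	}
	// Polls DescribeServices until deployments have settled or the waiter times out.
	if err := svc.WaitUntilServicesStable(input); err != nil {
		log.Fatal(err)
	}
	fmt.Println("service is stable")
}
```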
This issue has been automatically marked as not stale anymore due to the recent activity.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Tried with the latest Docker version, still the same problem.
docker version
Client:
Cloud integration: v1.0.24
Version: 20.10.17
API version: 1.41
Go version: go1.17.11
Git commit: 100c701
Built: Mon Jun 6 23:04:45 2022
OS/Arch: darwin/amd64
Context: default
Experimental: true
Server: Docker Desktop 4.10.0 (82025)
Engine:
Version: 20.10.17
API version: 1.41 (minimum version 1.12)
Go version: go1.17.11
Git commit: a89b842
Built: Mon Jun 6 23:01:23 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.6
GitCommit: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
runc:
Version: 1.1.2
GitCommit: v1.1.2-0-ga916309
docker-init:
Version: 0.19.0
GitCommit: de40ad0
docker info
Client:
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc., v0.8.2)
compose: Docker Compose (Docker Inc., v2.6.1)
extension: Manages Docker extensions (Docker Inc., v0.2.7)
sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc., 0.6.0)
scan: Docker Scan (Docker Inc., v0.17.0)
Server:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 23
Server Version: 20.10.17
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
runc version: v1.1.2-0-ga916309
init version: de40ad0
Security Options:
seccomp
Profile: default
cgroupns
Kernel Version: 5.10.104-linuxkit
Operating System: Docker Desktop
OSType: linux
Architecture: x86_64
CPUs: 3
Total Memory: 2.921GiB
Name: docker-desktop
ID: 5DPZ:RDY2:6FBI:J2WH:YQS5:4PK3:5TPR:WBIN:7FCP:VI6X:6XCC:AV7Y
Docker Root Dir: /var/lib/docker
Debug Mode: false
HTTP Proxy: http.docker.internal:3128
HTTPS Proxy: http.docker.internal:3128
No Proxy: hubproxy.docker.internal
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
hubproxy.docker.internal:5000
127.0.0.0/8
Live Restore Enabled: false
This issue has been automatically marked as not stale anymore due to the recent activity.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
The latest Docker 20.10 has the exact same problem:
InvalidParameterException: Invalid identifier: Identifier is for cluster monty-ecs-cluster. Your cluster is default
docker version
Client:
Cloud integration: v1.0.29
Version: 20.10.21
API version: 1.41
Go version: go1.18.7
Git commit: baeda1f
Built: Tue Oct 25 18:01:18 2022
OS/Arch: darwin/amd64
Context: default
Experimental: true
Server: Docker Desktop 4.15.0 (93002)
Engine:
Version: 20.10.21
API version: 1.41 (minimum version 1.12)
Go version: go1.18.7
Git commit: 3056208
Built: Tue Oct 25 18:00:19 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.10
GitCommit: 770bd0108c32f3fb5c73ae1264f7e503fe7b2661
runc:
Version: 1.1.4
GitCommit: v1.1.4-0-g5fd4c4d
docker-init:
Version: 0.19.0
GitCommit: de40ad0
docker info
Client:
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc., v0.9.1)
compose: Docker Compose (Docker Inc., v2.13.0)
dev: Docker Dev Environments (Docker Inc., v0.0.5)
extension: Manages Docker extensions (Docker Inc., v0.2.16)
sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc., 0.6.0)
scan: Docker Scan (Docker Inc., v0.22.0)
Server:
Containers: 1
Running: 0
Paused: 0
Stopped: 1
Images: 23
Server Version: 20.10.21
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 770bd0108c32f3fb5c73ae1264f7e503fe7b2661
runc version: v1.1.4-0-g5fd4c4d
init version: de40ad0
Security Options:
seccomp
Profile: default
cgroupns
Kernel Version: 5.15.49-linuxkit
Operating System: Docker Desktop
OSType: linux
Architecture: x86_64
CPUs: 3
Total Memory: 2.92GiB
Name: docker-desktop
ID: REDACTED
Docker Root Dir: /var/lib/docker
Debug Mode: false
HTTP Proxy: http.docker.internal:3128
HTTPS Proxy: http.docker.internal:3128
No Proxy: hubproxy.docker.internal
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
hubproxy.docker.internal:5000
127.0.0.0/8
Live Restore Enabled: false
aws --version
aws-cli/2.9.11 Python/3.11.1 Darwin/21.6.0 source/x86_64 prompt/off
This issue has been open for 2+ years with absolutely nothing being done about it. It was such a pressing issue for my team that we had to stop using this altogether. I have no idea why the decision was made to add this as a feature and then not maintain it at all. This is also not the only issue like this; it seems anything related to the docker compose AWS/ECS integration has just been left behind.
This issue has been automatically marked as not stale anymore due to the recent activity.
bumping this for attention.
FYI I've opened a patch for this via #2269, but given the imminent EOL (#2258, docker/compose-ecs#7), it will (most likely) not be merged in this repository, so I've opened docker/compose-ecs#19 to track this issue, and docker/compose-ecs#20 to patch it in the new repo.
Description
docker compose up fails with "ClusterNotFoundException: Cluster not found" when an ECS cluster is pre-created.
Steps to reproduce the issue:
1. Pre-create an ECS cluster.
2. Prepare a compose file with x-aws-* variables. Check a simple compose file.
3. Run docker compose --context myecs up -D
Describe the results you received:
Got the 'ClusterNotFoundException: Cluster not found.' error.
The output of docker compose --context myecs up -D:
Additional facts:
- The cluster definitely exists; this can be verified in the AWS GUI or with aws --profile aroot --region eu-west-1 ecs list-clusters
- If a cluster was not pre-created, compose up succeeds.
- It does not matter whether the ECS cluster was created by Terraform or via the GUI.
- Applying the CloudFormation script created by convert succeeds and the cluster is created.
So that means a failing check was performed in the compose CLI. I did not try to launch the compose CLI after applying the CloudFormation template, though.
Describe the results you expected:
Expected the app to be up.
Additional information you deem important (e.g. issue happens only occasionally):
Output of docker version:
Output of docker context show:
You can also run docker context inspect context-name to give us more details, but don't forget to remove sensitive content.
Output of docker info:
Additional environment details (AWS ECS, Azure ACI, local, etc.):
AWS profile is set up with all IAM permissions.
Resources are in eu-west-1.
AWS CLI version: aws-cli/2.1.13 Python/3.9.1 Darwin/19.6.0 source/x86_64 prompt/off