cloudstax / firecamp

Serverless Platform for the stateful services
https://www.cloudstax.io
Apache License 2.0

Tasks can't be started on an EC2 instance #41

Closed jazzl0ver closed 6 years ago

jazzl0ver commented 6 years ago

One of the EC2 instances had an issue, so I terminated it and the ASG started a new one. For some reason, the containers are not starting on the new instance. They fail with the following (from /var/log/docker):

time="2018-03-02T10:07:55Z" level=info msg="2018/03/02 10:07:55 http: panic serving @: runtime error: invalid memory address or nil pointer dereference" plugin=3c95129f659d2d162550065c4200980c15d8d2ce25c002c9f01f96c84f3ea636
time="2018-03-02T10:07:55.565099089Z" level=warning msg="Unable to connect to plugin: /run/docker/plugins/3c95129f659d2d162550065c4200980c15d8d2ce25c002c9f01f96c84f3ea636/firecampvol.sock/VolumeDriver.Mount: Post http://%2Frun%2Fdocker%2Fplugins%2F3c95129f659d2d162550065c4200980c15d8d2ce25c002c9f01f96c84f3ea636%2Ffirecampvol.sock/VolumeDriver.Mount: EOF, retrying in 1s"

So, no volumes are mounted.

[root@ip-172-22-2-212 log]# docker ps
CONTAINER ID        IMAGE                                        COMMAND             CREATED             STATUS              PORTS               NAMES
634ef2c0ea2f        cloudstax/firecamp-amazon-ecs-agent:latest   "/agent"            5 hours ago         Up 5 hours                              ecs-agent
[root@ip-172-22-2-212 log]# docker plugin ls
ID                  NAME                               DESCRIPTION                                     ENABLED
3c95129f659d        cloudstax/firecamp-volume:latest   firecamp volume plugin for docker               true
656134559eb0        cloudstax/firecamp-log:latest      firecamp log plugin for docker: consume lo...   true

The only thing I did to this cluster recently was kill the firecamp-manageserver task so that it would be updated to the latest version. The other two cluster nodes work without issues. The only difference I see is the agent version:

grep Agent /var/log/firecamp/firecamp-dockervolume.INFO

"Amazon ECS Agent - v1.16.2 (55b7b5f)" - on the new (non-working) instance
"Amazon ECS Agent - v1.16.0 (e24ae08)" - on the working instances

Please help me figure this out!

JuniusLuo commented 6 years ago

Thanks for reporting the issue! Killing the manageserver task will not cause this issue. Did you collect the logs under /var/log/firecamp/ on the node before terminating it? If you can reproduce the issue, could you please collect them?

jazzl0ver commented 6 years ago

No, I didn't collect the logs, but the issue is still present. There's only one file in /var/log/firecamp: firecamp-dockervolume.ip-172-22-2-212.root.log.INFO.20180302-100613.6
You can get it here: http://termbin.com/dva2

JuniusLuo commented 6 years ago

Thanks! Could you please use the CLI to send a list-members request and collect the manageserver logs from CloudWatch?

JuniusLuo commented 6 years ago

The command looks like this: ./firecamp-service-cli -cluster=firecamp-qa -op=list-members -service-name=kafka-qa

JuniusLuo commented 6 years ago

One more question: when was the kafka-qa service created? Has the cluster been around for a long time?

jazzl0ver commented 6 years ago

Very strange:

# ./firecamp-service-cli -cluster=firecamp-qa -op=list-members -service-name=kafka-qa
2018-03-02 21:12:00.228983364 +0000 UTC ListServiceMember error Get http://firecamp-manageserver.firecamp-qa-firecamp.com:27040/?List-ServiceMember: EOF

CloudWatch logs during that call:

I0302 21:09:08.369488 1 server.go:114] request Method GET URL /?List-ServiceMember ?List-ServiceMember Host firecamp-manageserver.firecamp-qa-firecamp.com:27040 requuid req-b906e338444c4fe359c3f2d196b75eae headers map[User-Agent:[Go-http-client/1.1] Content-Length:[100] Accept-Encoding:[gzip]]
I0302 21:09:08.369764 1 server.go:743] listServiceMembers &{us-east-1 firecamp-qa kafka-qa } requuid req-b906e338444c4fe359c3f2d196b75eae
I0302 21:09:08.406271 1 dynamodb_service.go:81] get service &{firecamp-qa kafka-qa ae9a07f638c145866458232d81edbead} requuid req-b906e338444c4fe359c3f2d196b75eae
I0302 21:09:08.411854 1 dynamodb_servicemember.go:270] list serviceMembers succeeded, serviceUUID ae9a07f638c145866458232d81edbead limit 0 requuid req-b906e338444c4fe359c3f2d196b75eae resp count 0xc420624a38
2018/03/02 21:09:08 http: panic serving 172.22.2.56:40344: runtime error: invalid memory address or nil pointer dereference
goroutine 74 [running]:
net/http.(*conn).serve.func1(0xc420606000)
/usr/local/go/src/net/http/server.go:1721 +0xd0
panic(0x1556be0, 0x220cb80)
/usr/local/go/src/runtime/panic.go:489 +0x2cf
github.com/cloudstax/firecamp/db/awsdynamodb.(*DynamoDB).attrsToServiceMember(0xc4203a5d10, 0xc42061b1c0, 0x20, 0xc420604ff0, 0x0, 0xc420601d80, 0x31)
/home/junius/work/go/src/github.com/cloudstax/firecamp/db/awsdynamodb/dynamodb_servicemember.go:383 +0x8f9
github.com/cloudstax/firecamp/db/awsdynamodb.(*DynamoDB).listServiceMembersWithLimit(0xc4203a5d10, 0x7f2a9c288bb0, 0xc4206321e0, 0xc42061b1c0, 0x20, 0x0, 0x2220a20, 0x0, 0xc4201776c0, 0x4, ...)
/home/junius/work/go/src/github.com/cloudstax/firecamp/db/awsdynamodb/dynamodb_servicemember.go:288 +0xba5
github.com/cloudstax/firecamp/db/awsdynamodb.(*DynamoDB).ListServiceMembers(0xc4203a5d10, 0x7f2a9c288bb0, 0xc4206321e0, 0xc42061b1c0, 0x20, 0xc420634288, 0x8, 0xc420514390, 0x0, 0x0)
/home/junius/work/go/src/github.com/cloudstax/firecamp/db/awsdynamodb/dynamodb_servicemember.go:229 +0x66
github.com/cloudstax/firecamp/manage/server.(*ManageHTTPServer).listServiceMembers(0xc4207247e0, 0x7f2a9c288bb0, 0xc4206321e0, 0x21c9820, 0xc4205e01c0, 0xc42000ac00, 0xc4206321b0, 0x24, 0x0, 0x1365642, ...)
/home/junius/work/go/src/github.com/cloudstax/firecamp/manage/server/server.go:751 +0xabc
github.com/cloudstax/firecamp/manage/server.(*ManageHTTPServer).getOp(0xc4207247e0, 0x7f2a9c288bb0, 0xc4206321e0, 0x21c9820, 0xc4205e01c0, 0xc42000ac00, 0xc4206320f5, 0x13, 0xc4206321b0, 0x24, ...)
/home/junius/work/go/src/github.com/cloudstax/firecamp/manage/server/server.go:573 +0x50b
github.com/cloudstax/firecamp/manage/server.(*ManageHTTPServer).ServeHTTP(0xc4207247e0, 0x21c9820, 0xc4205e01c0, 0xc42000ac00)
/home/junius/work/go/src/github.com/cloudstax/firecamp/manage/server/server.go:133 +0xce8
net/http.serverHandler.ServeHTTP(0xc4206ccbb0, 0x21c9820, 0xc4205e01c0, 0xc42000ac00)
/usr/local/go/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xc420606000, 0x21ca560, 0xc4206000c0)
/usr/local/go/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
/usr/local/go/src/net/http/server.go:2668 +0x2ce

According to the ECS console, the kafka service was created on Dec 12 and last updated on Jan 18. Yes, this is our first firecamp cluster; it was created on Dec 12.

JuniusLuo commented 6 years ago

Thanks for sharing the detailed information! This is an upgrade issue. The "Status" field in ServiceMember was added on Dec 30th for release 0.9.2, so a service created before that does not have the "Status" field. Data structure changes made after 0.9.2 are handled during upgrade, but changes made before 0.9.2 are not. Is it OK for you to delete the service and recreate it?
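For illustration, the failure mode described above (a stored record that lacks a field the newer code expects) can be sketched as follows. This is a minimal sketch, not the actual firecamp code: `attributeValue` is a hypothetical stand-in for a DynamoDB attribute value, and `memberStatus` and `defaultMemberStatus` are assumed names, with the default value chosen only for the example. The point is that dereferencing a missing attribute panics with a nil pointer dereference, while checking for presence first lets old records fall back to a default.

```go
package main

import "fmt"

// attributeValue is a hypothetical stand-in for a DynamoDB attribute value;
// only the string field is modeled here.
type attributeValue struct {
	S *string
}

// defaultMemberStatus is an assumed default for records written before the
// "Status" field existed. The real default is not stated in this thread.
const defaultMemberStatus = "ACTIVE"

// memberStatus returns the "Status" attribute of an item, falling back to
// the default when the attribute is absent or carries no string value.
// Reading *item["Status"].S without these checks would panic on old records.
func memberStatus(item map[string]*attributeValue) string {
	attr, ok := item["Status"]
	if !ok || attr == nil || attr.S == nil {
		return defaultMemberStatus
	}
	return *attr.S
}

func main() {
	paused := "PAUSED"
	newItem := map[string]*attributeValue{"Status": {S: &paused}}
	oldItem := map[string]*attributeValue{} // written before the field was added

	fmt.Println(memberStatus(newItem))
	fmt.Println(memberStatus(oldItem))
}
```

Defaulting at read time keeps old records usable without rewriting them, at the cost of carrying the fallback in every reader of the field.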

jazzl0ver commented 6 years ago

Looks like I can't delete it:

# ./firecamp-service-cli -cluster=firecamp-qa -op=delete-service -service-name=kafka-qa
2018-03-05 09:37:34.966205164 +0000 UTC DeleteService error Delete http://firecamp-manageserver.firecamp-qa-firecamp.com:27040/?Delete-Service: EOF

I've tried downgrading firecamp-manager to 0.9.2 (and using the CLI of the same version), with no success: the same thing appears in the logs.

PS: Well, it actually was deleted.

JuniusLuo commented 6 years ago

You are right. The latest manageserver is not able to delete the service. It hits the same issue when listing all service members.

The downgraded firecamp-manager should be able to delete the service. For the deletion failure, could you please post the manageserver log?

jazzl0ver commented 6 years ago

I'm sorry, I've already deleted the old manageserver log. It contained the same output I posted here before.

JuniusLuo commented 6 years ago

The old log is not needed. It hits the same issue when listing service members.

You mentioned that the deletion against the downgraded manager also failed. Do you have that log? I just want to double-check that there is no other potential issue.

jazzl0ver commented 6 years ago

Sorry, that is the one I deleted. At the moment I only have a log from the newly created firecamp cluster. As I mentioned earlier, the delete-service command returned EOF, but it did actually delete the service.

JuniusLuo commented 6 years ago

Thanks. If the newly created service works well, can we close this issue?

One additional note: we can't guarantee that upgrades always work between different versions of the latest release, because the latest release is under development. Upgrading from an older version after 0.9.2 to a new version will be supported. If new fields are added to the service data structure, a CLI tool will be provided to upgrade the old service to the new release. We will have a detailed guide for this in the next release, covering upgrades from release 0.9.4.

If you run into any issues, please feel free to report them. Thanks!

jazzl0ver commented 6 years ago

Got it, thank you!