jazzl0ver closed 6 years ago
Thanks for reporting the issue! Killing the manageserver task will not cause this issue. Did you collect the logs under /var/log/firecamp/ on the node before terminating it? If you can reproduce the issue, could you please collect them?
No, I didn't collect the logs, but the issue is still in place. There's only one file in /var/log/firecamp: firecamp-dockervolume.ip-172-22-2-212.root.log.INFO.20180302-100613.6 You may get it here: http://termbin.com/dva2
Thanks! Could you please use the CLI to send a list-members request and collect the manageserver logs in CloudWatch?
The command is like: ./firecamp-service-cli -cluster=firecamp-qa -op=list-members -service-name=kafka-qa
One more question: when was the kafka-qa service created? Has the cluster been around for a long time?
Very strange:
# ./firecamp-service-cli -cluster=firecamp-qa -op=list-members -service-name=kafka-qa
2018-03-02 21:12:00.228983364 +0000 UTC ListServiceMember error Get http://firecamp-manageserver.firecamp-qa-firecamp.com:27040/?List-ServiceMember: EOF
CW logs during that call:
I0302 21:09:08.369488 1 server.go:114] request Method GET URL /?List-ServiceMember ?List-ServiceMember Host firecamp-manageserver.firecamp-qa-firecamp.com:27040 requuid req-b906e338444c4fe359c3f2d196b75eae headers map[User-Agent:[Go-http-client/1.1] Content-Length:[100] Accept-Encoding:[gzip]]
I0302 21:09:08.369764 1 server.go:743] listServiceMembers &{us-east-1 firecamp-qa kafka-qa } requuid req-b906e338444c4fe359c3f2d196b75eae
I0302 21:09:08.406271 1 dynamodb_service.go:81] get service &{firecamp-qa kafka-qa ae9a07f638c145866458232d81edbead} requuid req-b906e338444c4fe359c3f2d196b75eae
I0302 21:09:08.411854 1 dynamodb_servicemember.go:270] list serviceMembers succeeded, serviceUUID ae9a07f638c145866458232d81edbead limit 0 requuid req-b906e338444c4fe359c3f2d196b75eae resp count 0xc420624a38
2018/03/02 21:09:08 http: panic serving 172.22.2.56:40344: runtime error: invalid memory address or nil pointer dereference
goroutine 74 [running]:
net/http.(*conn).serve.func1(0xc420606000)
/usr/local/go/src/net/http/server.go:1721 +0xd0
panic(0x1556be0, 0x220cb80)
/usr/local/go/src/runtime/panic.go:489 +0x2cf
github.com/cloudstax/firecamp/db/awsdynamodb.(*DynamoDB).attrsToServiceMember(0xc4203a5d10, 0xc42061b1c0, 0x20, 0xc420604ff0, 0x0, 0xc420601d80, 0x31)
/home/junius/work/go/src/github.com/cloudstax/firecamp/db/awsdynamodb/dynamodb_servicemember.go:383 +0x8f9
github.com/cloudstax/firecamp/db/awsdynamodb.(*DynamoDB).listServiceMembersWithLimit(0xc4203a5d10, 0x7f2a9c288bb0, 0xc4206321e0, 0xc42061b1c0, 0x20, 0x0, 0x2220a20, 0x0, 0xc4201776c0, 0x4, ...)
/home/junius/work/go/src/github.com/cloudstax/firecamp/db/awsdynamodb/dynamodb_servicemember.go:288 +0xba5
github.com/cloudstax/firecamp/db/awsdynamodb.(*DynamoDB).ListServiceMembers(0xc4203a5d10, 0x7f2a9c288bb0, 0xc4206321e0, 0xc42061b1c0, 0x20, 0xc420634288, 0x8, 0xc420514390, 0x0, 0x0)
/home/junius/work/go/src/github.com/cloudstax/firecamp/db/awsdynamodb/dynamodb_servicemember.go:229 +0x66
github.com/cloudstax/firecamp/manage/server.(*ManageHTTPServer).listServiceMembers(0xc4207247e0, 0x7f2a9c288bb0, 0xc4206321e0, 0x21c9820, 0xc4205e01c0, 0xc42000ac00, 0xc4206321b0, 0x24, 0x0, 0x1365642, ...)
/home/junius/work/go/src/github.com/cloudstax/firecamp/manage/server/server.go:751 +0xabc
github.com/cloudstax/firecamp/manage/server.(*ManageHTTPServer).getOp(0xc4207247e0, 0x7f2a9c288bb0, 0xc4206321e0, 0x21c9820, 0xc4205e01c0, 0xc42000ac00, 0xc4206320f5, 0x13, 0xc4206321b0, 0x24, ...)
/home/junius/work/go/src/github.com/cloudstax/firecamp/manage/server/server.go:573 +0x50b
github.com/cloudstax/firecamp/manage/server.(*ManageHTTPServer).ServeHTTP(0xc4207247e0, 0x21c9820, 0xc4205e01c0, 0xc42000ac00)
/home/junius/work/go/src/github.com/cloudstax/firecamp/manage/server/server.go:133 +0xce8
net/http.serverHandler.ServeHTTP(0xc4206ccbb0, 0x21c9820, 0xc4205e01c0, 0xc42000ac00)
/usr/local/go/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xc420606000, 0x21ca560, 0xc4206000c0)
/usr/local/go/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
/usr/local/go/src/net/http/server.go:2668 +0x2ce
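For readers wondering why the CLI only reports an error such as EOF while the server log shows a panic: Go's net/http server recovers a panic in a handler, logs it ("http: panic serving ..."), and closes the connection without sending a response. The following is a minimal sketch of that behaviour, not FireCamp code; the handler and the function name getFromPanickingHandler are made up for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// getFromPanickingHandler starts a throwaway local server whose handler
// dereferences a nil pointer, sends one GET to it, and returns the error
// the client observes. (Hypothetical helper, not part of FireCamp.)
func getFromPanickingHandler() error {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var status *string
		// Panics before anything is written to the response.
		fmt.Fprint(w, *status)
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err == nil {
		resp.Body.Close()
	}
	return err
}

func main() {
	// The server side logs "http: panic serving ..." to stderr;
	// the client only sees a failed request.
	fmt.Println("client error:", getFromPanickingHandler())
}
```

The client gets no HTTP status code in this case, so the server-side log is the only place where the real cause shows up.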
According to ECS console, the kafka service was created on Dec 12 and last updated on Jan 18. Yeah, this is our 1st firecamp cluster, it was created on Dec 12.
Thanks for sharing the details! This is an upgrade issue. The "Status" field in ServiceMember was added on Dec 30th for release 0.9.2, so a service created before that does not have the "Status" field. Upgrades are handled for data structure changes made after 0.9.2, but not for changes made before it. Is it OK for you to delete the service and recreate it?
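To illustrate the failure mode described above, here is a minimal sketch of converting a stored item into a member record when an attribute may be absent. It is not the actual attrsToServiceMember code: AttributeValue is a simplified stand-in for the DynamoDB SDK type, the attribute name "MemberName" and the fallback value "ACTIVE" are assumptions, and only the "Status" attribute name comes from this thread.

```go
package main

import "fmt"

// AttributeValue is a simplified stand-in for the DynamoDB SDK's
// attribute type (assumption: only the string field matters here).
type AttributeValue struct {
	S *string
}

// ServiceMember is a reduced, hypothetical version of the real record.
type ServiceMember struct {
	MemberName string
	Status     string
}

// defaultStatus is an assumed placeholder for items written before
// the "Status" attribute existed.
const defaultStatus = "ACTIVE"

// stringAttr returns the string stored under key, and false when the
// attribute (or its string value) is missing, instead of dereferencing nil.
func stringAttr(item map[string]*AttributeValue, key string) (string, bool) {
	av, ok := item[key]
	if !ok || av == nil || av.S == nil {
		return "", false
	}
	return *av.S, true
}

// attrsToServiceMember builds a ServiceMember and tolerates old items
// that have no "Status" attribute.
func attrsToServiceMember(item map[string]*AttributeValue) (*ServiceMember, error) {
	name, ok := stringAttr(item, "MemberName")
	if !ok {
		return nil, fmt.Errorf("item has no MemberName attribute")
	}
	status, ok := stringAttr(item, "Status")
	if !ok {
		status = defaultStatus
	}
	return &ServiceMember{MemberName: name, Status: status}, nil
}

func strPtr(s string) *string { return &s }

func main() {
	// An item as it might look for a service created before the field was added.
	// Reading *oldItem["Status"].S directly would panic with a nil pointer dereference.
	oldItem := map[string]*AttributeValue{"MemberName": {S: strPtr("kafka-qa-0")}}

	m, err := attrsToServiceMember(oldItem)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s status=%s\n", m.MemberName, m.Status)
}
```

Whether a default value or an explicit upgrade step is the right choice depends on what the missing field means for old services; the sketch only shows how the nil dereference can be avoided.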
Looks like I can't delete it:
# ./firecamp-service-cli -cluster=firecamp-qa -op=delete-service -service-name=kafka-qa
2018-03-05 09:37:34.966205164 +0000 UTC DeleteService error Delete http://firecamp-manageserver.firecamp-qa-firecamp.com:27040/?Delete-Service: EOF
I've tried to downgrade firecamp-manager to 0.9.2 (and use cli of the same version) - no success: same thing in the logs.
PS Well, it actually was deleted..
You are right. The latest manageserver is not able to delete the service: it hits the same issue when listing all service members.
The downgraded firecamp-manager should be able to delete the service. For the deletion failure, could you please post the manageserver log?
I'm sorry - I've already deleted the old manageserver log. It contained the same stuff I posted here before.
The old log is not needed; it hits the same issue when listing service members.
You mentioned that the deletion against the downgraded manager also failed. Do you have that log? I just want to double-check that there is no other potential issue.
Sorry - this is the one I've deleted. At the moment I only have a log of the newly created firecamp cluster. As I mentioned earlier, the delete-service command returned EOF, but it did the actual delete.
Thanks. If the newly created service works well, could we close this issue?
One additional note: we can't guarantee that upgrades always work between different versions of the latest release, since the latest release is under development. Upgrading from an old release after 0.9.2 to a new release will be supported. If new fields are added to the service data structure, a CLI tool will be provided to upgrade the old service to the new release. The next release will include a detailed guide for upgrading from release 0.9.4.
If you run into any issues, please feel free to report them. Thanks!
Got it, thank you!
There was an issue with one of the EC2 instances, so I terminated it and the ASG started a new one. For some reason, the containers are not starting up on the new instance. They fail with (from /var/log/docker):
So, no volumes are mounted.
The only thing I did against this cluster recently was killing the firecamp-manageserver task to get it updated to the latest version. The other two cluster nodes work w/o issues. The only difference I see is the agent version:
"Amazon ECS Agent - v1.16.2 (55b7b5f)" - at the new (non-working) instance
"Amazon ECS Agent - v1.16.0 (e24ae08)" - at working instances
Please help me figure this out!