Chatie / server

Cloud Management Service for Chatie
https://www.chatie.io
Apache License 2.0

Chatie API Server Down Accident Report #98

Open su-chang opened 1 year ago

su-chang commented 1 year ago

Token Service Discovery Service Accident

Our Wechaty puppet service discovery service has been experiencing outages since 3 pm on Feb 7.

  1. 10 am Feb 7: We noticed abnormal disk usage on some instances, cleared the log files, and kept the instances running normally. At that time api.chatie.io was still working well.
  2. 3 pm Feb 7: The problem broke out in the afternoon. We started working on it and found that api.chatie.io was returning HTTP status code 503.
  3. 2 am Feb 8: @huan shared some detailed info from Heroku, see: https://github.com/Chatie/server/issues/97#issuecomment-1421208269
  4. 8 am Feb 8: We confirmed that api.chatie.io was out of service because it received too many requests (init token on api.chatie.io) within a few seconds.
  5. 9 am Feb 8: We found the bug in wechaty-puppet-workpro: a Node.js timer function that inits the token on api.chatie.io was not being cleared correctly. We also found that the only way to fix this bug temporarily was to restart all containers.
  6. 10 am Feb 8: We confirmed the time window for restarting all containers.
  7. 2 pm Feb 8: We restarted all containers.
  8. 2:30 pm Feb 8: The server was fully restored.
  9. 6 pm Feb 8: We created the hotfix PR to fix this problem.
  10. 9 pm Feb 8: The PR was merged and was ready to deploy.
  11. 0 pm Feb 9: We started deploying to some instances.
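
The failure mode in point 5 can be sketched roughly as follows. This is only a minimal TypeScript illustration, not the actual wechaty-puppet-workpro code: `TokenRegistrar`, `initToken`, and the 60-second interval are hypothetical names and values chosen for the example. The idea is that a repeating timer must be cleared before a new one is created, otherwise every restart of the logic adds one more timer and the request rate keeps growing.

```typescript
// Hypothetical sketch: these names do not come from wechaty-puppet-workpro.
type TimerHandle = ReturnType<typeof setInterval>

class TokenRegistrar {
  private timer: TimerHandle | undefined = undefined
  private readonly initToken: () => void
  private readonly intervalMs: number

  // initToken stands in for the real "init token on api.chatie.io" request.
  constructor(initToken: () => void, intervalMs: number) {
    this.initToken = initToken
    this.intervalMs = intervalMs
  }

  start(): void {
    // Clear any timer left over from a previous start(). Without this call,
    // each start() would leave the old interval running next to the new one.
    this.stop()
    this.timer = setInterval(this.initToken, this.intervalMs)
  }

  stop(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer)
      this.timer = undefined
    }
  }

  get running(): boolean {
    return this.timer !== undefined
  }
}

// Usage: calling start() twice still leaves exactly one active interval.
const registrar = new TokenRegistrar(() => console.log('init token'), 60_000)
registrar.start()
registrar.start()
registrar.stop()
```

Keeping the handle in one place and always going through `stop()` means a single `clearInterval` is enough to shut the timer down, which is the property that was missing when the only temporary fix was to restart the containers.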

su-chang commented 1 year ago

TODO

We will continue deploying the fixed version to the remaining instances before Feb 15.

huan commented 1 year ago

Could you explain point 5: why does wechaty-puppet-workpro need to init a token on api.chatie.io?

If I remember correctly, the wechaty service discovery is managed by Wechaty itself?

su-chang commented 1 year ago

wechaty-puppet-workpro is based on wechaty-grpc, not on wechaty-puppet-service.

The logic for initializing the token on api.chatie.io is maintained by workpro itself.

huan commented 1 year ago

I think this is a bad idea, but I hope it will work well in the future.

The protocol might be changed someday so please be prepared to follow new protocols.

su-chang commented 1 year ago

Thanks for your advice. We will pay attention to it.