Magickbase / shaping

MIT License

Add Devops for explorers #12

Closed Keith-CY closed 1 year ago

Keith-CY commented 2 years ago

Please add a matrix of the current monitoring coverage, as follows:

| | Self-host node | Public node | APIs |
|---|---|---|---|
| CKB Explorer | | | |
| Godwoken Explorer | | | |
| CKB Explorer pending transaction sync | | | |
| Faucet | | | |

@ShiningRay feel free to add extra attributes

ShiningRay commented 2 years ago
Self-host node Public node APIs
CKB Explorer Staging
CKB Explorer Testnet Y
CKB Explorer Mainnet Y
Godwoken Explorer Staging gw-node, web3 gw-node, web3
Godwoken Explorer Testnet gw-node, web3 gw-node, web3 Y
Godwoken Explorer Mainnet gw-node, web3 gw-node, web3 Y
CKB Explorer Staging pending transaction sync N/A N/A
CKB Explorer Testnet pending transaction sync N/A N/A
CKB Explorer Mainnet pending transaction sync N/A N/A
CKB Explorer Faucet(Testnet) N/A N/A
Keith-CY commented 2 years ago

Last week we encountered a case:

  1. An alert that Godwoken Mainnet had stopped was posted in Discord.
  2. We checked the API of the gwscan server and found that its tip block number was not updating.
  3. We checked the API of gwscan's godwoken node and found that its tip block number was not updating.
  4. We checked the API of a public godwoken node (https://github.com/godwokenrises/godwoken-info/tree/main/mainnet_v1) and found that its tip block number was not updating.

We then concluded that all nodes had stopped and passed the issue to the godwoken team.

So I think we can add more information to the alert:

  1. tip block number and timestamp of the explorer's server
  2. tip block number of our self-hosted node
  3. tip block number of a public node served by another team, as a double check

With this information, we can identify the underlying cause ASAP.
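A minimal sketch of how the three tip block numbers could be compared and folded into one alert message. The names `TipReport`, `diagnose_stall`, `format_alert` and the `max_lag` threshold are hypothetical and do not come from the explorer codebase; fetching the values from each API is left out. The function is meant to run after a "tip not updating" alert has already fired.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TipReport:
    source: str                      # e.g. "explorer", "self-hosted node", "public node"
    tip_block: Optional[int]         # None when the source could not be reached
    timestamp: Optional[int] = None  # Unix seconds of the tip block, if known


def diagnose_stall(explorer, own_node, public_node, max_lag=5):
    """Guess which layer stalled; max_lag is an assumed tolerance in blocks."""
    reports = (explorer, own_node, public_node)
    if any(r.tip_block is None for r in reports):
        return "a source is unreachable; check connectivity first"
    if public_node.tip_block - own_node.tip_block > max_lag:
        return "self-hosted node is behind the public node"
    if own_node.tip_block - explorer.tip_block > max_lag:
        return "explorer is behind its own node"
    return "all sources agree; the chain itself may have stopped"


def format_alert(title, reports, diagnosis):
    """Build a plain-text alert body listing every source and the guess."""
    lines = [title]
    for r in reports:
        tip = "unreachable" if r.tip_block is None else str(r.tip_block)
        ts = "" if r.timestamp is None else f" (block timestamp {r.timestamp})"
        lines.append(f"- {r.source}: tip block {tip}{ts}")
    lines.append(f"Likely cause: {diagnosis}")
    return "\n".join(lines)
```

In the incident described above, all three sources would report the same stale tip, so the alert itself would already point at the chain rather than at our own infrastructure.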

We should also host an online site (or use GitHub issues) to show these statuses and publish incident reports, as the godwoken team does (https://godwokenstatus.statuspage.io/). It's not necessary, but it's good for the open-source community.

Keith-CY commented 2 years ago

We also need to add a monitor for the faucet service, because it's an important component for developers.

I'm not sure if this is correct, but if there are many transactions that stay pending for a long time, the faucet could be treated as broken. Any ideas, @ShiningRay?
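One way to make that heuristic concrete, as a sketch only: count the claims that have been pending longer than some age and flag the faucet when the count crosses a threshold. The function name and both thresholds (`max_age_seconds`, `min_stuck`) are hypothetical placeholders, not values taken from the faucet service.

```python
from typing import Iterable


def faucet_looks_stuck(
    pending_since: Iterable[int],
    now: int,
    max_age_seconds: int = 1800,  # assumed: 30 minutes counts as "a long time"
    min_stuck: int = 10,          # assumed: 10 old claims counts as "many"
) -> bool:
    """Return True when enough pending claims are older than max_age_seconds.

    pending_since holds the Unix timestamps at which each still-pending
    faucet transaction was created.
    """
    stuck = sum(1 for created in pending_since if now - created > max_age_seconds)
    return stuck >= min_stuck
```

Both thresholds would need tuning against normal faucet traffic, since a burst of fresh claims should not trigger the alert.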

Keith-CY commented 2 years ago

Is there a schedule for this? @ShiningRay @mdzh521

mdzh521 commented 2 years ago

I. Architecture diagram (CKB)

II. Apply to Kubernetes

  1. Advantages of Kubernetes:
     a. It avoids single points of failure and gives business clusters high availability.
     b. It supports reliable, frequent container image builds and deployments, with simple rollbacks and similar operations.
     c. It provides observability through built-in k8s features such as resource control and container health monitoring (exposing not only operating-system-level information and metrics, but also application health and other signals).
     d. Service deployments are portable.
     e. Resources are used efficiently and at high density.
     f. It comes with load balancing and service discovery.
     g. It is self-healing: Kubernetes restarts failed containers, replaces containers, kills containers that don't respond to user-defined health checks, and doesn't advertise them to clients until they're ready to serve.
     h. It stores sensitive information encrypted and makes configuration management easy to apply.

III. CKB cloud-native transformation design

  1. Cluster type
     a. 3 masters and 9 nodes
     b. etcd deployed externally (binary deployment)
  2. Service transformation
     a. Use the Deployment controller for web services.
     b. Use the StatefulSet controller for database services.
     c. Configuration files, with three options:
        ⅰ. variables
        ⅱ. ConfigMap
        ⅲ. Apollo
     d. Passwords are uniformly passed as variables (privacy protection).
     e. Different environments use different Namespaces, with permissions granted as needed.
     f. Calls between services in the same environment can use service discovery (for example, when godwoken calls ckb-node, the service name can be written directly in the calling code).
  3. Image registry selection: We can build our own Harbor registry, use a public image registry (as godwoken does, for example), or store images in Docker's official public registry (Docker Hub).
  4. Database selection: The main application database is hosted on AWS. The remaining databases are self-hosted in the K8s cluster.
  5. Network access: For high availability, we can use AWS's load balancing service (ELB).
  6. Project monitoring and alerting: Monitoring is divided into IaaS, PaaS, SaaS, and service monitoring.
     a. IaaS monitoring: Prometheus monitors the basic hardware (e.g. server running status, CPU, memory, disk).
     b. PaaS monitoring: Prometheus monitors the status of cluster software (e.g. the status of each node of the k8s cluster, the running status of the database).
     c. SaaS monitoring: Prometheus, using its service discovery, monitors the status of applications on the SaaS platform (e.g. whether the CKB-Node service is Running).
     d. Service monitoring: customized monitoring covers specific requirements (e.g. the business monitoring currently in use, which checks whether block production on the ckb platform is abnormal).
     The default alerting component for Prometheus is Alertmanager. Through Alertmanager, a variety of alert channels can be used, such as sending to a specified mailbox or to a custom platform.
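As an illustration of the customized service monitoring in item 6, here is a minimal sketch that renders a gauge in the Prometheus text exposition format, which a small exporter could serve on its `/metrics` endpoint for Prometheus to scrape. The helper `render_gauge` and the metric name `ckb_tip_block_number` are hypothetical; label values are assumed not to need escaping.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge metric in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs. Label values are
    assumed to contain no quotes, backslashes, or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric: the latest block number seen by each source.
body = render_gauge(
    "ckb_tip_block_number",
    "Latest block number reported by the source",
    [({"network": "mainnet", "source": "self_hosted"}, 9000000)],
)
```

An alerting rule could then fire when this gauge stops increasing for some interval, which matches the "blocks generated abnormally" check described above. In practice an official Prometheus client library would normally handle this rendering.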

IV. Platform security design

  1. Security groups: Control the traffic entering the network through security groups, and set different access rules according to each service's function.
  2. k8s permission management: The k8s platform is operated through a web UI, using the open-source platform Kuboard. Access rules and RBAC permissions are set per role.
  3. Server login: Logins are managed through a bastion host and authenticated with SSH keys.
Keith-CY commented 2 years ago

We've talked about hosting a web page for monitoring and issue reports. Here's a cool service to try: https://betterstack.com/better-uptime

mdzh521 commented 1 year ago

Regarding monitoring, we have basically completed the deployment. I am now working on the alerting part. Because Alertmanager cannot directly send alerts to Discord or as SMS notifications, I am thinking about how to achieve that.
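One possible approach, sketched under assumptions: point an Alertmanager webhook receiver at a small relay service that converts the notification into the JSON body a Discord webhook accepts. The function below covers only the conversion. It relies on the `alerts`, `status`, `labels`, and `annotations` fields of Alertmanager's webhook payload and on Discord's `content` field. The function name is hypothetical, and the HTTP server and the POST to the Discord webhook URL are left out.

```python
DISCORD_CONTENT_LIMIT = 2000  # Discord limits message content to 2000 characters


def alertmanager_to_discord(payload: dict) -> dict:
    """Convert an Alertmanager webhook payload into a Discord webhook body."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = str(alert.get("status", "unknown")).upper()
        name = labels.get("alertname", "unnamed alert")
        summary = annotations.get("summary") or annotations.get("description")
        line = f"[{status}] {name}"
        if summary:
            line += f": {summary}"
        lines.append(line)
    content = "\n".join(lines) or "Alertmanager sent an empty notification"
    # Truncate so Discord does not reject an oversized message.
    return {"content": content[:DISCORD_CONTENT_LIMIT]}
```

The returned dict would be serialized as JSON and POSTed to the Discord webhook URL by the relay.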

mdzh521 commented 1 year ago

I saw that https://betterstack.com/better-uptime is a paid service. Should we consider such a solution?

Keith-CY commented 1 year ago

> I saw that https://betterstack.com/better-uptime is a paid service. Should we consider such a solution?

The team (20 members) that recommended this service is using the free plan. We can pick the freelancer plan if necessary.

mdzh521 commented 1 year ago

That's a good idea to explore.