amino-os / Amino.Run

Amino Distributed OS - Runtime Manager
Apache License 2.0

Umbrella issue: Kernel Server and Sapphire object monitoring #78

Open dhzhuo opened 6 years ago

dhzhuo commented 6 years ago

[Quinton] This description is out of date? See https://github.com/Huawei-PaaS/DCAP-Sapphire/issues/78#issuecomment-377374687 instead.

We can use this as a master issue to track all the individual tasks, that are approximately:

  1. #344: Implement object health checks in the base Sapphire server policy, storing the resulting health metrics in the local kernel server.
  2. Implement health checks in the kernel server (check its own health, and store it locally).
  3. Implement health check propagation from all kernel servers to OMS.
  4. Have OMS invoke "reportHealth()" or similar on Group policies running on OMS (so that they can take appropriate action, e.g. replace replicas).
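
The four tasks above could fit together roughly as follows. This is only a minimal sketch under stated assumptions: every type and method name here (`Health`, `KernelServerHealthStore`, `OmsHealthAggregator`, `GroupPolicyHealthListener`) is hypothetical and not taken from the Amino.Run code base, and `reportHealth()` merely mirrors the "or similar" name suggested in task 4. Remote calls are replaced by plain method calls to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// All names below are hypothetical illustrations, not Amino.Run APIs.
enum Health { HEALTHY, UNHEALTHY }

/** Task 4: callback a group policy on OMS could implement. */
interface GroupPolicyHealthListener {
    void reportHealth(String objectId, Health health);
}

/** Tasks 1 and 2: a kernel server keeps object health and its own health locally. */
class KernelServerHealthStore {
    private final String serverId;
    private final Map<String, Health> objectHealth = new ConcurrentHashMap<>();
    private volatile Health ownHealth = Health.HEALTHY;

    KernelServerHealthStore(String serverId) { this.serverId = serverId; }

    String serverId() { return serverId; }
    void recordObjectHealth(String objectId, Health health) { objectHealth.put(objectId, health); }
    void recordOwnHealth(Health health) { ownHealth = health; }
    Health ownHealth() { return ownHealth; }

    /** Returns a copy so the caller sees a stable view of the stored object health. */
    Map<String, Health> snapshot() { return new HashMap<>(objectHealth); }
}

/** Tasks 3 and 4: OMS collects reports and forwards object health to group policies. */
class OmsHealthAggregator {
    private final Map<String, Health> serverHealth = new ConcurrentHashMap<>();
    private final List<GroupPolicyHealthListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(GroupPolicyHealthListener listener) { listeners.add(listener); }

    /** Stands in for the kernel-server-to-OMS propagation call (task 3). */
    void receiveReport(KernelServerHealthStore store) {
        serverHealth.put(store.serverId(), store.ownHealth());
        for (Map.Entry<String, Health> e : store.snapshot().entrySet()) {
            for (GroupPolicyHealthListener listener : listeners) {
                listener.reportHealth(e.getKey(), e.getValue());
            }
        }
    }

    /** Returns null if the server has never reported. */
    Health serverHealth(String serverId) { return serverHealth.get(serverId); }
}
```

In this sketch a group policy would register itself with `addListener` and decide in `reportHealth` whether to take action, for example replacing a replica that was reported `UNHEALTHY`.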

Consider deleting the original text below.

==========================

From Donghui: We need a membership management mechanism in Sapphire core:

  1. Keep track of Kernel servers
  2. Keep track of Sapphire objects and their locations
  3. Kernel server health check
  4. Publish events upon Kernel server registration and deregistration
  5. Sapphire object health check (should sapphire core do it?)
  6. Publish events upon Sapphire object creation and deletion
ghost commented 6 years ago

@DonghuiZhuo We have most of this already.

  1. OMS keeps track of Kernel Servers.
  2. OMS keeps track of SO's and their locations.
  3. Health checks I'm not so sure of, but I think we should add this to OMS.
  4. Publishing events on Kernel Server registration. That's not there yet. I'm curious what use case you have in mind?
  5. SO health check. That's not there yet. We could build that either into the DM library, to be performed on client DMs, or into the OMS.
  6. Publish events on SO creation. We already have that in OnMembershipChange() in the DM API. Do you have some other use case in mind that would not be covered by that?
dhzhuo commented 6 years ago

> OMS keeps track of Kernel Servers.

OMS today only keeps a list of Kernel Servers; it does not keep track of Kernel Servers. It does not know if a Kernel Server is live or dead. It does not remove dead Kernel Servers from its registry.
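
To illustrate the difference between keeping a list and tracking liveness, here is a minimal sketch of a heartbeat-based registry that can tell live servers from dead ones and evict the dead ones. The class `KernelServerRegistry`, its methods, and the timeout-based liveness rule are all assumptions made for illustration; they do not describe how OMS works today or how it was later implemented. Timestamps are passed in explicitly so the behavior is easy to test.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; not the actual OMS implementation.
class KernelServerRegistry {
    // serverId -> time of the last heartbeat, in milliseconds
    private final Map<String, Long> lastHeartbeatMillis = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    KernelServerRegistry(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    /** Registers the server on first call and refreshes it on later calls. */
    void heartbeat(String serverId, long nowMillis) {
        lastHeartbeatMillis.put(serverId, nowMillis);
    }

    /** A server is live if it is registered and its last heartbeat is within the timeout. */
    boolean isLive(String serverId, long nowMillis) {
        Long last = lastHeartbeatMillis.get(serverId);
        return last != null && nowMillis - last <= timeoutMillis;
    }

    /** Removes servers whose heartbeat is older than the timeout and returns their ids. */
    List<String> evictDead(long nowMillis) {
        List<String> evicted = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastHeartbeatMillis.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > timeoutMillis) {
                evicted.add(entry.getKey());
                it.remove();
            }
        }
        return evicted;
    }
}
```

A periodic task on OMS could call `evictDead(System.currentTimeMillis())` and hand the returned ids to whatever component reacts to dead servers.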

> OMS keeps track of SO's and their locations.

Same as above. We need a mechanism to monitor the health of SOs. We also need a mechanism that can bring up a new SO instance when an existing SO instance dies. We need something like a Replica Set at the SO level.

> Health checks I'm not so sure of, but I think we should add this to OMS.

Are you referring to the health check of Kernel Server or health check of SO?

> Publishing events on Kernel Server registration. That's not there yet. I'm curious what use case you have in mind?

Just a thought to share: I think Kernel Server registration/deregistration events should be broadcast to all group policies so that group policies have a chance to relocate SO instances.
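
As a rough illustration of that idea, the sketch below broadcasts registration and deregistration events to subscribed group policies, and shows one policy that moves its replicas off a deregistered server. All names (`KernelServerMembershipListener`, `MembershipBroadcaster`, `RelocatingGroupPolicy`) are hypothetical and not part of the Sapphire DM API, and "relocation" is reduced to updating an in-memory map; picking the first remaining live server is an arbitrary placeholder choice.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical listener interface; not part of the existing DM API.
interface KernelServerMembershipListener {
    void onKernelServerRegistered(String serverId);
    void onKernelServerDeregistered(String serverId);
}

/** Fans membership events out to every subscribed group policy. */
class MembershipBroadcaster {
    private final List<KernelServerMembershipListener> listeners = new CopyOnWriteArrayList<>();

    void subscribe(KernelServerMembershipListener listener) { listeners.add(listener); }

    void register(String serverId) {
        for (KernelServerMembershipListener l : listeners) l.onKernelServerRegistered(serverId);
    }

    void deregister(String serverId) {
        for (KernelServerMembershipListener l : listeners) l.onKernelServerDeregistered(serverId);
    }
}

/** Example group policy: moves replicas away from a server that has deregistered. */
class RelocatingGroupPolicy implements KernelServerMembershipListener {
    private final Map<String, String> replicaLocation = new HashMap<>(); // replicaId -> serverId
    private final Set<String> liveServers = new LinkedHashSet<>();

    synchronized void placeReplica(String replicaId, String serverId) {
        replicaLocation.put(replicaId, serverId);
    }

    synchronized String locationOf(String replicaId) { return replicaLocation.get(replicaId); }

    @Override
    public synchronized void onKernelServerRegistered(String serverId) {
        liveServers.add(serverId);
    }

    @Override
    public synchronized void onKernelServerDeregistered(String serverId) {
        liveServers.remove(serverId);
        if (liveServers.isEmpty()) return; // nowhere to relocate to
        String target = liveServers.iterator().next(); // placeholder placement choice
        for (Map.Entry<String, String> entry : replicaLocation.entrySet()) {
            if (entry.getValue().equals(serverId)) entry.setValue(target);
        }
    }
}
```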

dhzhuo commented 6 years ago

Some thoughts on monitoring: https://docs.google.com/document/d/1g5SnzsnyGXzdZVDF_uj9MQJomQpHS-PMpfwnYn4RNDU/edit#heading=h.j9vsjm8kyruk

quinton-hoole commented 6 years ago

See also #195

quinton-hoole commented 5 years ago

Status update: Some progress made in design document referenced in https://github.com/Huawei-PaaS/DCAP-Sapphire/issues/459#issuecomment-447521716

prostil commented 5 years ago

Subtasks for this issue are being worked on.

quinton-hoole commented 5 years ago

Unassigning myself. Reassign to me if there is anything you need me to do on this.

prostil commented 5 years ago

Related to #459 and #344. Moving this to the backlog, as not all of it is needed for KubeCon Barcelona.