CentaurusInfra / mizar

Mizar – Experimental, High Scale and High Performance Cloud Network https://mizar.readthedocs.io
GNU General Public License v2.0

[Arktos-Mizar-Integration] In multiple TP scenario, mizar operators are providing same conflicting grpc server host address #591

Closed. Hong-Chang closed this issue 2 years ago.

Hong-Chang commented 2 years ago

What happened: In the current Arktos-Mizar integration design, each Arktos TP runs its own Mizar operator, so in a 2 TP setup there are 2 operators running. Each operator has its own context and should not interfere with the others.

The Mizar operator is also a gRPC server, and Arktos runs Mizar controllers that act as gRPC clients to communicate with it. When a certain event happens in the cluster, for example a pod being created, the Mizar pod controller (the gRPC client in Arktos) notifies the operator (the gRPC server in Mizar) of the event, and the operator performs the corresponding operations.

Mizar operators use the host IP as their gRPC server address. Currently Mizar operators run on RP nodes, so in a multiple TP scenario two different operators can end up on the same RP machine, which means different gRPC servers share the same server address. When a client sends an event to the gRPC server, it may connect to an unrelated gRPC server, which is unexpected.
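To make the collision concrete, here is a minimal, hypothetical sketch (not Mizar's actual operator code) of the pattern described above: a gRPC server bound to the node's host IP on a fixed, well-known port. The port number 50051 and the helper names are assumptions for illustration only.

```python
# Hypothetical sketch of an operator binding its gRPC server to the host IP
# on a fixed port; two operators on the same RP node would bind the identical
# address, so a client dialing host_ip:port cannot tell which one it reaches.
import socket
from concurrent import futures

import grpc

OPERATOR_GRPC_PORT = 50051  # assumed fixed, well-known port


def host_ip() -> str:
    # Roughly "use the host IP"; the real operator may discover its address
    # differently (e.g. from the downward API or the node object).
    return socket.gethostbyname(socket.gethostname())


def serve() -> grpc.Server:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    bound = server.add_insecure_port(f"{host_ip()}:{OPERATOR_GRPC_PORT}")
    # If SO_REUSEPORT happens to be in effect, a second operator's bind on the
    # same node can even succeed, and connections are then spread across
    # unrelated operators instead of failing fast.
    if bound == 0:
        raise RuntimeError("failed to bind operator gRPC port")
    server.start()
    return server
```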

In a 2 TP 1 RP scenario, the issue reproduces consistently, since the 2 operators run on the same RP machine.

What you expected to happen: Different gRPC servers should be distinguishable from each other. To be unique, they should either use different IP addresses, or the same IP address with different ports.

How to reproduce it (as minimally and precisely as possible): Start Arktos in 2 TP 1 RP mode; events from one TP can be observed being delivered to both operators.
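One way to read the "same IP address plus different ports" expectation is sketched below. The environment variable name and the idea of advertising the bound address back to the TP-side controllers are hypothetical, not part of the current design.

```python
# Hedged sketch: each operator either takes a per-TP port from configuration
# or lets the OS assign one (port 0), then reports the address it actually
# bound, so two operators sharing an RP node never use the same address.
import os
import socket
from concurrent import futures

import grpc


def serve_unique():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    host = socket.gethostbyname(socket.gethostname())
    port = int(os.environ.get("MIZAR_OPERATOR_GRPC_PORT", "0"))  # assumed env var
    bound = server.add_insecure_port(f"{host}:{port}")  # port 0 lets the OS choose
    if bound == 0:
        raise RuntimeError("failed to bind operator gRPC port")
    server.start()
    # The TP's Mizar controllers would then have to be told "host:bound"
    # (e.g. via a per-TP config entry or annotation) instead of assuming a
    # well-known port.
    return server, f"{host}:{bound}"
```

The trade-off is that clients can no longer hard-code the server address; whichever option is chosen, each operator has to publish the address it actually listens on.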

vinaykul commented 2 years ago

Mizar is not the right place to put workarounds for Arktos issues. Mizar's operator yaml allowed it to run on any node, which means it can be placed alongside heavy workload pods. This is not correct: it can introduce unreliability into the system if the operator has to compete with user workload pods for resources. The operator should be treated as a system-infra pod and placed on the master node. This was fixed in a backward-compatible way in CL https://github.com/CentaurusInfra/mizar/commit/d9038c1390f1cfcf5f540b8787c25101431c9956

With the Arktos scaleout architecture, if for some reason Arktos is unable to run the operator pods on the TP master nodes, Arktos should ensure that there are enough nodes for the number of operators it intends to deploy (NUM_NODES >= NUM_OPERATOR_PODS) and use anti-affinity to ensure that no two operators collide on the same node.
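If the operator pods do end up on regular nodes, the anti-affinity idea could look roughly like the sketch below, equally expressible directly in the operator's deployment yaml. The deployment name, namespace, and the app: mizar-operator label are assumptions here, not the actual manifest values.

```python
# Hedged sketch: require that no two operator pods are scheduled onto the
# same node, using the kubernetes Python client to patch the (assumed)
# mizar-operator Deployment with a required pod anti-affinity rule.
from kubernetes import client, config


def add_operator_anti_affinity():
    config.load_kube_config()
    apps = client.AppsV1Api()

    anti_affinity = client.V1Affinity(
        pod_anti_affinity=client.V1PodAntiAffinity(
            required_during_scheduling_ignored_during_execution=[
                client.V1PodAffinityTerm(
                    label_selector=client.V1LabelSelector(
                        match_labels={"app": "mizar-operator"}  # assumed label
                    ),
                    # Hostname topology key: the scheduler refuses to place
                    # two matching pods on the same node.
                    topology_key="kubernetes.io/hostname",
                )
            ]
        )
    )

    patch = {"spec": {"template": {"spec": {"affinity": anti_affinity}}}}
    apps.patch_namespaced_deployment(
        name="mizar-operator",  # assumed deployment name
        namespace="default",    # assumed namespace
        body=patch,
    )
```

With a required anti-affinity rule, any extra operator pod beyond the node count simply stays Pending rather than colliding with an existing one, which matches the NUM_NODES >= NUM_OPERATOR_PODS condition above.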