MatrixAI / Overwatch

Distributed Infrastructure Telemetry

Overwatch/Adaptation Roadmap (QoS Semantics, Performance Monitoring, Language Design) #2

Open CMCDragonkai opened 6 years ago


This roadmap deals with the development of the Overwatch and Adaptation core modules (and their relationship to the Architect language). It is not final and is subject to change. I encourage everyone to expand the subcomponents and write down more details as development and exploration continue.

Overwatch/Adaptation

QoS Semantics

In order to scale a distributed infrastructure, it must be possible to measure the performance behaviour of our Automatons. This includes things like network latency and throughput, compute usage, storage usage and more. Measuring these details becomes increasingly complex as the infrastructure scales and more Automatons enter the Matrix network. These details are intended to be fed into our adaptation system so that recommendations can be made on the Matrix graph, which the Orchestrator will then apply. These applied recommendations are intended to optimise a multivariate cost function. One of the important variables is the total cloud utilisation cost. We expect that the substrate will at first be one of the standard cloud IaaS providers such as AWS, GCP or Azure. All of these services have complex cost accounting systems, and they often charge for different aspects of cloud computing on a utility basis.
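To make the "multivariate cost function" idea concrete, here is a minimal sketch. All names here (`Metrics`, the weight vector) are illustrative assumptions, not the actual Overwatch API; the real set of variables and weights would come from the constraint work below.

```python
# Hypothetical sketch of a multivariate cost function over Automaton metrics.
# The Metrics fields and the weights are illustrative, not the Overwatch API.
from dataclasses import dataclass

@dataclass
class Metrics:
    latency_ms: float        # mean network latency
    throughput_mbps: float   # sustained throughput
    cpu_util: float          # fraction of allocated CPU in use, 0..1
    storage_gb: float        # persistent storage consumed
    cloud_cost_usd: float    # metered utility cost from the IaaS provider

def cost(m: Metrics, w=(1.0, -0.5, 2.0, 0.1, 5.0)) -> float:
    """Weighted sum to minimise; throughput is negated since more is better."""
    return (w[0] * m.latency_ms
            + w[1] * m.throughput_mbps
            + w[2] * m.cpu_util
            + w[3] * m.storage_gb
            + w[4] * m.cloud_cost_usd)

better = Metrics(20.0, 800.0, 0.4, 10.0, 3.0)
worse = Metrics(90.0, 100.0, 0.9, 10.0, 7.0)
assert cost(better) < cost(worse)
```

A real adaptation loop would refit or learn the weights rather than hard-code them, but the shape of the objective (performance terms plus cloud utility cost) stays the same.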

  1. Investigate distributed system algorithms. This is usually about synchronisation and state management. This is often concerned with the correctness of a particular manipulation of state across a distributed system, but our optimisation concerns will often involve partial knowledge, where only parts of the system will know the performance metrics of the systems they are directly interacting with. We need to create a set of vocabulary that is useful for these concepts. The reason why this is important is also because not all constraints are about performance, some are about reliability and availability. See #6.
  2. Investigate network QoS theory: everything from measuring latency and bandwidth to accounting for throughput overhead, and identify the tools capable of measuring these on a Linux platform. See #5.
  3. Investigate BPF and other relevant Linux tracing tools. We only need to find the set of tools that will be useful for the constraint specification that is set in the Architect language. See #4 for latency monitoring and #9.
  4. Write down a set of possible QoS constraints that can be applied to Automaton composition and deployment. Speak to @CMCDragonkai about this, there are aspects of this that will be proprietary.
  5. From the possible QoS constraints, figure out the syntax and language primitives that can support higher level constructs: "Network Combinators". See #7.
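As a rough illustration of item 5, QoS constraints can be modelled as predicates over measured metrics, with combinators building compound "Network Combinator"-style constraints out of primitives. This is a sketch under my own assumptions; the names and metric keys are made up and none of this is the actual Architect syntax.

```python
# Hypothetical sketch of QoS constraints as composable predicates.
# Names and metric keys are illustrative, not the Architect language.
from typing import Callable, Dict

Constraint = Callable[[Dict[str, float]], bool]

def max_latency(ms: float) -> Constraint:
    """Primitive constraint: link latency must not exceed ms."""
    return lambda m: m["latency_ms"] <= ms

def min_bandwidth(mbps: float) -> Constraint:
    """Primitive constraint: link bandwidth must be at least mbps."""
    return lambda m: m["bandwidth_mbps"] >= mbps

def both(a: Constraint, b: Constraint) -> Constraint:
    """Combinator: both constraints must hold on the same link."""
    return lambda m: a(m) and b(m)

link_qos = both(max_latency(50.0), min_bandwidth(100.0))
assert link_qos({"latency_ms": 30.0, "bandwidth_mbps": 250.0})
assert not link_qos({"latency_ms": 80.0, "bandwidth_mbps": 250.0})
```

In the language itself these would presumably be typed declarations attached to Automaton compositions rather than Python closures, but the combinator structure is the point.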

Optimisation

See #8.

  1. Investigate relevant optimisation algorithms that can deal with partial knowledge (where only disparate parts of the system are aware of the relevant metrics for their immediate neighbours).
  2. Investigate orchestration and scheduling problems experienced by existing systems such as Mesos (Dominant Resource Fairness), Kubernetes, and other High Performance Computing problems. Differentiate batch tasks from service-like tasks.
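For reference on item 2, Dominant Resource Fairness (the Mesos allocation policy) can be sketched in a few lines: repeatedly grant one task to whichever user currently has the smallest dominant share (their largest fractional use of any single resource). The capacities and demands below are the usual toy example, not measurements from any real deployment.

```python
# Minimal sketch of Dominant Resource Fairness (DRF) allocation.
# Capacities and demands are an illustrative toy example.
def drf_allocate(capacity, demands, rounds):
    """Give one task per round to the user with the smallest dominant share."""
    used = {u: [0.0] * len(capacity) for u in demands}
    for _ in range(rounds):
        # Dominant share: a user's largest fraction of any single resource.
        def dominant(u):
            return max(used[u][i] / capacity[i] for i in range(len(capacity)))
        user = min(demands, key=dominant)
        # Stop when the chosen user's next task no longer fits in capacity.
        total = [sum(used[v][i] for v in demands) for i in range(len(capacity))]
        if any(total[i] + demands[user][i] > capacity[i]
               for i in range(len(capacity))):
            break
        for i in range(len(capacity)):
            used[user][i] += demands[user][i]
    return used

# 9 CPUs and 18 GB RAM; user A's tasks need <1 CPU, 4 GB>, B's need <3 CPU, 1 GB>.
alloc = drf_allocate(capacity=[9, 18], demands={"A": [1, 4], "B": [3, 1]},
                     rounds=20)
# A ends with 3 tasks <3 CPU, 12 GB>, B with 2 tasks <6 CPU, 2 GB>, so both
# reach the same dominant share of 2/3.
```

Note the relevance to our partial-knowledge concern: real DRF assumes a central allocator with full visibility of usage, which is exactly the assumption we may not be able to make across the Matrix network.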

Language Design

  1. Haskell Compiler Tools - speak to @CMCDragonkai about this