apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Server Rack Metadata Retrieval and Persistence on Azure Environment #6532

Open xulinjintu opened 3 years ago

xulinjintu commented 3 years ago

In order to move Pinot to the cloud, cloud VM rack metadata awareness is needed for maintenance purposes and disaster recovery. In Azure use cases, fault domain (FD) metadata is needed in the Pinot ecosystem during update or fault events.

In this issue we will focus on retrieving and persisting the FD and VM instance information of Azure VMs for Pinot servers, so that later on we can use this mapping with the existing Pool-Based Instance Assignment or Replica-Group Instance Assignment strategies for high data availability.
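
For illustration, here is a minimal sketch of the retrieval side using the Azure Instance Metadata Service (IMDS). The endpoint and the mandatory `Metadata: true` header are standard IMDS; the api-version is just one of the published versions, and the class name is illustrative rather than the actual implementation in this issue:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/**
 * Sketch: read the fault domain of the current VM from the Azure Instance
 * Metadata Service (IMDS). Only works when run on an Azure VM.
 */
public class AzureFaultDomainFetcher {
  private static final String FD_URL =
      "http://169.254.169.254/metadata/instance/compute/platformFaultDomain"
          + "?api-version=2021-02-01&format=text";

  public static String fetchFaultDomain() throws IOException {
    HttpURLConnection conn = (HttpURLConnection) new URL(FD_URL).openConnection();
    conn.setRequestProperty("Metadata", "true");  // mandatory IMDS header
    conn.setConnectTimeout(2_000);
    conn.setReadTimeout(2_000);
    try (InputStream in = conn.getInputStream()) {
      return new String(in.readAllBytes(), StandardCharsets.UTF_8).trim();
    } finally {
      conn.disconnect();
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("platformFaultDomain = " + fetchFaultDomain());
  }
}
```

The same IMDS compute endpoint also exposes `platformUpdateDomain`, `location`, and `zone`, should we decide to persist more than the FD.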

xiangfu0 commented 3 years ago

So in this case, we need to set server instance metadata and use it for segment assignment. I feel the complication is that we need to handle rack metadata changes, which involve a segment assignment rebalance.
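
Since Pinot instance configuration lives in ZooKeeper via Helix, one possible way to persist the retrieved value is as a custom field on the Helix `InstanceConfig`. This is only a sketch, not the agreed-upon design: the field name `FAULT_DOMAIN` is made up here, and it assumes the Helix admin API is used directly:

```java
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.InstanceConfig;

/**
 * Sketch: persist the fault domain as a custom field on the server's Helix
 * InstanceConfig so that assignment strategies can read it back later.
 * The field name "FAULT_DOMAIN" is illustrative, not an agreed-upon key.
 */
public class FaultDomainPersister {
  public static void persistFaultDomain(String zkAddress, String clusterName,
      String instanceName, String faultDomain) {
    ZKHelixAdmin admin = new ZKHelixAdmin(zkAddress);
    try {
      InstanceConfig instanceConfig = admin.getInstanceConfig(clusterName, instanceName);
      // Store the FD as a simple key-value field on the instance znode.
      instanceConfig.getRecord().setSimpleField("FAULT_DOMAIN", faultDomain);
      admin.setInstanceConfig(clusterName, instanceName, instanceConfig);
    } finally {
      admin.close();
    }
  }
}
```

Any assignment strategy (or a rebalance) could then read this field back from the instance config when computing placements.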

opschronicle commented 3 years ago

@xulinjintu wondering whether your changes can be used for AWS, as I am interested in using this there. Also, once a segment is tagged with failure domain / zone information, how are the segment replicas handled so that they spread across different zones?

opschronicle commented 3 years ago

@xulinjintu Also, I was trying to compare Azure to AWS and wondering whether it is possible to add (if not already added) the following information for Azure:

a) Region
b) Availability Zone
c) Availability Set
d) Fault Domain (already included)

This will enable AWS to use this implementation directly, and it potentially works for GCP too, because AWS high availability is driven by Availability Zones and Regions. This in turn helps Azure as well, because you can then cover failures of a particular Availability Zone, which is highly important for a proper HA design.

FYI @kishoreg, @siddharthteotia

xulinjintu commented 3 years ago

> So in this case, we need to set server instance metadata and use it for segment assignment. I feel the complication is that we need to handle rack metadata changes, which involve a segment assignment rebalance.

@fx19880617 This is the first step; the instance assignment strategy will be addressed in another issue.

xulinjintu commented 3 years ago

@pabrahamusa This can definitely be extended to AWS; the additional work would be implementing AWS metadata retrieval processors. AWS-specific configs might be needed.

Regarding the properties we retrieve, I wonder if the fault domain is unique across different regions. An availability zone is essentially a combination of FD and UD. Since update domains are subsets of fault domains, I wonder if we only need the FD to achieve HA.

opschronicle commented 3 years ago

@xulinjintu Adding just the FD should work if the FD is unique per AZ. However, do you think it would be better to add Availability Set, FD, and UD to provide more granular control and flexibility over replicas? For HA, multi-AZ is only needed where single points of failure (SPOF) are eliminated with an N+1 or 2N redundancy configuration. A multi-region design can bring other complexities like replication lag, so we could potentially ignore Region altogether. Please note that in AWS, instead of Availability Sets, FD, and UD, it is just the AZ (Availability Zone).

mcvsubbu commented 3 years ago

@pabrahamusa, @fx19880617, the idea here is to make it cloud-neutral. The goal is to assign instances in such a way that each instance for a table is in an "island". The definition of an "island" is cloud-specific. The requirement is that the provider gives a guarantee that, should there be some planned outage, no two "islands" will be taken down at the same time. For Azure, this happens to be called a Fault Domain. For a data center running native hardware, this turns out to be a "rack". If there are APIs to recognize the "island" to which an instance belongs (or APIs to ask for N instances in N -- or even M -- different islands), then we can potentially use those APIs to arrive at an optimal instance assignment for a table.

The approach followed is that we mark in a single place in ZooKeeper (at the cluster level) the "ID" of the provider (e.g. "Azure" or "XYZ"). We then have specific paths in ZK where the characteristics, the API, perhaps the name of the plugin to be loaded, etc. are specified for the "ID" in question.

If the ID is not specified, an installation will not see any modifications to existing znodes or any new znodes. @xulinjintu is driving the design for this, so please provide input in the design document.
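
To make the shape of that concrete, here is a rough sketch of what such a provider plugin could look like. All names are hypothetical and not taken from the design doc; the point is that the cluster-level provider ID selects which implementation gets loaded, and a missing ID means nothing is loaded and nothing in ZK changes:

```java
/**
 * Hypothetical plugin interface: each cloud provider (or bare-metal setup)
 * knows how to resolve the "island" (fault domain, rack, AZ, ...) of an
 * instance. Names are illustrative only.
 */
public interface EnvironmentProvider {
  /** e.g. "azure", "aws", "on-prem" -- matched against the cluster-level ID. */
  String getProviderId();

  /** Returns the failure "island" (FD / rack / AZ) the given instance lives in. */
  String getFailureIsland(String instanceId) throws Exception;
}

class EnvironmentProviderLoader {
  /**
   * Loads the provider class configured for the cluster-level provider ID.
   * Reading the ID and the class name from their ZK paths is elided here;
   * if no ID is configured, no provider is loaded and the cluster is untouched.
   */
  static EnvironmentProvider load(String providerClassName) throws ReflectiveOperationException {
    if (providerClassName == null || providerClassName.isEmpty()) {
      return null;  // no provider configured: behave exactly as before
    }
    return (EnvironmentProvider) Class.forName(providerClassName)
        .getDeclaredConstructor().newInstance();
  }
}
```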

cc: @kishoreg

Jackie-Jiang commented 3 years ago

The instance pool concept can be used as the availability set. Please read https://docs.pinot.apache.org/operators/operating-pinot/instance-assignment#pool-based-instance-assignment for more context.

opschronicle commented 3 years ago

@mcvsubbu I am using AWS, so the terms I use may be more AWS-specific. However, anything like an "island" will cover my requirement and should work for all cloud providers as well.

So for AWS, the steps to achieve HA could be:

a) Dynamically tag segments with the island (FD/AZ). (Current PR)
b) Dynamically tag servers based on the island (FD/AZ). (https://github.com/apache/incubator-pinot/issues/6688)
c) All I have to do is provide an implementation for the AWS metadata fetch, as @xulinjintu suggested (see the sketch below), and the rest should work for me.
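
For step (c), a minimal sketch of what an AWS fetcher could look like, reading the Availability Zone from the EC2 instance metadata service. It uses the plain IMDSv1-style GET and omits IMDSv2 session-token handling; the class and method names are just illustrative:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/**
 * Sketch: read the Availability Zone of the current instance from the EC2
 * instance metadata service. Only works when run on an EC2 instance.
 */
public class AwsAvailabilityZoneFetcher {
  private static final String AZ_URL =
      "http://169.254.169.254/latest/meta-data/placement/availability-zone";

  public static String fetchAvailabilityZone() throws IOException {
    HttpURLConnection conn = (HttpURLConnection) new URL(AZ_URL).openConnection();
    conn.setConnectTimeout(2_000);
    conn.setReadTimeout(2_000);
    try (InputStream in = conn.getInputStream()) {
      // e.g. "us-east-1a"; the trailing letter identifies the zone within the region.
      return new String(in.readAllBytes(), StandardCharsets.UTF_8).trim();
    } finally {
      conn.disconnect();
    }
  }
}
```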

desaijay230592 commented 3 years ago

Please refer to PR #6842 for the failure domain retrieval logic for server instances.