DataToKnowledge / wheretolive-backend

GNU General Public License v2.0

Evaluate spark cluster #46

Closed fabiofumarola closed 9 years ago

fabiofumarola commented 9 years ago

version 1.3.1

Gigitsu commented 9 years ago

NB: this is a draft

Introduction

In distributed mode, Spark uses a master/slave architecture with one central coordinator, called the driver, which communicates with many distributed workers, called executors.

First, let's explain some naming conventions.

1. Master/Worker

The terms master and worker describe the centralized and distributed portions of the cluster manager and refer to machines (virtual or physical). For instance, Hadoop YARN runs a master daemon called the Resource Manager and several worker daemons called Node Managers.

2. Driver/Executor

The terms driver and executor both refer to processes rather than machines. The driver is the process where the main() method runs; it does two things: it converts a user program into tasks and schedules those tasks on executors. Executors are the processes responsible for running the individual tasks of a given Spark job. Spark can run both the driver and the executors on YARN worker nodes, as sketched below.
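
As a concrete illustration, this is roughly how an application could be submitted so that both the driver and the executors run inside YARN containers; the class name, jar path, and sizing flags below are placeholders, not the project's actual values:

```bash
# Hypothetical submission (class, jar, and sizes are placeholders).
# --master yarn-cluster runs the driver itself inside a YARN container;
# --num-executors / --executor-memory size the executor processes.
./bin/spark-submit \
  --master yarn-cluster \
  --num-executors 3 \
  --executor-memory 2g \
  --class com.example.Main \
  path/to/application.jar
```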

Cluster manager

Spark can run over a variety of cluster managers (such as Hadoop YARN or Apache Mesos), but the most straightforward way to run a Spark application on a cluster is to use the built-in Standalone Cluster Manager. It consists of a master and multiple workers, each with a configured amount of memory and CPU cores.
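
For reference, a minimal standalone cluster can be brought up by hand with the scripts that ship with Spark (the master hostname below is a placeholder):

```bash
# On the machine chosen as master: starts the master daemon
# (cluster URL on port 7077, web UI on port 8080 by default).
./sbin/start-master.sh

# On each worker machine: start a worker and register it with the master.
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://master-host:7077
```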

We probably need to build an ad-hoc Dockerfile to set up a Spark standalone cluster.
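
Assuming a single image (here called our/spark:1.3.1, a placeholder name) with Spark installed under /opt/spark (also an assumption), running the two roles could look roughly like this:

```bash
# Start the master in one container; -h fixes the hostname so the
# master URL matches the name workers connect to.
docker run -d --name spark-master -h spark-master -p 7077:7077 -p 8080:8080 \
  our/spark:1.3.1 /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master

# Start a worker in another container, linked to the master container.
docker run -d --name spark-worker-1 --link spark-master:spark-master \
  our/spark:1.3.1 /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker \
  spark://spark-master:7077
```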

High availability

By default, standalone scheduling clusters are resilient to worker failures. However, the scheduler uses a Master to make scheduling decisions, and this creates a single point of failure: if the Master crashes, no new applications can be created. In order to circumvent this, the best way to go is to use ZooKeeper.

ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance, grouping, and naming. Spark supports ZooKeeper natively. For a reliable ZooKeeper service, it should be deployed as a cluster known as an ensemble: as long as a majority of the ensemble is up, the service remains available. Because ZooKeeper requires a majority, it is best to use an odd number of machines.
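
A quick way to check that an ensemble still has a majority up is ZooKeeper's four-letter srvr command; the hostnames below are placeholders:

```bash
# Each node reports its Mode; a healthy 3-node ensemble shows one
# "leader" and two "follower" entries.
for host in zk1 zk2 zk3; do
  echo srvr | nc "$host" 2181 | grep Mode
done
```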

Configuration

In order to enable this recovery mode, set SPARK_DAEMON_JAVA_OPTS in spark-env.sh using the following properties:

| System property | Meaning |
| --- | --- |
| spark.deploy.recoveryMode | Set to ZOOKEEPER to enable standby Master recovery mode (default: NONE). |
| spark.deploy.zookeeper.url | The ZooKeeper cluster URL (e.g., 192.168.1.100:2181,192.168.1.101:2181). |
| spark.deploy.zookeeper.dir | The directory in ZooKeeper to store recovery state (default: /spark). |
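
Concretely, the entry in conf/spark-env.sh on each master node would look roughly like this (the ZooKeeper addresses are placeholders):

```bash
# conf/spark-env.sh on every Spark master
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=192.168.1.100:2181,192.168.1.101:2181,192.168.1.102:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
```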

Requirements

In a preliminary analysis, we need at least 6 nodes to set up a Spark standalone cluster: 3 to use as masters and the remaining 3 as workers.