The E2Data software stack employs the Apache YARN resource management framework to dynamically acquire resources and dispatch tasks to them. Traditionally, the only resource types YARN could manage were memory and virtual cores. As of YARN 3.x.x, hardware accelerators such as GPUs and FPGAs are also supported. This is of crucial importance for E2Data, whose aim is to establish heterogeneous processing in modern big data frameworks (e.g., Apache Flink). Given two diverse sets of tasks and hardware resources, the E2Data scheduler identifies the optimal mapping between the two and enforces its execution. Since it is YARN that allocates resources during execution, it must be able to receive a hardware device as input and dispatch a task to it. However, in a stock YARN installation, when it comes to GPUs or FPGAs, a NodeManager is only aware of whether an accelerator exists or not. It cannot tell the difference between similar devices and cannot pick the specific one that the user (or a scheduler) instructs it to.
To tackle this deficiency, we have modified YARN to support multiple GPU types. This document outlines our approach and provides some technical information.
The key idea of our approach is that we handle each GPU model as a distinct resource type. By default, YARN considers two distinct resource types for accelerators:
- yarn.io/gpu
- yarn.io/fpga
We extend this list by attaching the model of a device to the name of its resource type. For example, assuming a Tesla V100-SXM2-32GB GPU is available, we need to declare in YARN's configuration that the corresponding NodeManager supports the yarn.io/gpu-teslav100sxm232gb resource type. To ensure that a specific device is uniquely described within a cluster, we adopt the following naming convention: we take the model name as reported by the nvidia-smi utility, remove all spaces and "-" characters, and turn all letters to lowercase.
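As an illustration of this convention, the following shell one-liner (a sketch that assumes nvidia-smi is available on the node) derives the normalized model name of the first GPU:

nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1 | tr -d ' -' | tr '[:upper:]' '[:lower:]'

For a Tesla V100-SXM2-32GB card this prints teslav100sxm232gb, i.e., the suffix of the yarn.io/gpu-teslav100sxm232gb resource type mentioned above.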
To build the modified YARN from source, run the following from the top-level directory of the Hadoop source tree:

mvn install -Dcontainer-executor.conf.dir=<path of container-executor.cfg> -DskipTests -Pnative -Pdist -Dtar
If the build process finishes successfully, a tar file with YARN's binaries should appear under: $HADOOP_HOME/hadoop-dist/target
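As a sketch (the exact archive name depends on the Hadoop version being built), the distribution can then be unpacked to the desired installation directory, for instance:

tar -xzf $HADOOP_HOME/hadoop-dist/target/hadoop-<version>.tar.gz -C /opt/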
The container-executor binary must be owned by root and carry the setuid bit, so that the NodeManager can launch containers on behalf of the submitting user:

$ chown root $HADOOP_HOME/bin/container-executor
$ chmod u+s $HADOOP_HOME/bin/container-executor
$ chmod u-rwx $HADOOP_HOME/bin/container-executor
$ chmod g+s,o-rwx $HADOOP_HOME/bin/container-executor
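A quick sanity check after the commands above (the group column will show whatever group owns the binary in the actual installation):

$ ls -l $HADOOP_HOME/bin/container-executor

The reported mode should be along the lines of ---Sr-s---, with root as the owner.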
An example value for the configuration directory passed via -Dcontainer-executor.conf.dir above, i.e., the directory that holds container-executor.cfg, is:

/opt/hadoop-container-conf
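For reference, a minimal container-executor.cfg that enables the GPU module and CGroups support could look like the following sketch; the group name, banned users, and cgroup root are placeholders that must be adapted to the actual deployment:

yarn.nodemanager.linux-container-executor.group=yarn
banned.users=hdfs,mapred,bin
min.user.id=1000
[gpu]
  module.enabled=true
[cgroups]
  root=/sys/fs/cgroup
  yarn-hierarchy=yarn

Note that the yarn-hierarchy value corresponds to the yarn cgroup hierarchies created further below.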
To verify that the container-executor binary and its configuration are set up correctly, run:

$HADOOP_HOME/bin/container-executor --checksetup
YARN isolates containers and devices through Linux control groups (cgroups). The cpu and devices hierarchies used by YARN can be created with the cgcreate utility:

cgcreate -t <user>:<group> -a <user>:<group> -g cpu:yarn
cgcreate -t <user>:<group> -a <user>:<group> -g devices:yarn
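Assuming the cgroup-tools package that provides cgcreate is installed, the resulting hierarchies can be verified with lscgroup, for example:

lscgroup | grep yarn

This should list entries such as cpu,cpuacct:/yarn and devices:/yarn, depending on how the controllers are mounted.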