This Docker container contains a full Hadoop distribution with the following components:
For all below steps, the Docker image segence/hadoop:latest
has to be built or pulled from DockerHub.
./build-docker-image.sh
docker pull segence/hadoop:latest
The default SSH port of the Docker containers is 2222
.
This is, so in a standalone cluster setup, each namenode and datanode
can be ran on separate physical server (or virtual appliances). The servers can
still use the default SSH port 22
but the datanodes, running inside Docker containers,
will be using port 2222
. This port is automatically exposed.
It's important to note that all exposed ports have to be open on the firewall of
the physical servers, so other nodes in the cluster can access them.
Hadoop host names have to follow the **hadoop-** format. Automatic acceptance of SSH host key fingerprints is enabled for any hosts with the domain name `hadoop-`. This is so Hadoop nodes on the cluster can establish SSH connections between each other without manually accepting host keys. This means that the Docker hosts, which run the Hadoop namenode and datanode containers have to be mapped to DNS names. One can use for example Microsoft Windows Server to set up a DNS server.
This will set up a local Hadoop cluster using bridged networking with one namenode and one datanode.
cd cluster-setup/local-cluster
slaves-config/slaves
file if you want to add more slaves (datanodes)
other than the default one slave node. If you add more slaves then also edit the
docker-compose.yml
file by adding more slave node configurations.docker-compose up -d
You can log into the namenode (master) by issuing docker exec -it hadoop-namenode bash
and to the datanode (slave) by docker exec -it hadoop-datanode1 bash
.
By default, the HDFS replication factor is set to 1, because it is assumed that
a local Docker cluster will be started with a single datanode.
To override the replication setting simply change the HDFS_REPLICATION_FACTOR
environment variable in the docker-compose.yml
file (and also add more datanodes).
Adding more data nodes adds the complexity of exposing all datanode UI ports to
localhost
. In this scenario, no UI ports should be exposed to avoid the conflict.
Create the hadoop
user on the host system, e.g. useradd hadoop
RHEL/CentOS note
If you encounter the following error message when running a Docker container:
WARNING: IPv4 forwarding is disabled. Networking will not work.
then turn on packet forwading (RHEL 7): /sbin/sysctl -w net.ipv4.ip_forward=1
You can use the script cluster-setup/standalone-cluster/setup-rhel.sh
to achieve
the above, as well as to create the required directories and change their ownership
(as in point 1 and 2 below, so you can skip them if you used the RHEL setup script).
The cluster setup runs with host networking, so the Hadoop nodes will get the hostname and DNS settings directly from the host machine. Make sure IP addresses and DNS names, as well as DNS resolution is correctly set up on the host machines.
/hadoop/data
/hadoop/deployments
hadoop
user own the directories: chown -R hadoop:hadoop /hadoop
/hadoop/slaves-config/slaves
listing all slave node host names on a separate linestart-namenode.sh
file onto the system (e.g. into /hadoop/start-namenode.sh
)/hadoop/start-namenode.sh [HDFS REPLICATION FACTOR]
,
where HDFS REPLICATION FACTOR is the replication factor for HDFS blocks (defaults to 2)./hadoop/data
hadoop
user own the directories: chown -R hadoop:hadoop /hadoop
/hadoop/slaves-config/slaves
listing all slave node host names on a separate linestart-datanode.sh
file onto the system (e.g. into /hadoop/start-datanode.sh
)/hadoop/start-datanode.sh <NAMENODE HOST NAME> [HDFS REPLICATION FACTOR]
,
where NAMENODE HOST NAME is the host name of the namenode and HDFS REPLICATION FACTOR
is the replication factor for HDFS blocks (defaults to 2, and has to be consistent throughout
all cluster nodes).Once either a local or standalone cluster is provisioned, follow the below steps:
docker exec -it hadoop-namenode bash
su hadoop
~/utils/format-namenode.sh
HDFS has to be restarted the first time before MapReduce jobs can be successfully ran. This is because HDFS creates some data on the first run but stopping it can clean up its state so MapReduce jobs can be ran through YARN afterwards.
Execute the following commands as root
:
service hadoop start
service hadoop stop
service hadoop start
Restarting HDFS has to be done after formatting the namenode, otherwise running the first MapReduce job will always fail and the cluster will terminate.
Verify that the datanodes are correctly registered, issue on namenode: hdfs dfsadmin -report
When you're logged into the namenode, simply use the hadoop dfs ...
command
to interact with the HDFS cluster.
E.g. listing the contents of the root of the file system: hdfs dfs -ls /
bin
directory../hdfs dfs -ls hdfs://localhost:8020/
Web UIs | URL |
---|---|
Hadoop Name Node | http://localhost:50070 |
Hadoop Data Node | http://localhost:50075 |
WebHDFS REST API | http://localhost:50070/webhdfs/v1 |
NodeManager UI on Data Node | http://localhost:8042 |
YARN Resource Manager | http://localhost:8088 |
Spark UI | http://localhost:4040 |
Zeppelin UI | http://localhost:9001 |
Change localhost
to the IP address or host name of the namenode.
The script will create a directory called input with some sample files. It'll upload them into the HDFS cluster and run a simple MapReduce job. It'll print the results to the console.
docker exec -it hadoop-namenode bash
su hadoop
cd
~/utils/run-wordcount.sh
Running the word count example leaves some files in HDFS. The below example Spark job reads those files and simply splits the file contents by whitespaces. It then prints out the results.
Once you're on the namenode, issue spark-shell
:
val input = sc.textFile("/user/hadoop/input")
val splitContent = input.map(r => r.split(" "))
splitContent.foreach(line => println(line.toSeq))
To run it against the YARN cluster, launch with spark-shell --master=yarn
.
In this way, println
does not execute in the driver since you are executing it on
elements of the RDD. It executes in an executor, which can happen to
execute in-process in local mode. In general you should not expect this to print
results in the driver.
Zeppelin has to be ran as the hadoop user, so make sure to start the service as the hadoop user. It won't run properly with all interpreters under a different user!
docker exec -it hadoop-namenode bash
su hadoop
cd
zeppelin-daemon.sh start
%r
df <- read.df(sqlContext, "/user/hadoop/input/", source = "text")
head(df)