Hadoop HA Configuration #27


The previous article, Hadoop集群搭建 (Hadoop cluster setup), documented the process of building a Hadoop cluster. This post describes how to configure that cluster for high availability (HA) with automatic failover.

First, note that both the HDFS namenode and the YARN resourcemanager are single points of failure (SPOF): once the namenode or resourcemanager goes down, the entire HDFS or YARN cluster can no longer serve requests.

The solution is to run an active-standby pair of namenodes and an active-standby pair of resourcemanagers: when the active namenode or active resourcemanager becomes unavailable, the standby namenode or standby resourcemanager quickly takes over and continues serving.

The cluster built in the previous article (Hadoop集群搭建) is deployed as follows:

hostname ip services
bigdata1 192.168.56.102 namenode, resourcemanager, datanode, nodemanager
bigdata2 192.168.56.101 datanode, nodemanager
bigdata3 192.168.56.103 datanode, nodemanager

Here bigdata2 will run the standby namenode and the standby resourcemanager. Placing the active and standby services on different machines avoids a single machine failure taking both out at the same time.

The active and standby namenodes need to share the edit log. HDFS provides two mechanisms for this: the Quorum Journal Manager (QJM) and NFS. I use QJM here.

Following the official documentation, HDFS High Availability and ResourceManager High Availability, the configuration for this cluster is as follows.

Hadoop Configuration

core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://mycluster</value>
    </property>
    <property>
        <name>dfs.journalnode.edits.dir</name>
        <value>/home/hadoop/data/journal</value>
    </property>
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>bigdata1:2181,bigdata2:2181,bigdata3:2181</value>
    </property>
</configuration>

Note: fs.defaultFS must not specify a port when it points at a logical nameservice. I initially configured port 9000, and running hadoop fs -ls failed:

[hadoop@bigdata1 ~]$ hadoop fs -ls 
ls: Port 9000 specified in URI hdfs://mycluster:9000 but host 'mycluster' is a logical (HA) namenode and does not use port information.
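
With the port removed, the client resolves the logical nameservice to whichever namenode is currently active, so the command works again. A quick sanity check (the path here is just an example):

[hadoop@bigdata1 ~]$ hadoop fs -ls /
[hadoop@bigdata1 ~]$ hadoop fs -ls hdfs://mycluster/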

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.nameservices</name>
        <value>mycluster</value>
    </property>
    <property>
        <name>dfs.ha.namenodes.mycluster</name>
        <value>nn1,nn2</value>
    </property>
    <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.mycluster.nn1</name>
        <value>bigdata1:8020</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.mycluster.nn2</name>
        <value>bigdata2:8020</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.mycluster.nn1</name>
        <value>bigdata1:50070</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.mycluster.nn2</name>
        <value>bigdata2:50070</value>
    </property>
    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://bigdata1:8485;bigdata2:8485;bigdata3:8485/mycluster</value>
    </property>
    <property>
        <name>dfs.client.failover.proxy.provider.mycluster</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    <property>
        <name>dfs.ha.fencing.methods</name>
        <value>sshfence
            shell(/bin/true)
        </value>
    </property>
    <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/home/hadoop/.ssh/id_rsa</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/data/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/data/data</value>
    </property>
</configuration>

One point about fencing methods is worth noting: although the QJM only allows a single namenode to write, which prevents split-brain from corrupting the filesystem metadata, you still have to ensure that only one active namenode exists at any moment. A formerly active namenode that is no longer allowed to write to the QJM can still serve client read requests, and it only fails and shuts down the next time it tries to perform a write.

Consider this scenario: a failover occurs, but the former active namenode fails to transition to the standby state.

If that failure is ignored and the standby namenode is promoted to active anyway, the cluster ends up with two active namenodes.

To handle this situation, a fencing method is invoked at that point. With sshfence configured, the ZKFC (ZKFailoverController) process logs in via SSH to the machine hosting the former active namenode and kills its namenode process.

shell(/bin/true) is configured as the last fencing method so that even if SSH fails for some reason (hardware failure, network problems, and so on), the standby can still be promoted normally.
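
Once the cluster is up (see the Deployment section below), the failover behaviour can be exercised by hand. A minimal sketch, assuming nn1 on bigdata1 is currently active and <namenode PID> stands for whatever PID jps reports for that NameNode:

[hadoop@bigdata1 ~]$ hdfs haadmin -getServiceState nn1
[hadoop@bigdata1 ~]$ hdfs haadmin -getServiceState nn2
# simulate a failure of the active namenode
[hadoop@bigdata1 ~]$ kill -9 <namenode PID>
# shortly afterwards the other namenode should report "active"
[hadoop@bigdata1 ~]$ hdfs haadmin -getServiceState nn2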

yarn-site.xml

<configuration>
<!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.resourcemanager.cluster-id</name>
        <value>cluster1</value>
    </property>
    <property>
        <name>yarn.resourcemanager.ha.rm-ids</name>
        <value>rm1,rm2</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname.rm1</name>
        <value>bigdata1</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname.rm2</name>
        <value>bigdata2</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address.rm1</name>
        <value>bigdata1:8088</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address.rm2</name>
        <value>bigdata2:8088</value>
    </property>
    <property>
        <name>yarn.resourcemanager.zk-address</name>
        <value>bigdata1:2181,bigdata2:2181,bigdata3:2181</value>
    </property>
    <property>
        <name>yarn.nodemanager.log-dirs</name>
        <value>/home/hadoop/data/yarn/logs</value>
    </property>
    <property>
        <name>yarn.nodemanager.local-dirs</name>
        <value>/home/hadoop/data/yarn/local</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
</configuration>
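
Similarly, once both resourcemanagers are running (see the Deployment section below), their HA state can be checked with yarn rmadmin, using the rm1/rm2 ids configured above:

[hadoop@bigdata1 ~]$ yarn rmadmin -getServiceState rm1
[hadoop@bigdata1 ~]$ yarn rmadmin -getServiceState rm2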

ZooKeeper Configuration

The ZooKeeper configuration follows the official ZooKeeper Getting Started Guide.

zoo.cfg

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/hadoop/data/zookeeper
clientPort=2181

server.1=bigdata1:2888:3888
server.2=bigdata2:2888:3888
server.3=bigdata3:2888:3888

You also need to create a myid file under the ZooKeeper dataDir on bigdata1, bigdata2 and bigdata3.
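
For example, matching the server.N entries in zoo.cfg above (run each command on the corresponding host, and create the dataDir first if it does not exist yet):

[hadoop@bigdata1 ~]$ echo 1 > /home/hadoop/data/zookeeper/myid
[hadoop@bigdata2 ~]$ echo 2 > /home/hadoop/data/zookeeper/myid
[hadoop@bigdata3 ~]$ echo 3 > /home/hadoop/data/zookeeper/myid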

Deployment

First, the following steps need to be performed in order (a combined command sketch follows the list):

  1. Start a journalnode on bigdata1, bigdata2 and bigdata3.

  2. On the existing namenode node (bigdata1), run $ hdfs namenode -initializeSharedEdits

     Because an existing non-HA namenode is being converted to an HA namenode, the journalnodes need to be initialized with the existing edit log data.

  3. Start the namenode on bigdata1 (required by step 4): run $ $HADOOP_HOME/sbin/hadoop-daemon.sh start namenode

  4. On the standby namenode node (bigdata2), run hdfs namenode -bootstrapStandby to format it.
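
Putting the four steps together as commands (the journalnode start command in step 1 is assumed to use the same hadoop-daemon.sh script as step 3):

# step 1: on bigdata1, bigdata2 and bigdata3
$ $HADOOP_HOME/sbin/hadoop-daemon.sh start journalnode
# step 2: on bigdata1
$ hdfs namenode -initializeSharedEdits
# step 3: on bigdata1
$ $HADOOP_HOME/sbin/hadoop-daemon.sh start namenode
# step 4: on bigdata2
$ hdfs namenode -bootstrapStandby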

Next, start the remaining services (there is no required start order).

Rather than list every start command in full, a rough sketch is given below; the services finally running on each node are shown after it.
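
A rough sketch of those commands, assuming a ZooKeeper installation under $ZOOKEEPER_HOME and the same Hadoop 2.x sbin scripts used above (before the very first ZKFC start, the HA state in ZooKeeper also has to be initialized once with hdfs zkfc -formatZK, as described in the HDFS High Availability guide):

# on bigdata1, bigdata2 and bigdata3
$ $ZOOKEEPER_HOME/bin/zkServer.sh start
$ $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
$ $HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager
# on bigdata2 (the namenode on bigdata1 is already running after step 3)
$ $HADOOP_HOME/sbin/hadoop-daemon.sh start namenode
# on bigdata1 and bigdata2
$ $HADOOP_HOME/sbin/hadoop-daemon.sh start zkfc
$ $HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager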

[hadoop@bigdata1 ~]$ jps
8160 Jps
4451 QuorumPeerMain
4563 DFSZKFailoverController
5861 JournalNode
7977 NameNode
7260 DataNode
5341 NodeManager
8093 ResourceManager

[hadoop@bigdata2 ~]$ jps
3986 QuorumPeerMain
6658 Jps
4068 DFSZKFailoverController
4662 JournalNode
4409 ResourceManager
4476 NodeManager
5677 NameNode
5853 DataNode

[hadoop@bigdata3 ~]$ jps
4499 Jps
3543 QuorumPeerMain
3767 NodeManager
3931 JournalNode
4108 DataNode

Note: copy the SSH private key file id_rsa into the /home/hadoop/.ssh/ directory on the nodes running ZKFC (bigdata1 and bigdata2), because ZKFC is the process that executes sshfence.
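
A minimal sketch of that copy, assuming the key pair was generated on bigdata1 (so id_rsa is already in place there) and only needs to be copied to bigdata2:

[hadoop@bigdata1 ~]$ scp ~/.ssh/id_rsa hadoop@bigdata2:/home/hadoop/.ssh/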

Further reading:

A Guide to Checkpointing in Hadoop

It describes the checkpoint process when a standby namenode is in use.