argusiot commented 2 years ago

HBase on Parag's system is unstable, and the cause is not yet known. Cursory research suggests the timers need additional tuning based on RAM size.

Using this bug to gather information.

Configured memory (from /proc/meminfo): MemTotal: 98893700 kB

Swap size (default):

tsadmin@v-argusiot-2-1:/usr/share/hbase/hbase-1.4.13/conf$ free
          total      used      free  shared  buff/cache  available
Mem:   98893700   8648580  85314764    2812     4930356   89322996
Swap:   8388604         0   8388604

Logs from most recent hbase shutdown:

2022-04-21 14:34:24,230 WARN [M:0;v-argusiot-2-1:32863] util.Sleeper: We slept 17308ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

2022-04-21 14:34:24,549 WARN [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x18049fde7560000 has expired

2022-04-21 14:34:24,549 FATAL [main-EventThread] master.HMaster: Master server abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]

2022-04-21 14:34:24,549 FATAL [main-EventThread] master.HMaster: master:32863-0x18049fde7560000, quorum=localhost:2181, baseZNode=/hbase master:32863-0x18049fde7560000 received expired from ZooKeeper, aborting

2022-04-21 14:34:24,649 ERROR [main-SendThread(localhost:2181)] coordination.SplitLogManagerCoordination: ZK session expired. Master is expected to shut down. Abandoning retries for action=GetData from znode /hbase/splitWAL/WALs%2Fv-argusiot-2-1%2C36407%2C1650508819821-splitting%2Fv-argusiot-2-1%252C36407%252C1650508819821.meta.1650516035772.meta

2022-04-21 14:34:24,649 ERROR [main-SendThread(localhost:2181)] coordination.SplitLogManagerCoordination: ZK session expired. Master is expected to shut down. Abandoning retries for action=CreateRescan znode /hbase/splitWAL/RESCAN

2022-04-21 14:34:25,650 ERROR [M:0;v-argusiot-2-1:32863] zookeeper.ZooKeeperWatcher: master:32863-0x18049fde7560000, quorum=localhost:2181, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception

2022-04-21 14:34:25,650 ERROR [M:0;v-argusiot-2-1:32863] master.ActiveMasterManager: master:32863-0x18049fde7560000, quorum=localhost:2181, baseZNode=/hbase Error deleting our own master address node

2022-04-21 14:34:25,668 ERROR [M:0;v-argusiot-2-1:32863] server.ZooKeeperServer: ZKShutdownHandler is not registered, so ZooKeeper server won't take any action on ERROR or SHUTDOWN server state changes
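To check whether these long pauses recur before each abort, something like the following could scan the master log for `util.Sleeper` warnings. This is only a rough sketch: the regex is modeled on the single warning line quoted above, and the function name `find_long_pauses` and the 10000 ms overshoot threshold are made up for illustration, not HBase settings.

```python
import re

# Modeled on the util.Sleeper warning quoted in this issue; other HBase
# versions may word the message differently (assumption).
SLEEPER_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+WARN\s.*"
    r"util\.Sleeper: We slept (?P<actual>\d+)ms instead of (?P<expected>\d+)ms"
)


def find_long_pauses(lines, min_overshoot_ms=10000):
    """Return (timestamp, actual_ms, expected_ms) for every Sleeper warning
    whose overshoot (actual - expected) is at least min_overshoot_ms.

    min_overshoot_ms is a hypothetical threshold chosen for this sketch.
    """
    pauses = []
    for line in lines:
        match = SLEEPER_RE.match(line)
        if match is None:
            continue
        actual = int(match["actual"])
        expected = int(match["expected"])
        if actual - expected >= min_overshoot_ms:
            pauses.append((match["ts"], actual, expected))
    return pauses


# The two lines below are copied from the log in this issue.
sample_log = [
    "2022-04-21 14:34:24,230 WARN [M:0;v-argusiot-2-1:32863] util.Sleeper: "
    "We slept 17308ms instead of 3000ms, this is likely due to a long garbage "
    "collecting pause and it's usually bad",
    "2022-04-21 14:34:24,549 WARN [main-SendThread(localhost:2181)] "
    "zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, "
    "session 0x18049fde7560000 has expired",
]

long_pauses = find_long_pauses(sample_log)
for ts, actual, expected in long_pauses:
    print(f"{ts}: slept {actual} ms, expected {expected} ms")
```

In practice the `lines` argument would be an open file handle on the master log. Comparing the reported pause lengths with the effective ZooKeeper session timeout should show whether the pauses alone explain the session expiry.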

argusiot commented 2 years ago

Following the ZooKeeper tips in the HBase manual, added this tweak:

<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
<property>
  <name>hbase.zookeeper.property.tickTime</name>
  <value>6000</value>
</property>

Results from above tuning are yet to be determined.
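For reference on why the two values are changed together: as far as I understand, the ZooKeeper server clamps the timeout a client requests to the range [minSessionTimeout, maxSessionTimeout], which default to 2 x tickTime and 20 x tickTime. A small sketch of that arithmetic follows. The helper name is made up, and the 2000 ms "before" tickTime is an assumed common default, not a value read from this system.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms, min_ms=None, max_ms=None):
    """Clamp a requested session timeout the way a ZooKeeper server does by
    default: between 2 * tickTime and 20 * tickTime, unless explicit
    min/max session timeouts are configured.

    Hypothetical helper for illustration; not part of any HBase or ZooKeeper API.
    """
    lower = 2 * tick_time_ms if min_ms is None else min_ms
    upper = 20 * tick_time_ms if max_ms is None else max_ms
    return max(lower, min(requested_ms, upper))


# With the tuning above: tickTime 6000 ms allows up to 20 * 6000 = 120000 ms.
tuned = negotiated_session_timeout(requested_ms=120000, tick_time_ms=6000)

# With an assumed tickTime of 2000 ms, the same request would be capped.
capped = negotiated_session_timeout(requested_ms=120000, tick_time_ms=2000)

print(tuned, capped)
```

So raising `zookeeper.session.timeout` to 120000 without also raising the tick time would likely have had no effect beyond the 20 x tickTime cap.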

Note: We understand there are risks associated with using a higher timeout: it makes the system slower to detect failures. But that's a concern only in a multi-node deployment.

ArchanDasArgus commented 2 years ago

On my system, preventing the laptop from going to sleep fixes the issue in the localhost Docker container.

Hbase & Zookeeper tuning #29