HA configuration performs incorrectly

wolf31o2 commented 6 years ago

Problem: I am running HDFS within my Mesos cluster. It is fully HA. I have configured a matcher to point to both NameNodes. However, when the first listed NameNode is in standby mode, the standby_namenode is never used.

Expected behavior: The connection to the NameNode configured as namenode succeeds, the plugin finds that it is in standby mode, and it then attempts to send to standby_namenode, which is now the active NameNode.

Actual results:

2018-06-12 19:28:48 +0000 [warn]: #0 [out_webhdfs] webhdfs check request failed. (namenode: name-0-node.hdfs.mesos:9002, error: {"RemoteException":{"exception":"StandbyException","javaClassName":"org.apache.hadoop.ipc.StandbyException","message":"Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error"}})

This is using td-agent 3.1.1 (fluentd 1.0.2) with the shipped fluent-plugin-webhdfs 1.2.2 plugin.

Forcing a NameNode failover caused logs to start flowing again. However, this required manual intervention, and I think the driver should do the correct thing in this state.

wolf31o2 commented 6 years ago

Digging around, I ran into the failures_before_use_standby setting. It looks like is_standby_exception isn't detecting the exception correctly.

repeatedly commented 6 years ago

Here is the check: https://github.com/fluent/fluent-plugin-webhdfs/blob/9db1728f842e373c6c5ca081929a37f2b98c29aa/lib/fluent/plugin/out_webhdfs.rb#L259 So the problem is that the StandbyException is raised as something other than WebHDFS::IOError?

wolf31o2 commented 6 years ago

Correct.

2018-07-20 18:44:27 +0000 [warn]: #0 [out_webhdfs] webhdfs check request failed. (namenode: name-1-node.hdfs.mesos:9002, error: {"RemoteException":{"exception":"StandbyException","javaClassName":"org.apache.hadoop.ipc.StandbyException","message":"Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error"}})
2018-07-20 18:44:27 +0000 [warn]: #0 [out_webhdfs] failed to flush the buffer. retry_time=9 next_retry_seconds=2018-07-20 18:44:27 +0000 chunk="57170302677042b73fcd566e929f9311" error_class=WebHDFS::ServerError error="{\"RemoteException\":{\"exception\":\"ArrayIndexOutOfBoundsException\",\"javaClassName\":\"java.lang.ArrayIndexOutOfBoundsException\",\"message\":null}}"
  2018-07-20 18:44:27 +0000 [warn]: #0 suppressed same stacktrace
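
For illustration, here is a minimal sketch of a check that does not depend on the Ruby exception class. It assumes the error message carries the RemoteException JSON shown in the logs above. The WebHDFSStub module and the standby_exception? name are hypothetical stand-ins so the snippet runs without the webhdfs gem; this is not the plugin's actual code.

```ruby
require 'json'

# Hypothetical stand-ins for the webhdfs gem's error classes, so this
# sketch runs without the gem installed.
module WebHDFSStub
  class Error < StandardError; end
  class IOError < Error; end
  class ServerError < Error; end
end

STANDBY_CLASS = 'org.apache.hadoop.ipc.StandbyException'

# Returns true when the error message carries a RemoteException whose
# javaClassName is StandbyException, whatever Ruby class it was raised as.
def standby_exception?(error)
  body = JSON.parse(error.message)
  remote = body.is_a?(Hash) ? body['RemoteException'] : nil
  return false unless remote.is_a?(Hash)

  remote['javaClassName'] == STANDBY_CLASS
rescue JSON::ParserError
  # Message is not JSON: fall back to a plain substring match.
  error.message.include?(STANDBY_CLASS)
end

standby_body = '{"RemoteException":{"exception":"StandbyException",' \
               '"javaClassName":"org.apache.hadoop.ipc.StandbyException",' \
               '"message":"Operation category READ is not supported in state standby."}}'

puts standby_exception?(WebHDFSStub::ServerError.new(standby_body))
puts standby_exception?(WebHDFSStub::IOError.new(standby_body))
```

Matching on the response body rather than on the exception class means the failover decision would no longer hinge on which error class the client library picks for a given HTTP status. Whether that is the right fix for the plugin is a separate question.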

wolf31o2 commented 6 years ago

Versions:

fluent-plugin-webhdfs (1.2.3)
webhdfs (0.8.0)

fluent / fluent-plugin-webhdfs

HA configuration performs incorrectly #67