Open wolf31o2 opened 6 years ago
Digging around, I ran into the failures_before_use_standby setting. It looks like is_standby_exception isn't detecting the exception correctly.
Here is a check: https://github.com/fluent/fluent-plugin-webhdfs/blob/9db1728f842e373c6c5ca081929a37f2b98c29aa/lib/fluent/plugin/out_webhdfs.rb#L259
So the problem is that the StandbyException arrives as an error other than WebHDFS::IOError?
Correct.
2018-07-20 18:44:27 +0000 [warn]: #0 [out_webhdfs] webhdfs check request failed. (namenode: name-1-node.hdfs.mesos:9002, error: {"RemoteException":{"exception":"StandbyException","javaClassName":"org.apache.hadoop.ipc.StandbyException","message":"Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error"}})
2018-07-20 18:44:27 +0000 [warn]: #0 [out_webhdfs] failed to flush the buffer. retry_time=9 next_retry_seconds=2018-07-20 18:44:27 +0000 chunk="57170302677042b73fcd566e929f9311" error_class=WebHDFS::ServerError error="{\"RemoteException\":{\"exception\":\"ArrayIndexOutOfBoundsException\",\"javaClassName\":\"java.lang.ArrayIndexOutOfBoundsException\",\"message\":null}}"
2018-07-20 18:44:27 +0000 [warn]: #0 suppressed same stacktrace
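For illustration, here is a minimal sketch of a standby check that looks only at the RemoteException payload and ignores the Ruby error class. The helper name standby_error? and the constant STANDBY_CLASS are hypothetical and are not part of fluent-plugin-webhdfs or the webhdfs gem; plain StandardError objects stand in for the real WebHDFS error classes. Note that, under this sketch, the check-request payload logged above is recognized as a standby error, while the ArrayIndexOutOfBoundsException from the flush is not.

```ruby
require 'json'

# Hypothetical helper, not the plugin's actual method: decide whether an
# error carries a Hadoop StandbyException by inspecting the message body,
# regardless of which Ruby error class wraps it.
STANDBY_CLASS = 'org.apache.hadoop.ipc.StandbyException'

def standby_error?(error)
  payload = JSON.parse(error.message)
  remote = payload.is_a?(Hash) ? payload['RemoteException'] : nil
  return false unless remote.is_a?(Hash)

  remote['javaClassName'] == STANDBY_CLASS || remote['exception'] == 'StandbyException'
rescue JSON::ParserError
  # The message is not JSON: fall back to a plain substring match.
  error.message.include?(STANDBY_CLASS)
end

# Payloads shaped like the two log lines above (message text shortened).
check_failure = StandardError.new(
  '{"RemoteException":{"exception":"StandbyException",' \
  '"javaClassName":"org.apache.hadoop.ipc.StandbyException",' \
  '"message":"Operation category READ is not supported in state standby."}}'
)
flush_failure = StandardError.new(
  '{"RemoteException":{"exception":"ArrayIndexOutOfBoundsException",' \
  '"javaClassName":"java.lang.ArrayIndexOutOfBoundsException","message":null}}'
)

puts standby_error?(check_failure) # true
puts standby_error?(flush_failure) # false
```

Matching on the payload rather than on the error class is only one possible design; whether the flush failure should also trigger a failover is a separate question that this sketch does not answer.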
Versions:
fluent-plugin-webhdfs (1.2.3)
webhdfs (0.8.0)
Problem: I am running HDFS within my Mesos cluster. It is fully HA. I have configured a matcher to point to both NameNodes. However, when the first listed NameNode is in standby mode, the standby_namenode is never used.
Expected behavior: Connection to the NameNode configured as namenode succeeds, the plugin finds that it is in standby mode, and it attempts to send to standby_namenode, which is now the active NameNode.
Actual results:
This is using td-agent 3.1.1 (fluentd 1.0.2) with the shipped fluent-plugin-webhdfs 1.2.2 plugin.
Forcing a NameNode failover caused logs to start flowing again. However, this required manual intervention, and I think the driver should do the correct thing in this state.
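The expected behavior can be sketched roughly as follows. This is a hypothetical illustration, not the plugin's code: the class NameNodeSelector, its methods, and the default threshold of 1 are all assumptions, and the host name-0-node.hdfs.mesos:9002 is made up for the example (only name-1-node.hdfs.mesos:9002 appears in the logs).

```ruby
# Hypothetical sketch of the expected failover, not the plugin's code.
class NameNodeSelector
  attr_reader :active

  def initialize(primary, standby, failures_before_use_standby: 1)
    @active = primary
    @other = standby
    @threshold = failures_before_use_standby
    @failures = 0
  end

  # Record a failed request against the active endpoint. Once the number
  # of consecutive failures reaches the threshold, swap the two endpoints.
  # Returns the endpoint to use for the next request.
  def record_failure
    @failures += 1
    return @active if @failures < @threshold

    @active, @other = @other, @active
    @failures = 0
    @active
  end

  # A successful request resets the consecutive-failure counter.
  def record_success
    @failures = 0
  end
end

selector = NameNodeSelector.new(
  'name-1-node.hdfs.mesos:9002',
  'name-0-node.hdfs.mesos:9002', # hypothetical second NameNode
  failures_before_use_standby: 2
)
selector.record_failure        # first failure: stay on the primary
puts selector.record_failure   # second failure: switch to the other NameNode
```

With a selector like this, a standby response from the first NameNode would count as a failure and eventually move traffic to the other one without a manual failover.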