Closed by dennishuo 8 years ago
Release tests on November 12th showed the Ambari deployment working, so this points to an upstream regression that rolled out sometime after that date.
@pmkc dug up https://issues.apache.org/jira/browse/AMBARI-8064 as the likely culprit, which adds, among other things, the following:
+ <property>
+   <name>yarn.scheduler.capacity.root.accessible-node-labels.default.capacity</name>
+   <value>-1</value>
+   <description></description>
+ </property>
+ <property>
+   <name>yarn.scheduler.capacity.root.accessible-node-labels.default.maximum-capacity</name>
+   <value>-1</value>
+   <description></description>
+ </property>
We've confirmed that these entries do appear in /etc/hadoop/conf/capacity-scheduler.xml on the partially-deployed clusters. The timeline also fits: the change was rolled into Ambari 1.7.0 for its HDP 2.2 deployment (which is what bdutil deploys by default at the moment) and went live sometime in mid-November. Also relevant: https://issues.apache.org/jira/browse/AMBARI-13232 removes the problematic config entries, but it targets Ambari 2.1.2, which bdutil unfortunately does not use yet.
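For anyone who wants to repeat that check on their own cluster, here is a minimal sketch that scans a Hadoop-style config file for the two entries. The function name find_suspect_properties is made up for illustration; the property names and the file path are the ones quoted above.

```python
import xml.etree.ElementTree as ET

# The two entries added by AMBARI-8064, as quoted in this thread.
SUSPECT_PROPERTIES = (
    "yarn.scheduler.capacity.root.accessible-node-labels.default.capacity",
    "yarn.scheduler.capacity.root.accessible-node-labels.default.maximum-capacity",
)


def find_suspect_properties(xml_text):
    """Return {name: value} for each suspect property found in the config XML."""
    root = ET.fromstring(xml_text)
    found = {}
    # Hadoop config files are a flat list of <property> elements,
    # each holding a <name> and a <value> child.
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name in SUSPECT_PROPERTIES:
            found[name] = (prop.findtext("value") or "").strip()
    return found
```

On a cluster node, reading /etc/hadoop/conf/capacity-scheduler.xml into a string and passing it to find_suspect_properties should return both names mapped to "-1" if the deployment is affected, and an empty dict otherwise.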
The workaround should be straightforward: override the problematic config entries inside the configuration.json packaged with bdutil.
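A rough sketch of what such an override could look like, done as a small script rather than by hand. Several things here are assumptions rather than facts from this thread: that configuration.json follows the blueprint-style layout where "configurations" is a list of single-key objects mapping a config type to a flat dict of properties, that the relevant config type is named "capacity-scheduler", and the helper name override_config. The replacement values are deliberately left to the caller, since the right values are not established here.

```python
import copy


def override_config(blueprint, config_type, overrides):
    """Return a copy of blueprint with overrides merged into config_type.

    Assumes blueprint["configurations"] is a list of dicts shaped like
    {"<config-type>": {"<property>": "<value>", ...}} (an assumption about
    the file layout, not something confirmed in this thread).
    """
    result = copy.deepcopy(blueprint)  # leave the caller's data untouched
    configs = result.setdefault("configurations", [])
    for entry in configs:
        if config_type in entry:
            # Config type already present: overwrite or add the given keys.
            entry[config_type].update(overrides)
            break
    else:
        # Config type not present yet: append a new entry for it.
        configs.append({config_type: dict(overrides)})
    return result
```

Usage would be to json.load the packaged configuration.json, call override_config(data, "capacity-scheduler", {...}) with the two property names from AMBARI-8064 and whatever values turn out to be valid for the deployed YARN version, then json.dump the result back. If the file nests properties under an extra "properties" key, the merge target needs adjusting accordingly.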
Currently, "./bdutil -e ambari deploy" fails about 10 minutes into the Ambari installation step with something like:
Inside the debuginfo.txt that bdutil prints out, you may find something like:
Logging into the Ambari GUI and clicking on failed operations shows the ResourceManager failed to come up; digging up ResourceManager startup logs shows something like: