Open GowthamShanmugam opened 5 years ago
I ran across this same symptom and could not recover from the error, either by downgrading to Ansible 2.7 or by trying to unmanage the cluster. The cluster is stuck in an error state.
Does the /tmp directory have execute permission? If not, please remount /tmp with exec permission: https://askubuntu.com/questions/311438/how-to-make-tmp-executable
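One way to check this on each node is to look at the mount options for /tmp. The following is a minimal Python 3 sketch, not part of Tendrl; the function name `tmp_is_noexec` is hypothetical, and it assumes /tmp is a separate mount listed in /proc/mounts. If /tmp is not a separate mount, it inherits the options of its parent filesystem and the function simply reports False.

```python
def tmp_is_noexec(mounts_text, mount_point="/tmp"):
    """Return True if mount_point is listed with the 'noexec' option.

    mounts_text is expected in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    noexec = False
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # A later entry for the same mount point overrides an earlier one.
            noexec = "noexec" in fields[3].split(",")
    return noexec


# Example use on a node (reads the live mount table):
#     with open("/proc/mounts") as handle:
#         print(tmp_is_noexec(handle.read()))
```

If the function returns True, /tmp is mounted without exec permission and would need the remount described in the linked answer.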
Yes, it does. On all nodes
Please unmanage the cluster and wait until all the nodes are detected by the Tendrl server. Start the import once all the nodes are listed with their FQDNs.
The unmanage function is not working either. Nothing happens and I can't check on its progress.
Oh, OK, you already mentioned that unmanage is not working, sorry :). Please check the log file /var/log/messages. I think each sync may write the error to the log file.
If nothing works, I will help you solve this problem in a remote call.
I tried to clear everything by deleting my etcd partition, then installed Tendrl again, and this is what I get now:
Child jobs failed are [u'02239281-7b28-4cec-8d11-7c9f6b82fca1']
Failed atom: tendrl.objects.Cluster.atoms.ImportCluster on flow: Import existing Gluster Cluster
Failure in Job fdd906f4-4089-4e0b-9680-4f88aea55963 Flow tendrl.flows.ImportCluster with error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 240, in process_job
    the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 131, in run
    exc_traceback)
FlowExecutionFailedError: ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 98, in run\n super(ImportCluster, self).run()\n', ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/__init__.py", line 186, in run\n (atom_fqn, self._defs[\'help\'])\n', 'AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.ImportCluster on flow: Import existing Gluster Cluster\n']
Failure in Job 02239281-7b28-4cec-8d11-7c9f6b82fca1 Flow tendrl.flows.ImportCluster with error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 240, in process_job
    the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 131, in run
    exc_traceback)
FlowExecutionFailedError: ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 98, in run\n super(ImportCluster, self).run()\n', ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/__init__.py", line 213, in run\n ret_val = self._execute_atom(atom_fqn)\n', ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/__init__.py", line 252, in _execute_atom\n parameters=self.parameters\n', ' File "/usr/lib/python2.7/site-packages/tendrl/commons/objects/cluster/atoms/configure_monitoring/__init__.py", line 110, in run\n "interface": self.get_node_interface(NS.node_context.fqdn),\n', ' File "/usr/lib/python2.7/site-packages/tendrl/commons/objects/cluster/atoms/configure_monitoring/__init__.py", line 80, in get_node_interface\n ip = socket.gethostbyname(fqdn)\n', 'TypeError: must be string, not None\n']
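The last frame of the second traceback shows `socket.gethostbyname(fqdn)` being called with `None`, which is what raises the TypeError. As an illustration only, and not Tendrl's actual code, here is a minimal Python 3 sketch of a guard that fails with a clearer message when the FQDN has not been populated. The function name `resolve_node_ip` and the injectable `resolver` parameter are hypothetical.

```python
import socket


def resolve_node_ip(fqdn, resolver=socket.gethostbyname):
    """Resolve a node's FQDN to an IP address, rejecting a missing FQDN early.

    resolver is injectable so the lookup can be replaced in tests;
    by default the standard library's socket.gethostbyname is used.
    """
    if not fqdn:
        # Covers both None and an empty string.
        raise ValueError("node FQDN is not populated; cannot resolve an IP address")
    return resolver(fqdn)
```

With a guard like this, the failure would name the missing FQDN directly instead of surfacing as a TypeError from the socket module.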
It clearly says that the node's FQDN is not populated; there is some problem with the node_context details sync. What are the versions of the tendrl rpms and ansible? Run these commands on the server as well as on the storage nodes:
rpm -qa | grep tendrl
rpm -qa | grep ansible
Node-01: tendrl-collectd-selinux-1.5.4-2.el7.centos.noarch tendrl-selinux-1.5.4-2.el7.centos.noarch tendrl-commons-1.6.3-11.el7.noarch tendrl-gluster-integration-1.6.3-10.el7.noarch tendrl-node-agent-1.6.3-9.el7.noarch
ansible-2.5.3-1.el7.noarch
Node-02: tendrl-collectd-selinux-1.5.4-2.el7.centos.noarch tendrl-selinux-1.5.4-2.el7.centos.noarch tendrl-node-agent-1.6.3-9.el7.noarch tendrl-gluster-integration-1.6.3-10.el7.noarch tendrl-commons-1.6.3-11.el7.noarch
centos-release-ansible26-1-3.el7.centos.noarch ansible-2.8.0-2.el7.noarch
Node-03: tendrl-collectd-selinux-1.5.4-2.el7.centos.noarch tendrl-gluster-integration-1.6.3-10.el7.noarch tendrl-selinux-1.5.4-2.el7.centos.noarch tendrl-node-agent-1.6.3-9.el7.noarch tendrl-commons-1.6.3-11.el7.noarch
ansible-2.8.0-2.el7.noarch
node-remote: tendrl-notifier-1.6.3-4.el7.noarch tendrl-ansible-1.6.3-2.el7.centos.noarch tendrl-monitoring-integration-1.6.3-11.el7.noarch tendrl-grafana-selinux-1.5.4-2.el7.centos.noarch tendrl-collectd-selinux-1.5.4-2.el7.centos.noarch tendrl-gluster-integration-1.6.3-10.el7.noarch tendrl-selinux-1.5.4-2.el7.centos.noarch tendrl-commons-1.6.3-11.el7.noarch tendrl-api-1.6.3-7.el7.noarch tendrl-api-httpd-1.6.3-7.el7.noarch tendrl-node-agent-1.6.3-9.el7.noarch tendrl-ui-1.6.3-10.el7.noarch tendrl-grafana-plugins-1.6.3-11.el7.noarch
tendrl-ansible-1.6.3-2.el7.centos.noarch ansible-2.8.0-2.el7.noarch centos-release-ansible26-1-3.el7.centos.noarch
Note that the running versions may differ, as every node reports ansible 2.7.0. For example:
ansible --version
ansible 2.7.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Apr 9 2019, 14:30:50) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]
Another symptom: Tendrl mentions 4 hosts discovered in the cluster, but when I try to view these hosts I get a table with only 3 hosts. The Tendrl server, which is a geo-rep host for the cluster, is missing. This may be related, as the 3 hosts being shown have correct names and IPs.
Ah! I found the problem: the upstream release does not yet include the Ansible 2.8 fix; it is still only in the master repo.
Here, except for Node-01, all the nodes have Ansible 2.8. The version should be lower than 2.8 and higher than 2.5 on every node (including the tendrl-server).
After downgrading, restart the tendrl-node-agent service on the nodes as well as on the server.
Note: Don't install any gluster packages on the tendrl-server.
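To spot which nodes need the downgrade, the `rpm -qa | grep ansible` output posted earlier can be checked against that version range. The following is a minimal Python 3 sketch; the function names are hypothetical, and the bounds are an assumption read from the advice above: 2.5.x is treated as acceptable because Node-01 (2.5.3) was not flagged, and 2.8 and later are rejected.

```python
import re

# Assumed bounds: accept 2.5.x up to, but not including, 2.8.
MIN_VERSION = (2, 5)
MAX_VERSION = (2, 8)


def parse_ansible_version(package):
    """Return (major, minor, patch) for an rpm name like 'ansible-2.8.0-2.el7.noarch'.

    Returns None for other packages, such as 'tendrl-ansible-...' or
    'centos-release-ansible26-...', which also match 'grep ansible'.
    """
    match = re.match(r"^ansible-(\d+)\.(\d+)\.(\d+)-", package)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_supported(version):
    """Check the (major, minor) part of a version against the assumed bounds."""
    return MIN_VERSION <= version[:2] < MAX_VERSION


def nodes_to_downgrade(packages_by_node):
    """Return the sorted names of nodes whose ansible rpm is outside the range."""
    flagged = []
    for node, packages in packages_by_node.items():
        versions = [parse_ansible_version(name) for name in packages]
        if any(v is not None and not is_supported(v) for v in versions):
            flagged.append(node)
    return sorted(flagged)


# Package names taken from the output posted in this thread.
reported = {
    "Node-01": ["ansible-2.5.3-1.el7.noarch"],
    "Node-02": ["centos-release-ansible26-1-3.el7.centos.noarch",
                "ansible-2.8.0-2.el7.noarch"],
    "Node-03": ["ansible-2.8.0-2.el7.noarch"],
    "node-remote": ["tendrl-ansible-1.6.3-2.el7.centos.noarch",
                    "ansible-2.8.0-2.el7.noarch",
                    "centos-release-ansible26-1-3.el7.centos.noarch"],
}
print(nodes_to_downgrade(reported))
```

This only inspects the installed rpm names. As noted earlier in the thread, `ansible --version` may report a different running version, so it is worth checking both.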
With the Beta 1 build of Ansible 2.8, it's not possible to import a Gluster Trusted Storage Pool into Tendrl, as the Cluster Import task fails with the error:
Node doesn't have network details populated