Tendrl / commons

Common code usable by all Tendrl components
http://www.tendrl.org
GNU Lesser General Public License v2.1

Import Cluster fails on "node X doesn't have network details populated" with Ansible 2.8 #1084

Open GowthamShanmugam opened 5 years ago

GowthamShanmugam commented 5 years ago

With the Beta 1 build of Ansible 2.8, it is not possible to import a Gluster Trusted Storage Pool into Tendrl, because the Cluster Import task fails with this error:

Node doesn't have network details populated

SalsaBr commented 5 years ago

I ran into the same symptom and could not recover from the error, either by downgrading to Ansible 2.7 or by trying to unmanage the cluster. The cluster is stuck in an error state.

GowthamShanmugam commented 5 years ago

Does the /tmp directory have execute permission? If not, please remount /tmp with the exec option: https://askubuntu.com/questions/311438/how-to-make-tmp-executable
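One way to check the /tmp question above on a Linux node is to look at the mount options recorded in /proc/mounts. The following is only a minimal sketch: the helper names `mount_options` and `is_noexec` and the sample mount table are made up for illustration and are not part of Tendrl.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point, or None if it is not a separate mount.

    mounts_text is expected in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Keep the last matching entry: later mounts shadow earlier ones.
            options = fields[3].split(",")
    return options


def is_noexec(mounts_text, mount_point="/tmp"):
    """True only if mount_point is its own mount and carries the noexec option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "noexec" in options


# Hypothetical mount table, used here only so the sketch runs anywhere.
SAMPLE_MOUNTS = """\
/dev/sda1 / xfs rw,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,noexec 0 0
"""

print(is_noexec(SAMPLE_MOUNTS))
```

On a real node the text would come from `open("/proc/mounts").read()`. If /tmp is not a separate mount, this sketch returns False without inspecting the parent filesystem, so a noexec root filesystem would need a separate check.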

SalsaBr commented 5 years ago

Yes, it does, on all nodes.

GowthamShanmugam commented 5 years ago

Please unmanage the cluster and wait until all the nodes are detected by the Tendrl server. Start the import once all the nodes are listed with their FQDNs.

SalsaBr commented 5 years ago

The unmanage function is not working either. Nothing happens and I can't check on its progress.

GowthamShanmugam commented 5 years ago

Oh, OK, you already mentioned that unmanage is not working, sorry :). Please check the log file /var/log/messages. I think the error may be logged there on each sync.

GowthamShanmugam commented 5 years ago

If nothing works, I can help you solve this problem on a remote call.

SalsaBr commented 5 years ago

I tried to clear everything by deleting my etcd partition, then installed Tendrl again, and this is what I get now:

Child jobs failed are [u'02239281-7b28-4cec-8d11-7c9f6b82fca1']

Failed atom: tendrl.objects.Cluster.atoms.ImportCluster on flow: Import existing Gluster Cluster

Failure in Job fdd906f4-4089-4e0b-9680-4f88aea55963 Flow tendrl.flows.ImportCluster with error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 240, in process_job
    the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 131, in run
    exc_traceback)
FlowExecutionFailedError: ['Traceback (most recent call last):\n',
 ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 98, in run\n super(ImportCluster, self).run()\n',
 ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/__init__.py", line 186, in run\n (atom_fqn, self._defs[\'help\'])\n',
 'AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.ImportCluster on flow: Import existing Gluster Cluster\n']

Failure in Job 02239281-7b28-4cec-8d11-7c9f6b82fca1 Flow tendrl.flows.ImportCluster with error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 240, in process_job
    the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 131, in run
    exc_traceback)
FlowExecutionFailedError: ['Traceback (most recent call last):\n',
 ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 98, in run\n super(ImportCluster, self).run()\n',
 ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/__init__.py", line 213, in run\n ret_val = self._execute_atom(atom_fqn)\n',
 ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/__init__.py", line 252, in _execute_atom\n parameters=self.parameters\n',
 ' File "/usr/lib/python2.7/site-packages/tendrl/commons/objects/cluster/atoms/configure_monitoring/__init__.py", line 110, in run\n "interface": self.get_node_interface(NS.node_context.fqdn),\n',
 ' File "/usr/lib/python2.7/site-packages/tendrl/commons/objects/cluster/atoms/configure_monitoring/__init__.py", line 80, in get_node_interface\n ip = socket.gethostbyname(fqdn)\n',
 'TypeError: must be string, not None\n']

GowthamShanmugam commented 5 years ago

It clearly shows that the FQDN of the node is not populated, so there is some problem with the node_context details sync. Which versions of the tendrl RPMs and ansible are installed? Please run these commands on the server as well as on the storage nodes:

rpm -qa | grep tendrl
rpm -qa | grep ansible
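The TypeError at the end of the traceback comes from passing None to `socket.gethostbyname`. The sketch below reproduces that failure in isolation and shows one possible guard that reports the missing FQDN directly. The helper name `resolve_fqdn` is hypothetical and is not how Tendrl handles this; the code is written for Python 3, where the message text differs from the Python 2.7 one in the log.

```python
import socket


def resolve_fqdn(fqdn):
    """Resolve fqdn to an IPv4 address, failing early if the FQDN is unset."""
    if not fqdn:
        # Hypothetical guard: surface the real cause instead of a bare TypeError.
        raise ValueError("node FQDN is not populated; cannot resolve an address")
    return socket.gethostbyname(fqdn)


# Unguarded call: the argument check fails before any lookup is attempted.
try:
    socket.gethostbyname(None)
except TypeError as exc:
    print("unguarded call failed: %s" % exc)

# Guarded call: the error names the missing value.
try:
    resolve_fqdn(None)
except ValueError as exc:
    print("guarded call failed: %s" % exc)
```

A guard like this does not fix the underlying node_context sync problem; it only makes the failure easier to read.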

SalsaBr commented 5 years ago

Node-01:
tendrl-collectd-selinux-1.5.4-2.el7.centos.noarch
tendrl-selinux-1.5.4-2.el7.centos.noarch
tendrl-commons-1.6.3-11.el7.noarch
tendrl-gluster-integration-1.6.3-10.el7.noarch
tendrl-node-agent-1.6.3-9.el7.noarch

ansible-2.5.3-1.el7.noarch

Node-02:
tendrl-collectd-selinux-1.5.4-2.el7.centos.noarch
tendrl-selinux-1.5.4-2.el7.centos.noarch
tendrl-node-agent-1.6.3-9.el7.noarch
tendrl-gluster-integration-1.6.3-10.el7.noarch
tendrl-commons-1.6.3-11.el7.noarch

centos-release-ansible26-1-3.el7.centos.noarch
ansible-2.8.0-2.el7.noarch

Node-03:
tendrl-collectd-selinux-1.5.4-2.el7.centos.noarch
tendrl-gluster-integration-1.6.3-10.el7.noarch
tendrl-selinux-1.5.4-2.el7.centos.noarch
tendrl-node-agent-1.6.3-9.el7.noarch
tendrl-commons-1.6.3-11.el7.noarch

ansible-2.8.0-2.el7.noarch

node-remote:
tendrl-notifier-1.6.3-4.el7.noarch
tendrl-ansible-1.6.3-2.el7.centos.noarch
tendrl-monitoring-integration-1.6.3-11.el7.noarch
tendrl-grafana-selinux-1.5.4-2.el7.centos.noarch
tendrl-collectd-selinux-1.5.4-2.el7.centos.noarch
tendrl-gluster-integration-1.6.3-10.el7.noarch
tendrl-selinux-1.5.4-2.el7.centos.noarch
tendrl-commons-1.6.3-11.el7.noarch
tendrl-api-1.6.3-7.el7.noarch
tendrl-api-httpd-1.6.3-7.el7.noarch
tendrl-node-agent-1.6.3-9.el7.noarch
tendrl-ui-1.6.3-10.el7.noarch
tendrl-grafana-plugins-1.6.3-11.el7.noarch

tendrl-ansible-1.6.3-2.el7.centos.noarch
ansible-2.8.0-2.el7.noarch
centos-release-ansible26-1-3.el7.centos.noarch

Note that the running versions may differ from the installed packages, as every node reports ansible 2.7.0. For example:

ansible --version
ansible 2.7.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Apr 9 2019, 14:30:50) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]

Another symptom: Tendrl mentions 4 hosts discovered in the cluster, but when I try to view these hosts I get a table with only 3 hosts. The Tendrl server, which is a geo-rep host for the cluster, is missing. This may be related, as the 3 hosts being shown have correct names and IPs.

GowthamShanmugam commented 5 years ago

Ah! I found the problem: the upstream release does not yet include the Ansible 2.8 fix; it is only in the master repo so far.

Here, every node except node1 has Ansible 2.8. The version should be lower than 2.8 and higher than 2.5 on every node (including the tendrl-server).

After downgrading, restart the tendrl-node-agent service on the nodes as well as on the server.

Note: Don't install any gluster packages on the tendrl-server.
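The version constraint above (higher than 2.5, lower than 2.8) can be checked mechanically against the version strings reported in this thread. This is only a rough sketch: the function names are hypothetical, and the bounds are read as tuple comparisons, so 2.5.3 counts as higher than 2.5 (consistent with node1 being treated as fine) while 2.8.0 does not count as lower than 2.8.

```python
import re


def parse_version(text):
    """Return the leading dotted numeric part of a version string as a tuple of ints."""
    match = re.match(r"(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.group(1).split("."))


def ansible_version_supported(text, lower=(2, 5), upper=(2, 8)):
    """True if lower < version < upper, using Python tuple ordering.

    The default bounds are an interpretation of the comment above,
    not a rule taken from the Tendrl code base.
    """
    return lower < parse_version(text) < upper


# Version strings taken from the package listings and ansible --version output above.
for reported in ("2.5.3-1.el7", "2.7.0", "2.8.0-2.el7"):
    print(reported, ansible_version_supported(reported))
```

Because the installed RPM version and the version reported by `ansible --version` differed in this thread, it is probably worth running such a check against both.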