Import failing but UI says Cluster is "Ready to use"

julienlim commented 6 years ago

Problem: Import fails but UI says Cluster is "Ready to use"

Environment: Installed using tendrl-vagrant on MacOS:

1 tendrl-server
3 gluster nodes (tendrl-node-1..tendrl-node-3)

Using tendrl-1.6.3-20180615

[vagrant@tendrl-server ~]$ rpm -qa | grep tendrl | sort

tendrl-ansible-1.6.3-20180615T095226.268f8a2.noarch
tendrl-api-1.6.3-20180530T164022.8308f00.noarch
tendrl-api-httpd-1.6.3-20180530T164022.8308f00.noarch
tendrl-commons-1.6.3-20180615T125547.069c634.noarch
tendrl-grafana-plugins-1.6.3-20180615T120423.a75aca4.noarch
tendrl-grafana-selinux-1.5.4-20180227T085901.984600c.noarch
tendrl-monitoring-integration-1.6.3-20180615T120423.a75aca4.noarch
tendrl-node-agent-1.6.3-20180615T125550.2642567.noarch
tendrl-notifier-1.6.3-20180614T113218.d4353f2.noarch
tendrl-selinux-1.5.4-20180227T085901.984600c.noarch
tendrl-ui-1.6.3-20180615T112029.1d0ad59.noarch

Observations

Tendrl-ansible completed successfully with no issues.
I am able to access the Tendrl UI with default credentials.
During Import Cluster, I specified a custom cluster name.
During the import cluster tasks, I see some things failing, but upon completion of the task, the UI says the cluster is Ready to Use.
When I click on Dashboard, I don't see the right dashboard: the integration piece is missing, and Grafana tells me the dashboard is missing.
I can see the Hosts in the cluster, but I cannot see the volumes (or bricks) for the cluster.
Per @anmolsachan, he has seen this issue before: when tendrl-monitoring-integration is not running, the cluster does not import properly. I've confirmed that tendrl-monitoring-integration is indeed not running on the tendrl server, and there appear to be no checks in our application to verify that all required services are running before the import.
When I try to start tendrl-monitoring-integration, it runs for a short while and then dies. systemd keeps trying to restart it, but each time it runs briefly and dies again.
Needless to say, Unmanage cluster also didn't work (I verified this).
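
The missing pre-import check mentioned above could look roughly like the following. This is only a minimal sketch, not Tendrl code: the function names and the REQUIRED_SERVICES list are hypothetical, and it assumes Python 3.5+ (for subprocess.run) on a host where `systemctl is-active` is available.

```python
import subprocess

# Hypothetical list of services the import depends on; only
# tendrl-monitoring-integration is actually named in this report.
REQUIRED_SERVICES = ["tendrl-monitoring-integration"]


def is_active(unit, run=subprocess.run):
    """Return True if `systemctl is-active --quiet UNIT` exits with 0."""
    result = run(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0


def inactive_services(units, check=is_active):
    """Return the units from `units` that are not currently active."""
    return [unit for unit in units if not check(unit)]
```

An import flow could call inactive_services(REQUIRED_SERVICES) and refuse to start if the list is non-empty. A single check is not sufficient for a service that is crash-looping, since it can report active for a few seconds between restarts, so a real check would likely need to sample more than once.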

Excerpt of "journalctl -u tendrl-monitoring-integration": see https://paste.fedoraproject.org/paste/JFnrkiQxtaxrhj1~dE9BEw

Potentially related to the following:

https://www.redhat.com/archives/tendrl-devel/2018-June/msg00008.html (similar to what's in the first screenshot below)
https://github.com/Tendrl/ui/issues/984

Per @mbukatov: "If the monitoring integration can get into state that it crashes, is restarted and it's not up again, it's either:

bug in systemd service file
bug in monitoring integration, which node agent wouldn't be able to resolve

That said, if we have this information included in the error message, it would be very useful."

Note: I've deployed twice now with tendrl-vagrant and hit the same problem both times.

Some screenshots:

Failure in Import Cluster Task:

Despite failure, UI shows Cluster is Ready to Use:

Dashboard not found:

@nthomas-redhat @r0h4n @shirshendu @gnehapk @Tendrl/qe @shtripat @GowthamShanmugam @anmolsachan

julienlim commented 6 years ago

Some additional information related to the tendrl-monitoring-integration:

[root@tendrl-server tendrl-ansible-1.6.3]# systemctl status tendrl-monitoring-integration

● tendrl-monitoring-integration.service - Monitoring Integration
   Loaded: loaded (/usr/lib/systemd/system/tendrl-monitoring-integration.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Fri 2018-06-15 23:36:39 UTC; 2 days ago
     Docs: https://github.com/Tendrl/monitoring-integration/tree/master/doc/source
  Process: 25512 ExecStart=/usr/bin/tendrl-monitoring-integration (code=exited, status=1/FAILURE)
 Main PID: 25512 (code=exited, status=1/FAILURE)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Trying to start tendrl-monitoring-integration

[root@tendrl-server tendrl-ansible-1.6.3]# systemctl start tendrl-monitoring-integration
[root@tendrl-server tendrl-ansible-1.6.3]# systemctl status tendrl-monitoring-integration
● tendrl-monitoring-integration.service - Monitoring Integration
   Loaded: loaded (/usr/lib/systemd/system/tendrl-monitoring-integration.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-06-18 02:04:57 UTC; 2s ago
     Docs: https://github.com/Tendrl/monitoring-integration/tree/master/doc/source
 Main PID: 23114 (tendrl-monitori)
   CGroup: /system.slice/tendrl-monitoring-integration.service
           └─23114 /usr/bin/python /usr/bin/tendrl-monitoring-integration

Jun 18 02:04:57 tendrl-server systemd[1]: Started Monitoring Integration.
Jun 18 02:04:57 tendrl-server systemd[1]: Starting Monitoring Integration…

[root@tendrl-server tendrl-ansible-1.6.3]# systemctl status tendrl-monitoring-integration
● tendrl-monitoring-integration.service - Monitoring Integration
   Loaded: loaded (/usr/lib/systemd/system/tendrl-monitoring-integration.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Mon 2018-06-18 02:05:07 UTC; 100ms ago
     Docs: https://github.com/Tendrl/monitoring-integration/tree/master/doc/source
  Process: 23147 ExecStart=/usr/bin/tendrl-monitoring-integration (code=exited, status=1/FAILURE)
 Main PID: 23147 (code=exited, status=1/FAILURE)

Jun 18 02:05:07 tendrl-server systemd[1]: tendrl-monitoring-integration.service: main process exited, code=exited, status=1/FAILURE
Jun 18 02:05:07 tendrl-server systemd[1]: Unit tendrl-monitoring-integration.service entered failed state.
Jun 18 02:05:07 tendrl-server systemd[1]: tendrl-monitoring-integration.service failed.
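
The three status outputs above show the unit moving between failed (start-limit), active (running), and activating (auto-restart). As a rough illustration of how that "Active:" line could be classified automatically, here is a minimal sketch; the function names are hypothetical, and it assumes the line has the same shape as in the output pasted above.

```python
import re

# Matches e.g. "Active: activating (auto-restart) (Result: exit-code) since ..."
ACTIVE_RE = re.compile(r"^\s*Active:\s+(\S+)((?:\s+\([^)]*\))*)")
PAREN_RE = re.compile(r"\(([^)]*)\)")


def parse_active_line(status_output):
    """Extract state, sub-state and result from `systemctl status` text."""
    for line in status_output.splitlines():
        match = ACTIVE_RE.match(line)
        if not match:
            continue
        info = {"state": match.group(1), "sub": None, "result": None}
        for item in PAREN_RE.findall(match.group(2)):
            if item.startswith("Result: "):
                info["result"] = item[len("Result: "):]
            else:
                info["sub"] = item
        return info
    return None


def looks_unhealthy(info):
    """True if there is no Active line, the state is not active, or a failure result is set."""
    return info is None or info["state"] != "active" or info["result"] is not None
```

Parsing human-readable output is fragile; reading the ActiveState, SubState and Result properties via `systemctl show` would probably be more robust, but the sketch sticks to the text that appears in this report.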

gnehapk commented 6 years ago

@julienlim Can you please share the API response of /clusters received on cluster list view.

mbukatov commented 6 years ago

Reposting most interesting part from journalctl -u tendrl-monitoring-integration output, as provided in https://paste.fedoraproject.org/paste/JFnrkiQxtaxrhj1~dE9BEw so that it's possible to find this issue by searching for the error:

Jun 18 02:05:05 tendrl-server tendrl-monitoring-integration[23137]: Load definitions (.yml) for namespace.tendrl.objects.TendrlContext
Jun 18 02:05:05 tendrl-server tendrl-monitoring-integration[23137]: Traceback (most recent call last):
Jun 18 02:05:05 tendrl-server tendrl-monitoring-integration[23137]: File "/usr/bin/tendrl-monitoring-integration", line 9, in <module>
Jun 18 02:05:05 tendrl-server tendrl-monitoring-integration[23137]: load_entry_point('tendrl-monitoring-integration==1.6.3', 'console_scripts', 'tendrl-monitoring-integration')()
Jun 18 02:05:05 tendrl-server tendrl-monitoring-integration[23137]: File "/usr/lib/python2.7/site-packages/tendrl/monitoring_integration/manager/__init__.py", line 71, in main
Jun 18 02:05:05 tendrl-server tendrl-monitoring-integration[23137]: monitoring_integration_manager.start()
Jun 18 02:05:05 tendrl-server tendrl-monitoring-integration[23137]: File "/usr/lib/python2.7/site-packages/tendrl/monitoring_integration/manager/__init__.py", line 31, in start
Jun 18 02:05:05 tendrl-server tendrl-monitoring-integration[23137]: dashboard.upload_default_dashboards()
Jun 18 02:05:05 tendrl-server tendrl-monitoring-integration[23137]: File "/usr/lib/python2.7/site-packages/tendrl/monitoring_integration/grafana/dashboard.py", line 27, in upload_default_dashboards
Jun 18 02:05:05 tendrl-server tendrl-monitoring-integration[23137]: raise ex
Jun 18 02:05:05 tendrl-server tendrl-monitoring-integration[23137]: KeyError: 'id'
Jun 18 02:05:05 tendrl-server systemd[1]: tendrl-monitoring-integration.service: main process exited, code=exited, status=1/FAILURE
Jun 18 02:05:05 tendrl-server systemd[1]: Unit tendrl-monitoring-integration.service entered failed state.
Jun 18 02:05:05 tendrl-server systemd[1]: tendrl-monitoring-integration.service failed.
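
The traceback ends in KeyError: 'id' raised from upload_default_dashboards, which suggests that some dictionary was indexed with 'id' when that key was absent. The actual code is not shown in this thread, so the following is only a sketch of how such a lookup could fail with a more useful message. The function and exception names are hypothetical, and the error payload in the usage note is illustrative, not captured from this setup.

```python
class DashboardUploadError(Exception):
    """Raised when a response does not look like a created dashboard."""


def dashboard_id(response_body):
    """Return the 'id' field of a parsed JSON response, or raise a descriptive error."""
    if isinstance(response_body, dict) and "id" in response_body:
        return response_body["id"]
    # Surface whatever message the server sent instead of a bare KeyError.
    message = None
    if isinstance(response_body, dict):
        message = response_body.get("message")
    raise DashboardUploadError(
        "response has no 'id' field (message: %r)" % (message,)
    )
```

With this, a body such as {"message": "Invalid username or password"} would produce an error that names the underlying problem in the journal rather than only KeyError: 'id'.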

GowthamShanmugam commented 6 years ago

Hi Martin, this happens when there is a password mismatch in the monitoring-integration configuration file. I have already sent Julim the steps to change the password. Let Julim try the steps, and we will see whether that resolves this issue.

I faced the same issue when I gave a wrong grafana password in the monitoring-integration configuration file, so I feel this is the same issue. It is caused by some problem while configuring monitoring-integration and grafana.

For the second problem, the cluster showing as ready to use after the import job fails, I have raised a downstream issue and fixed the problem; the PR is under review upstream. https://bugzilla.redhat.com/show_bug.cgi?id=1593640

Thanks & Regards Gowtham S

mbukatov commented 6 years ago

Thanks for the quick review.

I faced the same issue when I gave a wrong grafana password in monitoring integration configuration file.

If that is the case, isn't there a bug in the vagrant script as well? If I read you right, this is the password which is configured by tendrl-ansible and then stored in a local password file, so that when tendrl-ansible is run again, the same password is used without breaking the current setup.

julienlim commented 6 years ago

@GowthamShanmugam @mbukatov I've experienced this issue twice: (1) with the grafana password not set, and (2) with the grafana password set correctly.

The symptom in both cases is that the tendrl-monitoring-integration agent does not stay up, even after being restarted.

julienlim commented 6 years ago

@GowthamShanmugam @mbukatov I just redeployed and this time when it fails the import, it gives the correct message, i.e. Import Failed.

$ rpm -qa | grep tendrl | sort
tendrl-api-1.6.3-20180626T110501.5a1c79e.noarch
tendrl-api-httpd-1.6.3-20180626T110501.5a1c79e.noarch
tendrl-commons-1.6.3-20180628T114340.d094568.noarch
tendrl-grafana-plugins-1.6.3-20180622T070617.1f84bc8.noarch
tendrl-grafana-selinux-1.5.4-20180227T085901.984600c.noarch
tendrl-monitoring-integration-1.6.3-20180622T070617.1f84bc8.noarch
tendrl-node-agent-1.6.3-20180618T083110.ba580e6.noarch
tendrl-notifier-1.6.3-20180618T083117.fd7bddb.noarch
tendrl-selinux-1.5.4-20180227T085901.984600c.noarch
tendrl-ui-1.6.3-20180625T085228.23f862a.noarch

Side note: I verified that the grafana admin password is properly set, and the issue of the tendrl-monitoring-integration agent not staying up still persists (so the import cannot actually complete successfully).

gnehapk commented 6 years ago

@GowthamShanmugam can you please close this issue if fixed?

GowthamShanmugam commented 6 years ago

This issue is fixed by https://github.com/Tendrl/gluster-integration/pull/691; it happened because of a race condition in the cluster object save.

@gnehapk we can close this issue

Tendrl / ui

Import failing but UI says Cluster is "Ready to use" #995