Tendrl / notifier

Notification delivery component for the Tendrl Core stack.
GNU Lesser General Public License v2.1

Job finished successfully message contains only job id #162

Open mbukatov opened 6 years ago

mbukatov commented 6 years ago

Description of the problem

When I open the Events page of the Tendrl UI, I see events like:

Job finished successfully (job_id: 4207477c-8101-4921-b48a-f66c4d028cb8)

I don't immediately see what kind of job it is.

This is especially confusing when I see a lot of events like that, without any hint about what's wrong (if anything):

screenshot_20180312_164500

Note that in the screenshot above, the message about the successfully finished job repeats after a few minutes.

When I tried to dig deeper, I ran the following on the Tendrl server machine:

# grep -R 4207477c-8101-4921-b48a-f66c4d028cb8 /var/log/
/var/log/tendrl/node-agent/node-agent.log:Mar 12 15:56:49 mbukatov-usm1-server tendrl-node-agent: 2018-03-12 15:56:49.766151+00:00 - node_agent - /usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py:169 - process_job - INFO - Node (76bc408b-e51d-4530-8b30-29ee1f153e60)(type: node)(tags: [u'tendrl/node_76bc408b-e51d-4530-8b30-29ee1f153e60', u'tendrl/integration/monitoring', u'tendrl/central-store', u'tendrl/server', u'tendrl/monitor', u'tendrl/node']) will not process job-4207477c-8101-4921-b48a-f66c4d028cb8 (tags: tendrl/node_6f6e2269-bcf4-4889-82c7-9ba8ed8fb152)
/var/log/messages:Mar 12 15:56:49 mbukatov-usm1-server journal: 2018-03-12 15:56:49.766151+00:00 - node_agent - /usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py:169 - process_job - INFO - Node (76bc408b-e51d-4530-8b30-29ee1f153e60)(type: node)(tags: [u'tendrl/node_76bc408b-e51d-4530-8b30-29ee1f153e60', u'tendrl/integration/monitoring', u'tendrl/central-store', u'tendrl/server', u'tendrl/monitor', u'tendrl/node']) will not process job-4207477c-8101-4921-b48a-f66c4d028cb8 (tags: tendrl/node_6f6e2269-bcf4-4889-82c7-9ba8ed8fb152)

I see only a single log message related to this job (it occurs twice, once in the node agent log and once in the messages log), and I read it as:

Node 76bc408b-e51d-4530-8b30-29ee1f153e60 will not process job 4207477c-8101-4921-b48a-f66c4d028cb8

This doesn't help me much with debugging the event shown above, as it contradicts the original message (job finished successfully).
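
One way to read such a line mechanically is to pull out the node id, the job id, and both tag lists, then check whether the job's tags match any of the node's tags. The following is a minimal sketch, not part of Tendrl: the regular expression and the `parse_skip_line` helper are assumptions modelled only on the single log line quoted above. In that line the job is tagged `tendrl/node_6f6e2269-bcf4-4889-82c7-9ba8ed8fb152`, which is not among the logging node's tags, so the message may only mean that this particular node skipped the job, not that the job failed.

```python
import re

# Pattern modelled on the one log line quoted in this report; other Tendrl
# log messages may be laid out differently (assumption).
SKIP_RE = re.compile(
    r"Node \((?P<node_id>[0-9a-f-]+)\)"
    r".*?\(tags: \[(?P<node_tags>[^\]]*)\]\)"
    r" will not process job-(?P<job_id>[0-9a-f-]+)"
    r" \(tags: (?P<job_tags>[^)]*)\)"
)


def parse_skip_line(line):
    """Return ids and tag sets from a 'will not process job' line, or None."""
    match = SKIP_RE.search(line)
    if match is None:
        return None
    # Node tags are logged as a Python 2 list repr, e.g. [u'a', u'b'].
    node_tags = set(re.findall(r"u?'([^']+)'", match.group("node_tags")))
    # Job tags are logged as plain text; split on commas in case there are several.
    job_tags = {t.strip() for t in match.group("job_tags").split(",") if t.strip()}
    return {
        "node_id": match.group("node_id"),
        "job_id": match.group("job_id"),
        "node_tags": node_tags,
        "job_tags": job_tags,
        # True only if at least one job tag is also a tag of the logging node.
        "tags_overlap": bool(node_tags & job_tags),
    }


# The message part of the log line from this report.
LINE = (
    "Node (76bc408b-e51d-4530-8b30-29ee1f153e60)(type: node)(tags: "
    "[u'tendrl/node_76bc408b-e51d-4530-8b30-29ee1f153e60', "
    "u'tendrl/integration/monitoring', u'tendrl/central-store', "
    "u'tendrl/server', u'tendrl/monitor', u'tendrl/node']) "
    "will not process job-4207477c-8101-4921-b48a-f66c4d028cb8 "
    "(tags: tendrl/node_6f6e2269-bcf4-4889-82c7-9ba8ed8fb152)"
)
info = parse_skip_line(LINE)
```

If `info["tags_overlap"]` is false, the next place to look would presumably be the logs of the node named in the job's tags, although this report does not confirm where the job actually ran.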

Expected Result

The event description could contain more details, e.g. the job type, to improve the information delivered to the user.

Moreover, we will need a description of the job id and how to use it for debugging. In my case, I'm unable to find any useful details for the event to go further.
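
To illustrate the kind of message meant here, the following is a minimal sketch of a formatter that includes the job type when it is known. The function `describe_job_event`, its parameters, and the job name `ExampleFlow` are hypothetical and only serve the illustration; this is not how tendrl-notifier builds its messages.

```python
def describe_job_event(job_id, status, job_name=None):
    """Build an event message that names the job when a name is available.

    All names here are illustrative assumptions, not Tendrl APIs.
    """
    outcomes = {"finished": "finished successfully", "failed": "failed"}
    outcome = outcomes.get(status, status)
    subject = "Job '{}'".format(job_name) if job_name else "Job"
    return "{} {} (job_id: {})".format(subject, outcome, job_id)


JOB_ID = "4207477c-8101-4921-b48a-f66c4d028cb8"
# Without a job name this reproduces the message reported above.
current = describe_job_event(JOB_ID, "finished")
# With a (placeholder) job name the reader sees what kind of job it was.
richer = describe_job_event(JOB_ID, "finished", job_name="ExampleFlow")
```

Falling back to the id-only form keeps the message usable when the job type cannot be determined.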

Version

On Storage Servers:

# rpm -qa | egrep '(gluster|tendrl)'
glusterfs-api-4.1dev-0.115.git685d440.el7.centos.x86_64
glusterfs-events-4.1dev-0.115.git685d440.el7.centos.x86_64
tendrl-gluster-integration-1.6.1-1.el7.centos.noarch
tendrl-node-agent-1.6.1-1.el7.centos.noarch
python2-gluster-4.1dev-0.115.git685d440.el7.centos.x86_64
tendrl-collectd-selinux-1.5.4-2.el7.centos.noarch
glusterfs-fuse-4.1dev-0.115.git685d440.el7.centos.x86_64
glusterfs-server-4.1dev-0.115.git685d440.el7.centos.x86_64
glusterfs-geo-replication-4.1dev-0.115.git685d440.el7.centos.x86_64
tendrl-commons-1.6.1-1.el7.centos.noarch
glusterfs-libs-4.1dev-0.115.git685d440.el7.centos.x86_64
glusterfs-client-xlators-4.1dev-0.115.git685d440.el7.centos.x86_64
glusterfs-cli-4.1dev-0.115.git685d440.el7.centos.x86_64
tendrl-selinux-1.5.4-2.el7.centos.noarch
glusterfs-4.1dev-0.115.git685d440.el7.centos.x86_64

On Tendrl server:

# rpm -qa | egrep '(gluster|tendrl)'
tendrl-grafana-plugins-1.6.1-1.el7.centos.noarch
tendrl-monitoring-integration-1.6.1-1.el7.centos.noarch
tendrl-notifier-1.6.0-1.el7.centos.noarch
tendrl-api-httpd-1.6.1-1.el7.centos.noarch
tendrl-selinux-1.5.4-2.el7.centos.noarch
tendrl-node-agent-1.6.1-1.el7.centos.noarch
tendrl-ui-1.6.1-1.el7.centos.noarch
tendrl-grafana-selinux-1.5.4-2.el7.centos.noarch
tendrl-commons-1.6.1-1.el7.centos.noarch
tendrl-api-1.6.1-1.el7.centos.noarch

mbukatov commented 6 years ago

@fbalak I reported this as a suggestion to provide a better event description to help with debugging. I haven't reported the underlying problem itself, as it's likely caused by some glusterfs problem.

r0h4n commented 6 years ago

@nthomas-redhat Please fix this along with other log message fixes as discussed at (https://docs.google.com/document/d/138SFPUlRqdLjISMcd-Cts-vWzY7wfTGWi8GhdQHnh0Q/edit)

mbukatov commented 6 years ago

At the Architecture Sync-up meeting today, we decided that we are going to address it by:

In the long term, we may need to add a tendrl api endpoint and enhance tendrl ui to show details for a particular job id.

julienlim commented 6 years ago

@r0h4n @mbukatov @nthomas-redhat @a2batic @gnehapk @shirshendu @mcarrano

When looking at this and thinking about event details further, it appears we don't get too much from the Events API at the moment, i.e. message, timestamp, message_id, priority.

We appear to be showing the message and timestamp at the moment.

+1 @mbukatov on needing more details on the particular job.

Here are some things that occur to me when showing the Event Details.

In the Events List, we should be showing a short event message and not a long, verbose event message. Moreover, the priority should be shown as well.

In the Event Details, we would show the event row/item again but with more details, e.g. we should show a long event message, along with its priority (if we don't show it in the Events List). In addition, if we have a category/type for the Event, that would be good to show.

E.g.
Short msg == gluster-195d43d86fd38ba5929e44529d1fa0b985f42f03946e0bb5ada6999805556674 is healthy
Long (current) msg == Health status of cluster: gluster-195d43d86fd38ba5929e44529d1fa0b985f42f03946e0bb5ada6999805556674 changed from unhealthy to healthy

If the event reports that a Job completed or failed, we should show details about the Flow that was run.

E.g.
Current msg == Job finished successfully (job_id: 14e7207a-02d4-4e97-a0c7-214bf71a91e8)
Suggested short msg == (Job ID 14e7207a-02d4-4e97-a0c7-214bf71a91e8) completed successfully

If we're able to, we should ideally make the Job Name and/or Job ID hyperlinkable to the task details to see more details about what was performed.
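
The split between a short message for the Events List and a long message for the Event Details can be sketched as follows. This is only an illustration under assumptions: `message_id`, `timestamp` and `priority` are the fields mentioned above as coming from the Events API, while `short_message`, `long_message`, the helper functions, and every sample value other than the two cluster messages are hypothetical.

```python
from collections import namedtuple

# message_id, timestamp and priority are named in this thread; the short and
# long message fields are hypothetical additions for this sketch.
Event = namedtuple(
    "Event",
    ["message_id", "timestamp", "priority", "short_message", "long_message"],
)


def list_row(event):
    """Events List row: priority first, then the short message."""
    return "[{}] {}".format(event.priority, event.short_message)


def detail_view(event):
    """Event Details: repeat the list row, then add timestamp and long message."""
    return "\n".join([list_row(event), event.timestamp, event.long_message])


CLUSTER = "gluster-195d43d86fd38ba5929e44529d1fa0b985f42f03946e0bb5ada6999805556674"
event = Event(
    message_id="example-message-id",  # placeholder value
    timestamp="example-timestamp",    # placeholder value
    priority="info",                  # placeholder value
    short_message="{} is healthy".format(CLUSTER),
    long_message=(
        "Health status of cluster: {} changed from unhealthy to healthy".format(CLUSTER)
    ),
)
row = list_row(event)
details = detail_view(event)
```

Keeping both messages on the same event object would let the list and the details page render from one record instead of deriving one text from the other.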

Ideally the event details would provide enough details so that it is actionable with guidance on how to resolve it if there's a problem or failure.

Thoughts?

mcarrano commented 6 years ago

I've created an Event Details page to display the details of an event as a drill-down from the events list. This is designed to display the full event message and link to any related resources. See https://redhat.invisionapp.com/share/HVGA7O575AZ#/285313287_Cluster_Details-Event_Detail

I also should note that the Event List, as designed, should display the event severity/priority before the short message. Let me know if you have any questions.

julienlim commented 6 years ago

@r0h4n @mbukatov @nthomas-redhat @a2batic @gnehapk @shirshendu @mcarrano

Please note we've published the Event Details design. See previous comment by @mcarrano.

r0h4n commented 6 years ago

@julienlim

@nthomas-redhat is working on this issue; we are waiting for updates from him.

r0h4n commented 6 years ago

@nthomas-redhat please close this if done