Particular / ServicePulse

Production monitoring for distributed systems.
https://docs.particular.net/servicepulse/

Inconsistency with dashboard indicators #59

Closed johnsimons closed 10 years ago

johnsimons commented 10 years ago

Currently each indicator shows its error count as a red icon in the top-right corner. Eg: two

However, as you can see from the image, the number in the middle sometimes signifies the error count and sometimes the total.

Current notifications

Heartbeats

If we have 4 successful heartbeats, we have a 4 in the middle and no red icon, but if we have 1 failed and 3 successes, we have 3 in the middle and 1 in the red icon. So in essence we subtract failed from total and display the result in the middle.

Error Message

If we have no errors, we display 0 in the middle and no red icon. If we have errors, we display the total errors in both the middle and the red icon.

Custom Actions

If we have no custom actions, we display 0 in the middle and no red icon. If we have custom actions, we display 0 in the middle. If we have failed custom actions, we display the number of failed in the middle and in the red icon.

As you can see, sometimes we display failure counters in the middle and sometimes we don't!
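The three behaviours described above can be sketched as follows. This is only an illustration of the description in this comment, not ServicePulse code; `Display`, `heartbeats`, `error_messages` and `custom_actions` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Display(NamedTuple):
    middle: int            # number drawn in the middle of the indicator
    badge: Optional[int]   # number in the red icon; None means no red icon


def heartbeats(total: int, failed: int) -> Display:
    # Middle shows total minus failed, i.e. the number of successes.
    return Display(total - failed, failed or None)


def error_messages(errors: int) -> Display:
    # Middle shows the failure count itself.
    return Display(errors, errors or None)


def custom_actions(failed: int) -> Display:
    # Middle shows the failure count (0 when nothing has failed).
    return Display(failed, failed or None)


print(heartbeats(total=4, failed=1))
print(error_messages(errors=2))
print(custom_actions(failed=0))
```

With the same kind of input, the middle number means "successes" for heartbeats but "failures" for the other two indicators, which is the inconsistency this issue is about.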

dannycohen commented 10 years ago

@johnsimons - good point.

Thinking about Opie's perspective:

  1. Endpoints: "I need to know how many endpoints are inactive. I also want to know how many are active, so I can make sure the total number is as expected"
  2. Failed messages: "I want to know how many failed messages I have. I do not care (in this context) about the number of successful messages. I just want to know about the failed messages"
  3. Custom Checks: "I want to know how many custom checks failed on all my endpoint instances. It would be nice (but optional) to know how many custom checks passed".

If you agree to the above user perspective description:

Then given the current image below

image

The suggested change would be

  1. Endpoints:
    • in addition to "N Active Endpoints" add the text "X Inactive Endpoints".
    • the number on the top remains as-is, indicating the failure numbers
  2. Failed messages:
    • Leave as-is
  3. Custom Checks:
    • Change text to "Failed Custom Checks"

In general, the guidelines would be:

  1. The indicators do not need to be exactly the same
  2. The indicators do need to be logical and convey the correct status using the correct phrasing
  3. The number in the red circle always indicates a problem count. If it is there - it is the main indicator of a problem (along with the change in indicator color to red)
  4. The text and numbers in the indicators will indicate status information as it is relevant and interesting per indicator (no need to find a phrasing so that "one size fits all" indicators).

Thoughts ?

johnsimons commented 10 years ago

Should we also have for Custom Checks: x successful custom checks

johnsimons commented 10 years ago

we may need to increase the size of the indicators to fit the text

dannycohen commented 10 years ago

Should we also have for Custom Checks: x successful custom checks

I don't think it's required. It may even be too much info at this stage. For example:

  1. I developed 3 custom checks
  2. I deployed them to 4 endpoint instances
  3. I therefore have 12 custom checks running
  4. 1 endpoint fails, so I see "3 Failed custom checks" (details in custom checks page)
  5. Adding "9 successful custom checks" without adding the details of those custom checks makes it awkward, IMO.

I suggest we wait with that for customer feedback.
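The arithmetic in the example above can be written out as a small sketch. It assumes, as the example does, that every custom check deployed to a failing endpoint instance fails; `custom_check_counts` is a hypothetical helper, not part of ServicePulse.

```python
def custom_check_counts(checks: int, instances: int, failing_instances: int) -> dict:
    """Count custom checks when each check is deployed to every instance."""
    total = checks * instances
    # Assumption: all checks on a failing instance are reported as failed.
    failed = checks * failing_instances
    return {"total": total, "failed": failed, "passed": total - failed}


counts = custom_check_counts(checks=3, instances=4, failing_instances=1)
print(counts)  # {'total': 12, 'failed': 3, 'passed': 9}
```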

dannycohen commented 10 years ago

we may need to increase the size of the indicators to fit the text

Or we can wrap the text. e.g.:

3 Failed
Custom Checks

Either way - I think it is less critical than getting the phrasing right.

indualagarsamy commented 10 years ago

@dannycohen, @johnsimons - How about this? We consistently don't display anything in the center of the indicator. The red number at the top indicates the number of failures in all of the indicators.

Clicking on the indicator itself will display additional details. For example, clicking on the heartbeat indicator will show you the details page with the active and the inactive.

johnsimons commented 10 years ago

I like that, the middle should just contain the title of the indicator, no numbers

dannycohen commented 10 years ago

@johnsimons / @indualagarsamy -

I don't like this idea if:

  1. Endpoints: We will not have a number for Active vs. Inactive endpoints
  2. Custom Checks: The failed custom checks indicator will have only the text "Failed Custom checks" (which is somewhat ambiguous) and not "0 Failed Custom checks" (which clarifies everything is OK)

johnsimons commented 10 years ago

The fact that the indicator is green and it doesn't have a red number on the top right corner IMHO is enough to indicate that everything is OK.

indualagarsamy commented 10 years ago

I agree with @johnsimons.

andreasohlund commented 10 years ago

+1

dannycohen commented 10 years ago

  1. On the active endpoints there must be a clear indication of the number of active and monitored endpoint instances.
  2. I can live with "Failed Custom Messages" not having a number in it.

johnsimons commented 10 years ago

On the active endpoints there must be a clear indication of the number of active and monitored endpoint instances.

Maybe that should be displayed somewhere else, not in the indicator ?

dannycohen commented 10 years ago

I'm open to suggestions, but I feel strongly that the indicator is where the number of active endpoints should be.

The number of monitored endpoints is an intrinsic indication that things are OK (or not, if that number is not as expected).

It should be close to the inactive endpoints indicator so a simple calculation would allow Opie to see that all the endpoints are accounted for. e.g.:

  1. Opie expects to be monitoring 20 endpoints
  2. Opie sees that 3 endpoints have failed
    • If that is the only piece of information displayed - Opie is unable to verify that all the endpoints are accounted for
  3. Opie sees that 16 endpoints are active
    • This means that 1 endpoint is unaccounted for. Opie needs to see that and investigate / take action. Is it properly configured ? is it down ?

Having the number "16 active endpoints" placed away from the number of failed endpoints makes this kind of validation of number of monitored endpoints harder and less intuitive for Opie.
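The check Opie does in his head in the example above amounts to a single subtraction. A minimal sketch, using a hypothetical `unaccounted_endpoints` helper:

```python
def unaccounted_endpoints(expected: int, active: int, failed: int) -> int:
    """Endpoints Opie expects to see that are reported neither as active nor as failed."""
    return expected - (active + failed)


# Numbers from the example: 20 expected, 16 active, 3 failed.
missing = unaccounted_endpoints(expected=20, active=16, failed=3)
print(missing)  # 1
```

The calculation needs both the active and the failed counts, which is why the two numbers should be shown close together.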

andreasohlund commented 10 years ago

How about the number faded and embedded in the background of the indicator? (sort of to make it more gamified)

indualagarsamy commented 10 years ago

@dannycohen - The indicator flags any failure that Opie needs to be concerned about. I think the fact that there are 16 active endpoints can be displayed when Opie clicks on the endpoints heartbeat indicator. Adding more stuff to the indicator could become misleading. Keeping the indicator in a binary state is much clearer: a red number indicates a failure, and to find out more, Opie will drill down. Adding more information to the indicator pollutes that. Just my 2 cents.

dannycohen commented 10 years ago

The goal of the indicators design is to provide at-a-glance indication of the status. Forcing Opie to click or hover in order to get that info does not fit this at-a-glance approach (for example, SP indicators may be displayed on a screen monitor, that does not have an accessible mouse)

indualagarsamy commented 10 years ago

At a glance, looking at the dashboard even on a big screen, red is bad, green is good. What does the number 20 active endpoints mean on a screen that is not even clickable? What I am trying to say is that the dashboard is already grabbing Opie's attention when needed. When it is green, i.e. the indicator is not doing jumping jacks, all is well with the world, and I am not sure Opie is going to pay any attention to the screen. That was my sense working with Ops guys in a different land.

dannycohen commented 10 years ago

@indualagarsamy - I agree there's a smaller chance of Opie paying attention to the indicator when it is green, but this does not mean we don't need to show it.

When it is green, the active endpoints accounting is a special case where we are not able to say whether it is really green, or whether we have a false positive (i.e. is 19 the right number or not ? only Opie can know). Because of that, we must display the number and not hide it.

Same applies - only more so - when it is red - we need both the active and inactive endpoints number to indicate the existence or absence of endpoints that are "unaccounted for".

dannycohen commented 10 years ago

@johnsimons / @indualagarsamy / @andreasohlund - How about the following:

  1. We need to display in the dashboard the number of ServicePulse monitored servers (derived from SP license policy requirements; see https://github.com/Particular/Housekeeping/issues/107#issuecomment-28196933)
    • We can add the number of total monitored endpoints
    • E.g. we will add a display of: "Monitoring status: monitoring 20 endpoints on 5 servers"
  2. This, being visible on the dashboard, will relieve us from the need to display the number of active endpoints on the heartbeats indicator

Make sense ?
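As a rough sketch of how such a status line could be derived, assuming the monitored endpoint instances are available as (endpoint name, server name) pairs. That data shape and the `monitoring_status` function are assumptions for illustration, not ServicePulse or ServiceControl APIs.

```python
from typing import Iterable, Tuple


def monitoring_status(instances: Iterable[Tuple[str, str]]) -> str:
    """Build the status text from (endpoint name, server name) pairs."""
    pairs = list(instances)
    endpoints = len(pairs)
    servers = len({server for _, server in pairs})  # distinct server names
    return f"Monitoring status: monitoring {endpoints} endpoints on {servers} servers"


# Hypothetical data: 20 endpoints spread over 5 servers.
sample = [(f"endpoint-{i}", f"server-{i % 5}") for i in range(20)]
print(monitoring_status(sample))
```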

johnsimons commented 10 years ago

I thought I suggested that previously!

dannycohen commented 10 years ago

So consider me convinced (better late than never...). Indu and I will discuss this later today (your night) with Sergio. We'll come up with something and see if it fits the vision.

fafachd commented 10 years ago

@johnsimons - You addressed the dashboard-indicator part of Issue #63; what about the part related to the Custom Checks page?

johnsimons commented 10 years ago

@fafachd agree, we have to do that too

dannycohen commented 10 years ago

@johnsimons / @fafachd - See https://github.com/Particular/ServiceControl/issues/173

johnsimons commented 10 years ago

Here is what I am thinking of doing: image

Thoughts?

johnsimons commented 10 years ago

@dannycohen do we have anything yet? Is this really required for v1?

dannycohen commented 10 years ago

@johnsimons - @sergioc will be available this week.

Is this really required for v1?

You opened it, what do you think ?

johnsimons commented 10 years ago

If Sergio can get it done this week then let's get it in otherwise postpone.

dannycohen commented 10 years ago

@johnsimons - Got it.

@sergioc - I'll schedule a short sync on this. Stay tuned... :-)

dannycohen commented 10 years ago

@indualagarsamy / @johnsimons - I discussed the issue with @sergioc and he will work on an initial proposal for the indicators display (ASAP). The goal is to implement this display fix by the end of February.

Later (~March) we will start working on an overhaul to the SP design (for post v1).

sergioc commented 10 years ago

Updated indicators display:

spquickfixfebruary

Note that the "0 failed messages" under "Monitoring" is not entirely consistent with the rest of the list under monitoring. Thoughts on which other status info could be displayed there related to message monitoring?

dannycohen commented 10 years ago

@sergioc - the failed messages should show the up-to-date number (i.e. 6), otherwise it will not make sense (you are monitoring 6 failed messages...)

sergioc commented 10 years ago

Correction + improvement to no. error indicator:

quickfixfebruary

johnsimons commented 10 years ago

@indualagarsamy what do you think? To me it looks quite similar to what I posted here

dannycohen commented 10 years ago

@indualagarsamy / @johnsimons - You OK with this ?

fafachd commented 10 years ago

I like it! I assume the "3 Endpoints" and "0 Custom Checks" will be links too. I would be interested in seeing a mock up of the Custom Checks page showing both failing and passing checks.

And heck, while I'm asking for stuff... I would love to see graphs showing historical endpoint throughput, and possibly graphs of endpoint down times :)

dannycohen commented 10 years ago

@johnsimons - let's go with @sergioc's proposal (https://github.com/Particular/ServicePulse/issues/59#issuecomment-34927851)

// CC @indualagarsamy

indualagarsamy commented 10 years ago

@dannycohen - The original discrepancy as reported by the original issue has been resolved. For any UI enhancements (how it looks, where, what needs to be moved, etc.), let's open an enhancement issue in the requirements repo, so it can be properly prioritized. I am closing this issue. If you have any questions, let's sync up and iron out the details.

indualagarsamy commented 10 years ago

After further discussion, we'll remove the summary statistics group view for v1.0. In v1.1, we'll add:

  1. The ability to view the registered custom checks and to display the failed custom checks count.
  2. We'll enhance the user interface so that this information is displayed next to each indicator (i.e. the endpoints indicator will show the active and failed endpoint counts, and the custom checks indicator will show the registered vs. failed counts)