Particular / ServiceControl

Backend for ServiceInsight and ServicePulse
https://docs.particular.net/servicecontrol/
Other
52 stars 47 forks source link

Add SLR indicator header #65

Closed dannycohen closed 8 years ago

dannycohen commented 11 years ago

As Opie, I would like to be notified when there are many or increasing number of SLRs performed, so I can investigate further if this is an indication of system instability (and I will be able to take corrective actions before it deteriorates further).

Implementation proposal: SecondLevelRetriesPerformed : true|{false}

Note: In order to conserve header space, avoid sending the header when SLR were not performed (i.e. the very existence of the header is an indication of it being true; the value of false is optional).

andreasohlund commented 11 years ago

@johnsimons please go ahead with this one

indualagarsamy commented 10 years ago

@dannycohen - what is the associated issue in SP?

dannycohen commented 10 years ago

This is currently a bug affecting existing functionality in SI (https://github.com/Particular/ServiceInsight/issues/28)

There are plans for a post V1 SP indicator for SLR requests, but its a low prio in the backlog. Why ?

johnsimons commented 10 years ago

I thought we came to the conclusion that we can't supply this information because those headers are wiped by NSB. Also those headers are an internal implementation concern that can change. On top of it SLRs can be turned off by the user or/and completely custom replaced.

On Wednesday, November 6, 2013, Danny Cohen wrote:

This is currently a bug affecting existing functionality in SI ( Particular/ServiceInsight#28https://github.com/Particular/ServiceInsight/issues/28 )

There are plans for a post V1 SP indicator for SLR requests, but its a low prio in the backlog. Why ?

— Reply to this email directly or view it on GitHubhttps://github.com/Particular/ServiceControl/issues/65#issuecomment-27866387 .

Regards John Simons NServiceBus

dannycohen commented 10 years ago

I thought we came to the conclusion that we can't supply this information because those headers are wiped by NSB.

I was under the impression the number of SLR's is wiped, but that we can check whether or nor SLR's occurred (i.e. a boolean parameter).

Also those headers are an internal implementation concern that can change.

The fact that retries occurr repeatedly may indicate the system is somewhat unstable. A growing trend of retries is similarly significant to Opie and team. Therefore, although it may be implemented as an internal feature, but it is of value and importance to users.

andreasohlund commented 10 years ago

Correct, we can provide a boolean parameter

On Wed, Nov 6, 2013 at 2:20 PM, Danny Cohen notifications@github.comwrote:

I thought we came to the conclusion that we can't supply this information because those headers are wiped by NSB.

I was under the impression the number of SLR's is wipred, but that we cna check whether or nor SLR's occurred (i.e. a boolean parameter).

Also those headers are an internal implementation concern that can change.

The fact that retries occurr repeatedly may indicate the system is somewhat unstable. A growing trend of retries is similarly significant to Opie and team. Therefore, although it may be implemented as an internal feature, but it is of value and importance to users.

— Reply to this email directly or view it on GitHubhttps://github.com/Particular/ServiceControl/issues/65#issuecomment-27872361 .

indualagarsamy commented 10 years ago

@dannycohen @andreasohlund - After a discussion with @johnsimons and @SimonCropp, rather than add a header which is a boolean to say if SLR was invoked it makes more sense to add a header which provides the total number of retries that occured, and this information will be available on messages that were successfully processed. So, by looking at the audit message and looking at the number of retries, we can identify problems.

This count will be the net total (of FLR + SLR) that occured.

Will that satisfy the requirement?

johnsimons commented 10 years ago

This will allow the users to sort by this column in descending order and see which messages get retried most :)

dannycohen commented 10 years ago

@indualagarsamy / @johnsimons / @andreasohlund -

  1. Having the SLR count is nice, and that was the original content of the existing header.
  2. Due to a bug fix, we concluded with @johnsimons and @andreasohlund that this is not possible, and that a boolean value for SLR is the best remaining option.
  3. If you can pass it as a number - great.
  4. If you can pass it as a boolean parameter - more than good enough.

Let me know.

johnsimons commented 10 years ago

What we are proposing is for us to have a new header that will contain the number of times that message was retried.

dannycohen commented 10 years ago

What we are proposing is for us to have a new header that will contain the number of times that message was retried.

There is such a header and it is actively removed in order to fix a bug (see https://github.com/Particular/ServiceInsight/issues/20#issuecomment-24104404).

Are you saying we can un-remove it ?

johnsimons commented 10 years ago

No, I'm proposing a completely new header :)

On 13 November 2013 18:38, Danny Cohen notifications@github.com wrote:

What we are proposing is for us to have a new header that will contain the number of times that message was retried.

There is such a header and it is actively removed in order to fix a bug (see Particular/ServiceInsight#20 (comment)https://github.com/Particular/ServiceInsight/issues/20#issuecomment-24104404 ).

Are you saying we can un-remove it ?

— Reply to this email directly or view it on GitHubhttps://github.com/Particular/ServiceControl/issues/65#issuecomment-28372929 .

Regards John Simons NServiceBus

dannycohen commented 10 years ago

@johnsimons - Whatever makes you happy... :-)

dannycohen commented 10 years ago

@indualagarsamy / @andreasohlund / @johnsimons - guys, where are we on this ?

// CC @HEskandari

andreasohlund commented 10 years ago

This turned out to be nontrivial, can we bump this?

On Fri, Dec 13, 2013 at 8:26 AM, Danny Cohen notifications@github.comwrote:

@indualagarsamy https://github.com/indualagarsamy / @andreasohlundhttps://github.com/andreasohlund/ @johnsimons https://github.com/johnsimons - guys, where are we on this ?

// CC @HEskandari https://github.com/HEskandari

— Reply to this email directly or view it on GitHubhttps://github.com/Particular/ServiceControl/issues/65#issuecomment-30491317 .

andreasohlund commented 10 years ago

@indualagarsamy https://github.com/indualagarsamy can you add the details on the issues we found

On Fri, Dec 13, 2013 at 8:42 AM, Andreas Öhlund < andreas.ohlund@particular.net> wrote:

This turned out to be nontrivial, can we bump this?

On Fri, Dec 13, 2013 at 8:26 AM, Danny Cohen notifications@github.comwrote:

@indualagarsamy https://github.com/indualagarsamy / @andreasohlundhttps://github.com/andreasohlund/ @johnsimons https://github.com/johnsimons - guys, where are we on this ?

// CC @HEskandari https://github.com/HEskandari

— Reply to this email directly or view it on GitHubhttps://github.com/Particular/ServiceControl/issues/65#issuecomment-30491317 .

dannycohen commented 10 years ago

@andreasohlund -

can we bump this?

NP, but it would be good to record the diagnosis and the prognosis.

andreasohlund commented 10 years ago

Indu has the details

In short: We need the new pipeline hooks in NServiceBus 4.4.0 to pull this off

On Fri, Dec 13, 2013 at 8:48 AM, Danny Cohen notifications@github.comwrote:

@andreasohlund https://github.com/andreasohlund -

can we bump this?

NP, but it would be good to record the diagnosis and the prognosis.

— Reply to this email directly or view it on GitHubhttps://github.com/Particular/ServiceControl/issues/65#issuecomment-30492040 .

dannycohen commented 10 years ago

In short: We need the new pipeline hooks in NServiceBus 4.4.0 to pull this off

I'm fine with that.

johnsimons commented 10 years ago

This turned out to be nontrivial, can we bump this?

What was the issue @indualagarsamy ?

indualagarsamy commented 10 years ago

Trying to add the header, we realized that there is no easy way to count the FLRs the way it stands now.

johnsimons commented 10 years ago

What is wrong with adding it in https://github.com/Particular/NServiceBus/blob/develop/src/NServiceBus.Core/Unicast/Transport/FirstLevelRetries.cs#L49 ?

dannycohen commented 10 years ago

Assigning to backlog, pending SC / SP roadmap alignment

johnsimons commented 10 years ago

It can be reopen if we decide to do it

udidahan commented 10 years ago

There is another stakeholder here. As a salesperson for Particular Software, when talking to other departments at our larger customers, I would like to be able to tell them how many messages NServiceBus not only prevented from failing but handled those failures automatically without any manual intervention thus clarifying the value proposition versus the team using HTTP or rolling their own solution on top of a queue.

dannycohen commented 10 years ago

@udidahan - how would you prioritize it ? (post LM3, of course)

udidahan commented 10 years ago

That's a question to be answered via the Trello board, taking into account all the other things we have for LM4.

andreasohlund commented 9 years ago

I think the way to go here is that instead of a header we should create a plugin that hooks into the notifications, http://docs.particular.net/nservicebus/errors/subscribing-to-push-based-error-notifications. This way we can let the users set threshold for flr/slr per second and have the plugin notify SC when that happens?

Arguable this could be implemented and distributed as a custom check?

We could start by adding this as a sample to gauge interest?

johnsimons commented 8 years ago

@Particular/servicecontrol-maintainers I propose we close this one until we ready to tackle it?

SimonCropp commented 8 years ago

@johnsimons if u close it how would distinguish it from all the other issues that are closed "as fixed"?

johnsimons commented 8 years ago

@SimonCropp no milestone associated

SimonCropp commented 8 years ago

@johnsimons so u regularly go through the 268 closed issues with no milestone as part of your backlog pruning? https://github.com/Particular/ServiceControl/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed

johnsimons commented 8 years ago

Not sure what you are talking about @SimonCropp. If we close this issue and later this topic resurfaces, I am pretty sure we can find it and reopen.

andreasohlund commented 8 years ago

I have a demo going that shows how to do this without a header and instead using a custom check to "alert" in SP when SLR rates go up. How about I get that recorded and we close this one as won't fix?

andreasohlund commented 8 years ago

Still agree with @SimonCropp that we shouln't just close issues that we still might act on. Perhaps move it some where or add a bullet point to some more generic "crazy SC ideas" issue.

How about we discuss this in plat dev since this is relevant for all our repos?

mauroservienti commented 8 years ago

Still think agree with @SimonCropp that we shouln't just close issues that we still might act on.

Agree.

johnsimons commented 8 years ago

we shouln't just close issues that we still might act on

In my view, closing a feature request does not imply that we will not be acting on, it just means that right now the focus is elsewhere. It just help us concentrate on issues/features that are currently important. Having this massive backlog of open issues IMO makes it difficult to figure out where we going. The way I see it, is open feature requests that we do not intend to work on (right now), are just clutering the system, and also sending the wrong message to user out there, the fact is that we have no idea when we are going to tackle any of these or even if we ever going to do it at all.

I have a demo going that shows how to do this without a header and instead using a custom check to "alert" in SP when SLR rates go up

@andreasohlund that would be great, looking forward to it.

gbiellem commented 8 years ago

@johnsimons

As @andreasohlund said let's have the discussion about do we or don't we close issues in platform dev.

mauroservienti commented 8 years ago

yeah, let's discuss it. I don't think we should close issues, a closed issue is a lost one, IMO. If a huge backlog is putting pressure on us it's a tolling problem (and to be honest GitHUb issues suck) since we have no way (other than the waffle boards, that are not that much better) to prioritize and look only at the top N issues. Also given that we have no milestones set in stone priorities may change overtime and is very easy to lose track of closed issues that were closed because of lack of capacity or out of scope at that time.

johnsimons commented 8 years ago

@mauroservienti it is quite simple to keep track of feature requests that were closed. This query gives u all issues that are closed and have no associated milestone and are not bugs https://github.com/Particular/ServiceControl/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aclosed+no%3Amilestone+-label%3A%22Type%3A+Bug%22+

johnsimons commented 8 years ago

This is currently not prioritised to be done any time soon, I will close this one for now and if we ever go down this path we can reopen it