Stop attaching performance data on audit messages

andreasohlund commented 8 years ago

We're currently attaching things like ProcessingStartedAt and ProcessingCompletedAt timestamps to our audit messages. This data is used in SI to show processing time and critical time statistics.

This has the following issues:

Adding data to the message risk pushing the message size over the size limit of the transport. This is a general problem that we have with other headers as well.
The data isn't 100% accurate.

Let me explain the data accuracy issue in more detail.

Since we need to set the headers before we send the message to the audit queue the ProcessingCompletedAt header won't actually reflect the true "processing completed" since the actual sending of the audit message won't be included. What's even worse is that in v6 we batch all outgoing messages and since that happens after the audit message has been "sent" we no longer capture the time it takes to do any outgoing operations. While this is super quick on MSMQ it can take a considerable time on broker transports like ASB and Rabbit.

In general I propose we send data like this "out of band" to monitoring tools like SC since that's the only way for us to be able to get 100% accurate data.

Is the current performance counter accurate?

The performance counter for eg. critical time is more accurate since it captures the time for the entire receive pipeline including the batched sends. But it doesn't capture things like committing the transport transaction, that given DTC can take some time. The only way to get 100% accurate numbers would be to force the transport to "communicate" this to us using some kind of new API in the future.

johnsimons commented 8 years ago

Regarding

What's even worse is that in v6 we batch all outgoing messages

So you saying audit is part of batch ?

johnsimons commented 8 years ago

To me auditing of a message is more an operations concern and therefore it is an outside workflow, so from a developer consumer and usage of that data in SI, I would expect those timings not to be included. As a developer I am looking at the performance associatted with processing x message and debug that, so when OPS comes along and enables logging that feels to me that is outside my "jurisdiction". I am not saying that we should not time how long audits take to be sent out but it feels like that timing is more relevant for an OPS person when doing monitoring, not from SI debugging perspective. Obviously this is all different if you look at it from another perspective.

Maybe we just need more granular performance data, to cater for both teams :wink:

danielmarbach commented 8 years ago

I'd argue auditing should totally go out of band and not be part of the main pipeline. Of course we would need to deal with potential ghost messages and such...

johnsimons commented 8 years ago

I'd argue auditing should totally go out of band and not be part of the main pipeline

:+1:

andreasohlund commented 8 years ago

So you saying audit is part of batch ?

Yes it currently is

I'd argue auditing should totally go out of band and not be part of the main pipeline. Of course we would need to deal with potential ghost messages and such...

Let's discuss this as a separate issue, I'm :+1: , can you raise it @danielmarbach ?

To me auditing of a message is more an operations concern and therefore it is an outside workflow, so from a developer consumer and usage of that data in SI, I would expect those timings not to be included.

Critical time is defined as:

Monitors the age of the oldest message in the queue. This takes into account the whole chain, from the message being sent from the client machine until successfully processed by the server

http://docs.particular.net/nservicebus/operations/performance-counters#nservicebus-performance-counters-critical-time

So from and end users perspective they want an accurate metric on "successfully processed". I'd argue that successfully processed means: When the transport has acked the message as received (or commited the DTC tx etc). If audit is done as part of the receive, which it currently is, it should count. (we can discuss changing this as a separate issue like @danielmarbach mentions)

Note:

The current perfcounter doesn't take sending out any messages into account since they run before the batching step. This is a bug, I'll raise a new issue for that
We currently have no way of measuring hooking into "receive transaction committed", we would need to modify the transport seam to get this. In short this means that the ASB/ASQ/RabbitMQ "ack" call isn't included in the timing. Also if we blows there the perf counter would still report a critical time value even though the message wasn't actually processed. Similarly the time it takes for the DTC to commit is not included as well.
The headers sent to audit will not report the same value as the performance counter. Where the audit header is the most inaccurate since it won't include the time for any of the outgoing operations.

@udidahan with your domain hat on, how would you rate the impact on the flaws mentioned above? (assuming I'm correct)

I believe we need to fix the counter to at least include the sends.

andreasohlund commented 8 years ago

Had a chat with @udidahan , this is what we propose:

Fix the performance counter bug mentioned above for v6
Note in the upgrade guide that the critical and processing time in SI will not include outgoing ops for v6 and guide users towards the perfcounters if they need better precision
Raise an issue to track us moving to a model where all audit data is sent out of band, ie no extra headers added to audit messages (not in scope for v6)
Raise an issue to modify the transport seam to allow the core to hook into "receive operation committed" to allow us to get better accuracy in the future (not in scope for v6)
Document that "successfully processed" means when the receive operation is fully committed but add a note to http://docs.particular.net/nservicebus/operations/performance-counters#nservicebus-performance-counters-critical-time that there is a known issue that we're currently not including the time it takes to "commit" the receive operation

I will close this issue once the above issues have been raised

cc @Particular/nservicebus-maintainers

abombss commented 8 years ago

I would disagree around taking auditing out of the pipeline. Its critical that we can receive an accurate and consistent ACK when a message is processed.

I would propose taking the time to split the concerns in two since the work is being deferred to vNext.

Archiving of a message with its payload
Acknowledgement of successfully handling a message

On the send side we do archiving. Everything sent is archived and it doesn't require an additional message, its an exact copy of what was sent on the wire. We achieve this with clever tricks on our ASB transport, but something similar could be done for other transports. it could also be configurable by Intent so I could choose to do my archiving on the send side for Events and on the receive side for Commands if needed or not at all. If its pluggable folks can choose to archive over whatever channel they want, it doesn't need to be on the same transport channel as the send.

The acknowledgement is critical for our operations, and it must happen in the same tx as the receive. We cannot deal with ghost ACKs as that would defeat the purpose having the ACK in the first place. Making this pluggable is fine for folks who can deal with the risk and just want some rough stats but it must still be able to be part of the pipeline and part of the tx.

Since the ACK would no longer require the payload it would not be a problem to include lots of additional metadata about processing. I think there is an opportunity to make this extensible as well so if there is user tracking data that we want added for these things we could easily plug into that and add our own stuff.

I think this a huge breaking change and also an opportunity to come up with something better.

andreasohlund commented 8 years ago

I would disagree around taking auditing out of the pipeline

Just to clarify, we're still sending the audit message as a part of the processing pipeline. We're just saying to send the "extra audit stuff" out of band.

Ie: Audit message is still sent just like today(ie the processed message will be immutable), but all audit stuff like causation header, processing stats etc will be sent out of band.

This avoids:

The audit message being bigger than the queuing system can handle
We can get accurate timing etc

Does this make sense?

abombss commented 8 years ago

Yes that makes sense, but that doesn't solve the message size problem for errors. The deep stacktraces with the new pipeline add far more payload than two timestamps.

Introducing a concept of archiving that is responsible for the payload and new diagnostic messages for ACK and ERROR that are just references would solve both problems.

It would also add the capability to save all messages regardless of if there was consumer on the other end. We value the archiving as a must have feature which is why we had to roll own implementation.

Also, not including the diagnostics in the audit message would be a major breaking change for us. We would need to rework a lot of our operations tooling since many of the metrics we monitor are tied directly to the audit message. Even if the numbers aren't accurate, it still provides a good relative baseline for us to identify problems with the broker or the endpoints.

andreasohlund commented 8 years ago

Yes that makes sense, but that doesn't solve the message size problem for errors. The deep stacktraces with the new pipeline add far more payload than two timestamps.

Agreed, we are talking about doing the same for errors. https://github.com/Particular/NServiceBus/issues/2368

not including the diagnostics in the audit message would be a major breaking change for us. We would need to rework a lot of our operations tooling

Correct, this is not happening in this major release, we'll give you plenty of heads up if/when this happens

abombss commented 8 years ago

Thanks

On Mon, Mar 7, 2016, 3:18 AM Andreas Öhlund notifications@github.com wrote:

Yes that makes sense, but that doesn't solve the message size problem for errors. The deep stacktraces with the new pipeline add far more payload than two timestamps.

Agreed, we are talking about doing the same for errors. #2368 https://github.com/Particular/NServiceBus/issues/2368

not including the diagnostics in the audit message would be a major breaking change for us. We would need to rework a lot of our operations tooling

Correct, this is not happening in this major release, we'll give you plenty of heads up if/when this happens

— Reply to this email directly or view it on GitHub https://github.com/Particular/NServiceBus/issues/3494#issuecomment-193149368 .

andreasohlund commented 7 years ago

Relevant actions points is tracked by https://github.com/Particular/NServiceBus/issues/2368 and the other referenced issues. Closing this discussion

Particular / NServiceBus

Stop attaching performance data on audit messages #3494

Is the current performance counter accurate?