Particular / ServiceControl

Backend for ServiceInsight and ServicePulse
https://docs.particular.net/servicecontrol/

Error retention period min max value does not make sense #956

Open ramonsmits opened 7 years ago

ramonsmits commented 7 years ago

The MAX value for the error retention period is 45 days, but for the audit retention period it's 1 year. The MIN value is 10 days, compared to 1 hour for audits.
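
To make the asymmetry concrete, here is a minimal sketch of the two ranges as a validation check. The names `BOUNDS` and `is_valid_retention` are hypothetical and are not taken from ServiceControl; the limits are the ones quoted above, with "1 year" assumed to mean 365 days.

```python
from datetime import timedelta

# Limits as quoted in this issue. "1 year" is assumed to be 365 days.
# BOUNDS and is_valid_retention are illustrative names, not ServiceControl APIs.
BOUNDS = {
    "error": (timedelta(days=10), timedelta(days=45)),
    "audit": (timedelta(hours=1), timedelta(days=365)),
}


def is_valid_retention(kind: str, period: timedelta) -> bool:
    """Return True if the period lies within the inclusive range for this kind."""
    low, high = BOUNDS[kind]
    return low <= period <= high
```

With these ranges a 2-day period is accepted for audits but rejected for errors, and a 90-day period is likewise accepted only for audits.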

To me it makes sense that:

SzymonPobiega commented 7 years ago

@ramonsmits I think it makes sense to have a shorter retention period for audits because they are, by definition, less critical. That said, the max retention for errors should probably be longer.

gbiellem commented 7 years ago

It's a shame John is no longer on the team, since he came up with those numbers and I don't remember the logic behind them.

Regarding keeping errors longer: the retention policy only applies to errors that have been resolved. So essentially, either a user has successfully retried the message, in which case we now have an audit record, or the user has marked it as ignored. In either case, I'd say it's noise that shouldn't be kept long term.

SzymonPobiega commented 7 years ago

@gbiellem ok, then +1 for keeping it as-is.

gbiellem commented 7 years ago

I do agree about the min value though - I can't think of a good reason for the ten days.

mikeminutillo commented 7 years ago

Yeah, the reasoning behind the min was to do with the risk of getting it wrong. Not all systems have auditing switched on, so we can't say definitively that a retried message has been processed. With that in mind, the error retention policy does delete errors in the RetryIssued state.

Just because a Retry has been issued doesn't mean it went anywhere. Retried messages can be stuck in the outgoing queue on a machine, the endpoint they were retried to might have been decommissioned, or they can be sitting in a DLQ somewhere. It can take some time to figure that out, and the system has a "re-retry" option (and a redirects option).

The 10 day minimum was to enforce a window for people to discover that retried messages were stuck somewhere and do something about it. Any shorter and you run the risk that a message Retried on a Friday is gone by the time someone comes in after the long weekend and realizes something is off.

mikeminutillo commented 7 years ago

Also, that's 10 days from when the error is retried, not 10 days from when the error occurred.
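
A rough sketch of that rule, to show why the window matters. This is not the actual ServiceControl implementation: the `Status` values other than the RetryIssued state mentioned above, the `FailedMessage` type, its fields, and `is_expired` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum, auto


class Status(Enum):
    UNRESOLVED = auto()
    RETRY_ISSUED = auto()   # retry sent, outcome not confirmed
    RESOLVED = auto()       # retried and confirmed processed
    ARCHIVED = auto()       # marked as ignored by a user


@dataclass
class FailedMessage:
    status: Status
    failed_at: datetime            # when the error occurred (not used for expiry)
    last_status_change: datetime   # e.g. when the retry was issued


def is_expired(msg: FailedMessage, now: datetime, retention: timedelta) -> bool:
    """Unresolved errors never expire; for the rest, the clock starts at the
    last status change, not at the time of the original failure."""
    if msg.status is Status.UNRESOLVED:
        return False
    return now - msg.last_status_change >= retention


# A message that failed weeks earlier is retried on Friday afternoon and
# nobody looks at it again until Tuesday morning after a long weekend.
msg = FailedMessage(
    status=Status.RETRY_ISSUED,
    failed_at=datetime(2017, 2, 1, 9, 0),
    last_status_change=datetime(2017, 3, 3, 17, 0),  # a Friday
)
tuesday = datetime(2017, 3, 7, 9, 0)
print(is_expired(msg, tuesday, timedelta(days=3)))   # short window: already eligible for deletion
print(is_expired(msg, tuesday, timedelta(days=10)))  # 10-day minimum: still available to re-retry
```

Under these assumptions a 3-day retention period would already have removed the stuck message by Tuesday, while the 10-day minimum keeps it around.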

mikeminutillo commented 7 years ago

Personally, I'd rather see retention based on conversations, i.e. if a conversation hasn't received any new messages in the last 90 days, then delete all messages associated with it.
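
As a sketch of what that could look like (nothing like this exists in ServiceControl as far as this thread shows): messages are assumed to be available as `(conversation_id, timestamp)` pairs, and `expired_conversations` is a hypothetical helper name.

```python
from datetime import datetime, timedelta
from typing import Iterable, Set, Tuple


def expired_conversations(
    messages: Iterable[Tuple[str, datetime]],
    now: datetime,
    idle: timedelta = timedelta(days=90),
) -> Set[str]:
    """Return the ids of conversations whose most recent message is at least
    `idle` old. Every message in such a conversation would be deleted together."""
    last_seen = {}
    for conversation_id, timestamp in messages:
        previous = last_seen.get(conversation_id)
        if previous is None or timestamp > previous:
            last_seen[conversation_id] = timestamp
    return {cid for cid, ts in last_seen.items() if now - ts >= idle}


messages = [
    ("order-1", datetime(2017, 1, 1)),
    ("order-1", datetime(2017, 1, 5)),   # last activity long ago
    ("order-2", datetime(2017, 1, 2)),
    ("order-2", datetime(2017, 5, 20)),  # recent activity keeps the whole conversation
]
print(expired_conversations(messages, now=datetime(2017, 6, 1)))
```

The point of keying on the conversation is that an old message is kept as long as anything in the same conversation is still active, so a message and its follow-ups are never split by the cleanup.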