[Bug][plugins/dora] `incident_deploy_connector` is not matching incident with deployments properly

Adibov commented 1 month ago

Search before asking

[X] I had searched in the issues and found no similar issues.

What happened

DORA revised the definition of change failure rate (CFR) in its 2023 report:

For the primary application or service you work on, how long does it generally take to restore service after a change to production or release to users results in degraded service (for example, lead to service impairment or service outage) and subsequently require remediation (for example, require a hotfix, rollback, fix forward, or patch)

The definition now counts only those incidents that were caused by a deployment. For example, if two consecutive deployments are 12 hours apart and an incident happens 6 hours after the first one, we shouldn't count that incident as a change that led to a failure.

However, the DORA plugin in DevLake matches every incident with the latest deployment before it, regardless of how much time has passed between them. This leads to imprecise metrics.

What do you expect to happen

I think the preferred solution is to make the maximum time between a deployment and an incident configurable, so that only incidents inside that window are attributed to a software fault. That is, the code should expose a setting that users can adjust to their needs.

How to reproduce

Create one deployment and then an incident long after it; the incident will still be matched to that deployment.

Anything else

No response

Version

v1.0.1@e061ef2

Are you willing to submit PR?

[X] Yes I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct

dosubot[bot] commented 1 month ago

Hey @Adibov! I'm here to help you with any bugs, questions, or contributions you have. Let's work together to solve this issue.

I found a similar closed issue that might be relevant to your concern: [[Feature][DORA] Update the logic for metric lead time for changes](https://github.com/apache/incubator-devlake/issues/3516). This issue was closed by a pull request that updated the logic for change lead time ^[1].

Regarding your specific issue, the current logic in the DORA plugin for matching incidents with deployments is based on timestamps. Incidents are related to the last deployment on the project with a timestamp that is before the incident's timestamp. This logic does not currently consider the time length between deployments and incidents ^[2].

There is no existing configuration option in the DORA plugin for setting time constraints between deployments and incidents ^[2]. However, you mentioned that you are willing to submit a PR, which could be a great opportunity to introduce such a feature. You might consider adding a configurable time constraint to the ConnectIncidentToDeployment function in incident_deploy_connector.go, which is responsible for connecting incidents to deployments ^[3].

To continue talking to Dosu, mention @dosu.

Shaktty commented 4 weeks ago

I don't agree here. An incident can take time to appear or manifest itself, but still be due to a deployment of a buggy artifact with defect leakage. Also, not only full downtime should be identified as a faulty deployment, but also any incident that reports a degraded user experience.

Adibov commented 3 weeks ago

@Shaktty I'm on the same page with you: software defects must include defect leakage, not only full downtime. That is why I proposed a configurable window, so that only incidents falling inside it are considered. For example, suppose we deploy a manifest at 6 PM and its availability drops to 90% at 6 AM the next morning. In that case, we shouldn't count it as a deployment-caused incident, since it is probably due to an infrastructure defect and is not software-related.

Another example: when the whole Kubernetes cluster goes down, availability drops for every service, and if we submit an incident via webhook, it will be counted as a software incident even though it isn't.
