Open Adibov opened 1 month ago
Hey @Adibov! I'm here to help you with any bugs, questions, or contributions you have. Let's work together to solve this issue.
I found a similar closed issue that might be relevant to your concern: [[Feature][DORA] Update the logic for metric lead time for changes](https://github.com/apache/incubator-devlake/issues/3516). This issue was closed by a pull request that updated the logic for change lead time [1].
Regarding your specific issue, the current logic in the DORA plugin for matching incidents with deployments is based on timestamps. Incidents are related to the last deployment on the project with a timestamp that is before the incident's timestamp. This logic does not currently consider the time length between deployments and incidents [2].
There is no existing configuration option in the DORA plugin for setting time constraints between deployments and incidents [2]. However, you mentioned that you are willing to submit a PR, which could be a great opportunity to introduce such a feature. You might consider adding a configurable time constraint to the `ConnectIncidentToDeployment` function in `incident_deploy_connector.go`, which is responsible for connecting incidents to deployments [3].
I don't agree here. An incident can take time to appear or manifest itself, but still be due to a deployment of a buggy artifact with defect leakage. Also, not only full downtime should be identified as a faulty deployment, but also any incident that reports a degraded user experience.
@Shaktty I'm on the same page with you, software defects must also include defect leakage and not only full downtimes. That is why I proposed a configurable window to only consider those incidents lying in that window. For example, suppose we deploy a manifest at 6 PM and its availability drops to 90% at 6 AM the next morning. In that case, we shouldn't consider it an incident since it's probably because of an infrastructure defect and is not software-related.
Another possible example: when the whole Kubernetes cluster goes down, all availabilities drop, and if we submit an incident via webhook, it will be counted as a software incident even though it is not.
What happened
DORA has introduced a patch to the definition of CFR in its 2023 report:
The definition now counts a downtime as an incident only if it was caused by a deployment. For example, if we have two consecutive deployments 12 hours apart and an incident happens 6 hours after the first deployment, we shouldn't count the first deployment as a change that led to a failure because of that incident.
However, the DORA plugin in DevLake matches every incident with the latest deployment before it, regardless of how much time separates them. This leads to imprecise metrics.
What do you expect to happen
I think a preferred solution is to make the maximum time between a deployment and an incident configurable, so that only incidents within that window are treated as software faults. That is, the code would expose a setting that users can adjust to their needs.
How to reproduce
Create one deployment and an incident long after it. The incident will still be matched to that deployment.
Anything else
No response
Version
v1.0.1@e061ef2
Are you willing to submit PR?