dotnet / arcade-services

Arcade Engineering Services
MIT License
61 stars 74 forks source link

Examine the reason we are missing some Attempts records in the `engineeringdata` Kusto DB. #3129

Closed andriipatsula closed 11 months ago

andriipatsula commented 11 months ago

Example:

TimelineRecords
| where not(Path has '.') and Path has '038'
| where BuildId in (2315074, 2315077)
| order by BuildId
Produces next result: BuildId Name Result Attempt
2315077 SignalR Publishing succeeded 2
2315074 SignalR Publishing succeeded 2
2315074 SignalR Publishing failed 1

where for build 2315077 we have single records with the Attempt column equal 2 and for 2315074 - two records.

Both builds 2315077 and 2315074 are identical: the same number of stages and the same failure.

To get all similar examples for the Release pipeline:

TimelineBuilds
| where Repository == "dotnet-release" and SourceBranch == 'main' and DefinitionId in (984)
| join kind=inner(
    TimelineRecords
    | where not(Path has '.')
    | summarize CountAttempt = count(), MinAttempt = min(Attempt), MaxAttempt = max(Attempt) by Name, BuildId
) on BuildId
| where MaxAttempt > 1 and MinAttempt > 1
//| where CountAttempt == 1 and MaxAttempt != 1
| distinct QueueTime, BuildId, Name, CountAttempt, MinAttempt, MaxAttempt
| order by QueueTime, BuildId
andriipatsula commented 11 months ago

I created synthetic test:

At first, the idea was that the reason we have different records is because of the way the customer re-ran the failed stage/job. I tried both options "Re-run failed jobs" & "Re-run all jobs", and was not able to simulate the problem.

And the only reasonable explanation is that when the build failed, the TelemetryService received a notification through the webhook that the build finished and started processing the results (calling Azure DevOps Timeline API). It depends on when the customer restarts the failing stage: immediately or after some amount of time; during the time the Telemetry Service is processing the build or after.

I encountered a situation where the Telemetry Service wrote 3 records in engineeringdata.

andriipatsula commented 11 months ago

Example of calling the Telemetry API over the inProgress stage (after re-run failed job button is pressed)

            "previousAttempts": [
                {
                    **"attempt": 1,**
                    "timelineId": "previousAttempts_timelineid",
                    "recordId": "previousAttempts_recordid"
                }
            ],
            "id": "***********************************",
            "parentId": null,
            "type": "Stage",
            "name": "Some stage name",
            "startTime": null,
            "finishTime": null,
            "currentOperation": null,
            "percentComplete": null,
            **"state": "inProgress",**
            "result": null,
            "resultCode": null,
            "changeId": 67,
            "lastModified": "0001-01-01T00:00:00",
            "workerName": null,
            "order": 2,
            "details": null,
            "errorCount": 0,
            "warningCount": 0,
            "url": null,
            "log": null,
            "task": null,
            **"attempt": 2,**
            "identifier": "Some stage name"
        }

There is no information available for previousAttempts_timelineid and previousAttempts_recordid IDs.

andriipatsula commented 11 months ago

I double-checked that the data in cloudmine looks similar:

BuildTimelineRecord
| where OrganizationName in ('dnceng', 'dnceng-public') and BuildId in (2315074, 2315077)
| where Type == "Stage"
| where RecordName has "SignalR"
| summarize CountAttempt = count(), MinAttempt = min(Attempt), MaxAttempt = max(Attempt) by RecordName, BuildId
RecordName BuildId CountAttempt MinAttempt MaxAttempt
SignalR Publishing 2315074 1 2 2
SignalR Publishing 2315077 1 2 2

But only one record per build..

andriipatsula commented 11 months ago

And I am able to find build with two (and more) records in the BuildTimelineRecord table

RecordName BuildId CountAttempt MinAttempt MaxAttempt
SignalR Publishing 2,286,846 2 1 2
andriipatsula commented 11 months ago

So, the reason for that is that Timeline api returns the latest attempt only and you can request previous attempts by having additional API call with the timelineId parameter specified (see doc for details). Cloudmine has only latest attempt and in most cases it's impossible to get data about previous attempts (see https://stackoverflow.microsoft.com/questions/372617 as example)