Azure / azure-sdk-tools

Tools repository leveraged by the Azure SDK team.
MIT License
114 stars 180 forks source link

[LintDiff][PR Workflow] Handle cases where check result is too large to be processed by our infrastructure #7457

Open konrad-jamrozik opened 11 months ago

konrad-jamrozik commented 11 months ago

TLDR:

Dumping here some of my correspondences to provide context:

From email thread with subject LINTDIFF big diff issue: Asking for help & info about issues with LintDiff runs that produce a large diff

Email 1 from me:

Hey Roopesh! Some LintDiff runs produce large amount of data (and run for 25+ minutes). For example see this log which is for this check for PR 26992 Or this one, for Staging LintDiff, also for PR 26992.

Unfortunately, due to our infrastructure drawbacks we cannot process such runs fully, and we have to discard such results, without ever showing them on GitHub. This is because we use MongoDB database to communicate from ADO to GitHub, and the massive amount of data is above the 17 MB limit.

Ask to you from us: While we will be working on figuring out how to make such be fully supported on GitHub, can you please have a look if this is expected behavior? Maybe some of the LintDiff rules are triggering too often? Would possibly truncating the results be an option here?

Email 2:

Hey, some additional info on what caused LintDiff to run for so long and produce so much data.

Looking at the affected PR of 26992, it modified files in 12 different API versions. As a result, LintDiff ran 24 times: for each of the API version, once “before changes” and once “after changes”. LintDiff runs as a plugin to AutoRest, and AutoRest decides which files are in scope by looking at the API version tag in README. This resulted in large amount of files being ingested as input to these AutoRest/LintDiff runs, resulting in the huge log. The tooling doesn’t care about which files from given API versions were modified exactly: it runs twice on all files in given API version, even if only 1 file changed in it.

The question is how we should handle such cases. There doesn’t seem to be much utility in producing this amount of errors, they won’t be too actionable to the PR author. LintDiff appears to be designed to work with PRs that modify no more than few API versions at a time.

Reply by Roopesh:

We need to inspect what sort of changes are being made across multiple API versions given any contract change to previous API versions would result in a breaking change. This is not representative of a typical scenario for which PRs are raised.

How frequently are you seeing this behavior? Also is there an impact beyond just the checks taking longer for that one PR? Or is it slowing checks for other PRs down as well - noisy neighbor?

Email 3 from me:

Re frequency:

I think we saw it, like, once 😃 in just one PR.

With the hotfix from today now the impact on the rest of the system will be primarily delay: the “automated merging requirements met” check won’t conclude until all dependencies conclude, including LintDiff, which may run for 25+ minutes in such cases. Also, the GitHub LintDiff check view won’t reflect the information: if the log is too large, more than 17 MB, now we just discard it and never show the update on GitHub side (can be still viewed in ADO though).

Re:

We need to inspect what sort of changes are being made across multiple API versions given any contract change to previous API versions would result in a breaking change. This is not representative of a typical scenario for which PRs are raised.

The changes have been approved by Mike K. because (info from private Teams group chat):

If the REST API was incorrect -- it does not match the behavior of the service -- then we allow an existing API version (preview or GA) to be "corrected". And that might involve correcting many GA versions. Notes from the approval:

The property does not exist in the OpenAPI

Just correcting the REST API definition.

Approvals like these happen once every few months.

Technical details

The affected PR:

Modified files in 12 API versions. As a result, the LintDiff check launched AutoRest with https://github.com/Azure/azure-openapi-validator extension 24 times: twice (before and after changes) for each API version, resulting in gigantic diff. The tool ran for over 27 minutes and produced log 402 MB in size. The same problem happened for staging LintDiff.

This is the produced 402 MB log: https://dev.azure.com/azure-sdk/590cfd2a-581c-4dcb-a12e-6568ce786175/_apis/build/builds/3336876/logs/20

Hotfix applied

We put a try/catch block over putting things in database that continues on failure:

The failure happened because the task result data was so big, it crossed the 17 MB threshold, as explained in this Stack Overflow answer.

How we found the root-cause

I queried pipeline-bot logs and observed the RangeError [ERR_OUT_OF_RANGE] in column having the console.out logs. Then I looked for relevant logs and found out we put too big of a document, which was crashing our pipeline-bot instances before we added the hotfix try/catch. This log pointed out to the name of the LintDiff check.

Example occurrence of the ERR_OUT_OF_RANGE in logs,

Chart showing when the problem started, due to large size of document to be put to db:

image

konrad-jamrozik commented 8 months ago

@rkmanda asked in this post to increase timeout from 60 to 120 minutes.

Relevant docs:

PR:

konrad-jamrozik commented 8 months ago

More cases:

konrad-jamrozik commented 8 months ago

More occurrences reported by Zhenuha Hu:



konrad-jamrozik commented 8 months ago

At least one case of timing out after 3h reported by Vanessa Arndorfer on Teams here:

And another one timed out after 3h:

More successful-but-long-running occurrences reported by Suhas Rao:

konrad-jamrozik commented 8 months ago

Pull Request 536820: Updated LintDiff.yml: increase timeout from 180 min to 300 min

konrad-jamrozik commented 7 months ago

A case where LintDiff timed out after 5 hours:

Note this is similar to the API spec as explained here:

The logs say:

2024-03-20T02:08:33.1176420Z Processing AutoRest config README: specification/machinelearningservices/resource-manager/readme.md. beforeOrAfter: after.
2024-03-20T02:08:33.1179602Z ENTER definition momentOfTruth.executeAutoRestWithLintDiff. autoRestConfigReadmePath: /mnt/vss/_work/1/azure-rest-api-specs/specification/machinelearningservices/resource-manager/readme.md, tag: package-preview-2024-04, beforeOrAfter: after, openapiType: undefined, rpaasLint: false
2024-03-20T02:08:33.1248713Z Executing AutoRest with LintDiff: node /mnt/vss/_work/_tasks/AzureApiValidation_5654d05d-82c1-48da-ad8f-161b817f6d41/0.0.90/private/azure-swagger-validation/azureSwaggerValidation/node_modules/autorest/dist/app.js --v3 --spectral --azure-validator --semantic-validator=false --model-validator=false --message-format=json --openapi-type=arm --openapi-subtype=arm  --use=@microsoft.azure/openapi-validator@2.2.0 --tag=package-preview-2024-04 /mnt/vss/_work/1/azure-rest-api-specs/specification/machinelearningservices/resource-manager/readme.md
2024-03-20T07:03:12.5332687Z Execution of AutoRest with LintDiff done. Error is not null: false, stdout contains AutoRest 'error': false, stdout contains AutoRest 'fatal': false, stderr contains AutoRest 'error': false, stderr contains AutoRest 'fatal': false
2024-03-20T07:03:12.5337424Z RETURN definition momentOfTruth.executeAutoRestWithLintDiff.

(...)

2024-03-20T07:03:13.2762900Z Processing AutoRest config README: specification/machinelearningservices/resource-manager/readme.md. beforeOrAfter: before.
2024-03-20T07:03:13.2763769Z ENTER definition momentOfTruth.executeAutoRestWithLintDiff. autoRestConfigReadmePath: /mnt/vss/_work/1/lint-c93b354fd9c14905bb574a8834c4d69b/specification/machinelearningservices/resource-manager/readme.md, tag: package-preview-2024-04, beforeOrAfter: before, openapiType: undefined, rpaasLint: false
2024-03-20T07:03:13.2765199Z Executing AutoRest with LintDiff: node /mnt/vss/_work/_tasks/AzureApiValidation_5654d05d-82c1-48da-ad8f-161b817f6d41/0.0.90/private/azure-swagger-validation/azureSwaggerValidation/node_modules/autorest/dist/app.js --v3 --spectral --azure-validator --semantic-validator=false --model-validator=false --message-format=json --openapi-type=arm --openapi-subtype=arm  --use=@microsoft.azure/openapi-validator@2.2.0 --tag=package-preview-2024-04 /mnt/vss/_work/1/lint-c93b354fd9c14905bb574a8834c4d69b/specification/machinelearningservices/resource-manager/readme.md
2024-03-20T07:06:54.4645508Z ##[error]The Operation will be canceled. The next steps may not contain expected logs.
2024-03-20T07:06:59.4796537Z ##[error]The operation was canceled.
2024-03-20T07:06:59.4800092Z ##[section]Finishing: LintDiff

So from the mandatory 2 LintDiff runs, the first one, after, ran for about 4 hours and 56 minutes.

PR:

Update 3/21/2024:

After the increased timeout LintDiff finished after 10h 54min:

konrad-jamrozik commented 7 months ago

Pull Request 541101: Updated LintRPaaS.yml: set timeout to 1440 minutes (24h)