Open konrad-jamrozik opened 11 months ago
@rkmanda asked in this post to increase timeout from 60 to 120 minutes.
Relevant docs:
PR:
More cases:
More occurrences reported by Zhenuha Hu:
At least one case of timing out after 3h reported by Vanessa Arndorfer on Teams here:
And another one timed out after 3h:
More successful-but-long-running occurrences reported by Suhas Rao:
Pull Request 536820: Updated LintDiff.yml: increase timeout from 180 min to 300 min
A case where LintDiff timed out after 5 hours:
Note this is similar to the API spec as explained here:
/azure-rest-api-specs/specification/machinelearningservices/resource-manager/readme.md
--tag=package-preview-2024-04
versus the previous --tag=package-2024-04
- input-file:
- Microsoft.MachineLearningServices/preview/2024-04-01-preview/machineLearningServices.json
- Microsoft.MachineLearningServices/preview/2024-04-01-preview/mfe.json
- Microsoft.MachineLearningServices/preview/2024-04-01-preview/registries.json
- Microsoft.MachineLearningServices/preview/2024-04-01-preview/workspaceFeatures.json
- Microsoft.MachineLearningServices/preview/2024-04-01-preview/workspaceRP.json
The input-files listed above are located in this directory.
The logs say:
2024-03-20T02:08:33.1176420Z Processing AutoRest config README: specification/machinelearningservices/resource-manager/readme.md. beforeOrAfter: after.
2024-03-20T02:08:33.1179602Z ENTER definition momentOfTruth.executeAutoRestWithLintDiff. autoRestConfigReadmePath: /mnt/vss/_work/1/azure-rest-api-specs/specification/machinelearningservices/resource-manager/readme.md, tag: package-preview-2024-04, beforeOrAfter: after, openapiType: undefined, rpaasLint: false
2024-03-20T02:08:33.1248713Z Executing AutoRest with LintDiff: node /mnt/vss/_work/_tasks/AzureApiValidation_5654d05d-82c1-48da-ad8f-161b817f6d41/0.0.90/private/azure-swagger-validation/azureSwaggerValidation/node_modules/autorest/dist/app.js --v3 --spectral --azure-validator --semantic-validator=false --model-validator=false --message-format=json --openapi-type=arm --openapi-subtype=arm --use=@microsoft.azure/openapi-validator@2.2.0 --tag=package-preview-2024-04 /mnt/vss/_work/1/azure-rest-api-specs/specification/machinelearningservices/resource-manager/readme.md
2024-03-20T07:03:12.5332687Z Execution of AutoRest with LintDiff done. Error is not null: false, stdout contains AutoRest 'error': false, stdout contains AutoRest 'fatal': false, stderr contains AutoRest 'error': false, stderr contains AutoRest 'fatal': false
2024-03-20T07:03:12.5337424Z RETURN definition momentOfTruth.executeAutoRestWithLintDiff.
(...)
2024-03-20T07:03:13.2762900Z Processing AutoRest config README: specification/machinelearningservices/resource-manager/readme.md. beforeOrAfter: before.
2024-03-20T07:03:13.2763769Z ENTER definition momentOfTruth.executeAutoRestWithLintDiff. autoRestConfigReadmePath: /mnt/vss/_work/1/lint-c93b354fd9c14905bb574a8834c4d69b/specification/machinelearningservices/resource-manager/readme.md, tag: package-preview-2024-04, beforeOrAfter: before, openapiType: undefined, rpaasLint: false
2024-03-20T07:03:13.2765199Z Executing AutoRest with LintDiff: node /mnt/vss/_work/_tasks/AzureApiValidation_5654d05d-82c1-48da-ad8f-161b817f6d41/0.0.90/private/azure-swagger-validation/azureSwaggerValidation/node_modules/autorest/dist/app.js --v3 --spectral --azure-validator --semantic-validator=false --model-validator=false --message-format=json --openapi-type=arm --openapi-subtype=arm --use=@microsoft.azure/openapi-validator@2.2.0 --tag=package-preview-2024-04 /mnt/vss/_work/1/lint-c93b354fd9c14905bb574a8834c4d69b/specification/machinelearningservices/resource-manager/readme.md
2024-03-20T07:06:54.4645508Z ##[error]The Operation will be canceled. The next steps may not contain expected logs.
2024-03-20T07:06:59.4796537Z ##[error]The operation was canceled.
2024-03-20T07:06:59.4800092Z ##[section]Finishing: LintDiff
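So of the two mandatory LintDiff runs, the first one, after, ran for about 4 hours and 55 minutes (02:08:33 to 07:03:12 in the logs above). The second one, before, was canceled less than 4 minutes after it started.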
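For reference, the duration can be recomputed from the two log timestamps around the after run. This is only a small sketch: the helper name `parse_log_timestamp` is made up for illustration, and it assumes the timestamps always carry a fractional part and a trailing `Z`, as they do in the excerpt above.

```python
from datetime import datetime


def parse_log_timestamp(ts: str) -> datetime:
    """Parse a pipeline log timestamp such as 2024-03-20T02:08:33.1248713Z.

    The logs carry 7 fractional digits, but strptime's %f accepts at most 6,
    so the fraction is truncated to microseconds before parsing.
    """
    head, frac = ts.rstrip("Z").split(".")
    return datetime.strptime(f"{head}.{frac[:6]}", "%Y-%m-%dT%H:%M:%S.%f")


# "Executing AutoRest with LintDiff" line of the after run
start = parse_log_timestamp("2024-03-20T02:08:33.1248713Z")
# "Execution of AutoRest with LintDiff done" line of the after run
end = parse_log_timestamp("2024-03-20T07:03:12.5332687Z")

elapsed = end - start
hours, remainder = divmod(int(elapsed.total_seconds()), 3600)
minutes, seconds = divmod(remainder, 60)
print(f"after run took {hours}h {minutes}m {seconds}s")
```

With the 300 minute timeout, a run of this length leaves only a few minutes for the second run, which matches the cancellation seen in the log.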
PR:
Update 3/21/2024:
After the increased timeout LintDiff finished after 10h 54min:
Pull Request 541101: Updated LintRPaaS.yml: set timeout to 1440 minutes (24h)
TLDR:
Dumping here some of my correspondences to provide context:
From the email thread with subject "LINTDIFF big diff issue: Asking for help & info about issues with LintDiff runs that produce a large diff":
Email 1 from me:
Email 2:
Reply by Roopesh:
Email 3 from me:
Re:
The changes have been approved by Mike K. because (info from private Teams group chat):
Technical details
The affected PR:
Modified files in 12 API versions. As a result, the LintDiff check launched AutoRest with the https://github.com/Azure/azure-openapi-validator extension 24 times: twice (before and after changes) for each API version, resulting in a gigantic diff. The tool ran for over 27 minutes and produced a log 402 MB in size. The same problem happened for staging LintDiff.
This is the produced 402 MB log: https://dev.azure.com/azure-sdk/590cfd2a-581c-4dcb-a12e-6568ce786175/_apis/build/builds/3336876/logs/20
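To make the run count concrete, here is a rough sketch of how the runs multiply. The tag names are placeholders I made up; only the count of 12 API versions and the before/after pair come from the description above.

```python
from itertools import product

# Hypothetical tag names: the only fact taken from the PR is that
# 12 API versions were modified.
api_version_tags = [f"package-version-{i:02d}" for i in range(1, 13)]

# LintDiff runs AutoRest once on the state before the change and once after it.
phases = ("before", "after")

# One AutoRest invocation per (tag, phase) pair.
runs = list(product(api_version_tags, phases))
print(f"{len(api_version_tags)} API versions x {len(phases)} phases = {len(runs)} AutoRest runs")
```

Each of those runs writes its own validator output, so both the total run time and the log size grow with the number of API versions touched.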
Hotfix applied
We put a try/catch block around the code that puts things in the database, so that it continues on failure:
The failure happened because the task result data was so big that it crossed the 17 MB threshold, as explained in this Stack Overflow answer.
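The shape of the hotfix can be sketched as follows. This is not the actual pipeline-bot code: it is written in Python with try/except standing in for try/catch, and `put_task_result`, `db_put`, and the explicit size check are assumptions made for illustration. The limit constant assumes that "17 MB" means 17 MiB.

```python
import json
import logging

logger = logging.getLogger("pipeline-bot-sketch")

# Assumption: the 17 MB threshold mentioned above, taken as 17 MiB.
MAX_DOCUMENT_BYTES = 17 * 1024 * 1024


class DocumentTooLargeError(Exception):
    """Raised when a serialized task result exceeds the size threshold."""


def put_task_result(db_put, document: dict) -> bool:
    """Try to store a task result; log and continue on any failure.

    db_put is a hypothetical callable that writes bytes to the database.
    Returns True if the document was stored, False otherwise.
    """
    try:
        payload = json.dumps(document).encode("utf-8")
        if len(payload) > MAX_DOCUMENT_BYTES:
            raise DocumentTooLargeError(
                f"task result is {len(payload)} bytes, limit is {MAX_DOCUMENT_BYTES}"
            )
        db_put(payload)
        return True
    except Exception as err:
        # Swallow the error so one oversized result cannot crash the process.
        logger.error("Failed to store task result: %s", err)
        return False


# Usage with an in-memory stand-in for the database.
stored = []
small_ok = put_task_result(stored.append, {"check": "LintDiff", "log": "short"})
big_ok = put_task_result(stored.append, {"check": "LintDiff", "log": "x" * (MAX_DOCUMENT_BYTES + 1)})
print(small_ok, big_ok, len(stored))
```

Catching the failure keeps the service alive but drops the oversized result, so it is a stopgap and does not address the size of the LintDiff output itself.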
How we found the root-cause
I queried pipeline-bot logs and observed the RangeError [ERR_OUT_OF_RANGE] in the column holding the console.out logs. Then I looked for relevant logs and found that we were putting too big a document, which was crashing our pipeline-bot instances before we added the hotfix try/catch. This log pointed to the name of the LintDiff check.
Example occurrence of the ERR_OUT_OF_RANGE in logs:
Chart showing when the problem started, due to large size of document to be put to db: