dotnet / dnceng

.NET Engineering Services
MIT License
23 stars 18 forks source link

Cancel all Helix workitems when pipeline run is cancelled #3350

Open chcosta opened 1 month ago

chcosta commented 1 month ago

From partner team

Are Helix workitems supposed to cancel automatically when a pipeline run is cancelled (either manually or because a new commit was pushed)?
For example, I just cancelled this pipeline run: https://dev.azure.com/dnceng-public/public/_build/results?buildId=734381&view=results
However, I still see the submitted workitems in "Running" state under https://helix.dot.net/api/jobs/53ada716-a01b-4981-9913-d493bc7a1f5d/workitems?api-version=2019-06-17

[11:20 AM] Michael Stuckey I can say that the support helix currently has for cancellation has some blind spots that are driven by technical details of the backing data store. For example, Helix depends heavily on Service Bus for managing work queues, but has no real access to any individual workitem until it pops out of the queue. What we have now represents a solid "reasonable first version" of the feature.

[11:21 AM] Michael Stuckey We are aware of the CI cost, and it does periodically get discussed and reviewed when evaluating where dev time will go.

[11:26 AM] Jakob Botsch Nielsen Thanks! The main productivity burden is around the long wait on the queues sometimes... And it caught me by surprise that when I cancelled a pipeline run (to try to reduce said burden on the queue) that didn't actually help on that

Release Note Category

akoeplinger commented 1 month ago

We already try to cancel Helix jobs when a pipeline is cancelled, AzDO is supposed to give tasks about 30seconds to fail gracefully and we use that to cancel jobs: https://github.com/dotnet/arcade/issues/7113

That said, this clearly doesn't work well enough since it doesn't apply to job timeout cancellations.

A simpler solution might be to add a check into the Helix runner python script that checks whether the AzDO job (if available) is still running before running the work item script. The downside is that it means one more AzDO API query per workitem.

jakobbotsch commented 1 month ago

@akoeplinger I do see some log messages of the cancellation being processed in the log of the pipeline I linked above:

2024-07-09T12:45:59.9164021Z   Determining projects to restore...
2024-07-09T12:46:02.0865614Z   Restored D:\a\_work\1\s\src\coreclr\scripts\superpmi-diffs.proj (in 1.73 sec).
2024-07-09T12:46:02.2422983Z   Determining projects to restore...
2024-07-09T12:46:02.9508923Z   Restored D:\a\_work\1\s\src\coreclr\scripts\superpmi-diffs.proj (in 151 ms).
2024-07-09T12:46:03.1447733Z   Sending Job to Windows.10.Amd64.Open...
2024-07-09T12:46:21.6875950Z   Sent Helix Job; see work items at https://helix.dot.net/api/jobs/53ada716-a01b-4981-9913-d493bc7a1f5d/workitems?api-version=2019-06-17
2024-07-09T12:46:22.6972906Z   Waiting for completion of job 53ada716-a01b-4981-9913-d493bc7a1f5d on Windows.10.Amd64.Open (Details: https://helix.dot.net/api/jobs/53ada716-a01b-4981-9913-d493bc7a1f5d/details?api-version=2019-06-17 )
2024-07-09T13:54:54.5154049Z ##[error]The Operation will be canceled. The next steps may not contain expected logs.
2024-07-09T13:54:54.5223921Z Attempting to cancel the build...
2024-07-09T13:54:54.5298971Z D:\a\_work\1\s\.packages\microsoft.dotnet.helix.sdk\9.0.0-beta.24327.1\tools\Microsoft.DotNet.Helix.Sdk.MultiQueue.targets(68,5): warning : Build task was cancelled while waiting on job '53ada716-a01b-4981-9913-d493bc7a1f5d'.  Attempting to cancel this job in Helix... [D:\a\_work\1\s\src\coreclr\scripts\superpmi-diffs.proj]
2024-07-09T13:54:54.5353605Z ##[warning].packages\microsoft.dotnet.helix.sdk\9.0.0-beta.24327.1\tools\Microsoft.DotNet.Helix.Sdk.MultiQueue.targets(68,5): warning : (NETCORE_ENGINEERING_TELEMETRY=Build) Build task was cancelled while waiting on job '53ada716-a01b-4981-9913-d493bc7a1f5d'.  Attempting to cancel this job in Helix...
2024-07-09T13:54:54.6996742Z ##[error]The operation was canceled.
2024-07-09T13:54:54.7001255Z ##[section]Finishing: Send job to Helix (Windows)

However, it doesn't seem like this actually worked to cancel the Helix work items, and also I don't see some of the log messages that I would expect to see from the code: https://github.com/dotnet/arcade/blob/cfdb403bb67e5a4342dd056192a0016283827031/src/Microsoft.DotNet.Helix/Sdk/WaitForHelixJobCompletion.cs#L87-L97

akoeplinger commented 1 month ago

It does say Attempting to cancel this job in Helix... but yeah I've seen it not working. I also think this will never really work for job timeouts so a check on either the Helix server side or when starting to run the workitem is more resilient.

akoeplinger commented 1 month ago

A simpler solution might be to add a check into the Helix runner python script that checks whether the AzDO job (if available) is still running before running the work item script. The downside is that it means one more AzDO API query per workitem.

We could presumably also add an API to helix which proxies this and does some caching if the AzDO API call from the workitem is not viable.

akoeplinger commented 1 month ago

Ah and since I just re-read the initial request, that one mentions cancelling running workitems which I think is much harder given the current system design than cancelling queued (or just about to start) workitems.

jakobbotsch commented 1 month ago

It does say Attempting to cancel this job in Helix... but yeah I've seen it not working.

It does, but after cancellation it logs either "Successfully cancelled job '{jobName}'" or "Job '{jobName}' was already cancelled.", but neither of those log messages are there. I wonder if Azure killed it before it finished, or if CancelAsync just isn't working (either because it is too slow or for some other reason...).