NuGet / NuGetGallery

NuGet Gallery is a package repository that powers https://www.nuget.org. Use this repo for reporting NuGet.org issues.
https://www.nuget.org/
Apache License 2.0
1.54k stars 643 forks source link

Orchestrator validation sets can be stuck when two processor copies happen in parallel #7590

Closed joelverhagen closed 4 years ago

joelverhagen commented 5 years ago

Bug Hit Count

Environment Count
DEV 2
INT 1
PROD 7

Explanation

A validation set (dccf5bef-f050-4655-a3df-148ae192605e) became stuck when two parallel threads we processing the completion of the ScanAndSign processor. From the orchestrator's perspective, this was the sequence of events:

Thread Event
1 Received message
1 Determined ScanAndSign is complete
1 Start the copy of the ScanAndSign .nupkg to the validation-set location
2 Received message
2 Determined ScanAndSign is complete
2 Start the copy of the ScanAndSign .nupkg to the validation-set location
1 Clean up the ScanAndSign .nupkg
1 Start PackageSigningValidator2
1 Complete message
2 Copy fails with source missing

When the copy on thread 2 fails, it has already set the destination blob to zero bytes.

The copy process uses an empty etag condition so it has no problem clobbering the validation set package with a "duplicate" copy operation. The line of code is here: https://github.com/NuGet/NuGet.Jobs/blob/c3912a04df42ab4146b1cd8069f3baf8a3793d9b/src/NuGet.Services.Validation.Orchestrator/ValidationPackageFileService.cs#L226

When the PackageSigningValidator2 wakes up to validate the package, it sees an empty blob and fails.

Mitigation

Revalidating the package fixes the problem every time from my experience.

If this happens a lot, we can do our best to reduce message duplication by turning off queue-back. Note that this will increase validation times.

Fix

I talked to @agr and we came up with three solutions, one which was ruled out.

  1. Add a validation-set package etag so that one of the copies fails with etag mismatch.
  2. Add a lease on the validation-set package.
  3. Don't clean up the ScanAndSign copy and less blob storage cleanup do it's thing. We can't do this one because you can still have a slow processor completion clobber a subsequent blob. For example, processor A is being completed still but the validation-set blob has already been updated with work from processor B.

Detecting the problem

traces
| where timestamp > todatetime('2019-10-09T01:00:00') 
| where message contains "VALIDATION SET ID OF THE PACKAGE"
| extend callGuid = tostring(customDimensions.CallGuid)
| extend instanceName = tostring(customDimensions.InstanceName)
| where instanceName !startswith "Validation.Orchestrator-"
| order by timestamp desc
| project timestamp, callGuid, instanceName, message

In Validation.PackageSigning.ProcessSignature-* logs, you'll see something like this over and over:

Downloaded 0 bytes in 0.0104824 seconds for request...

Examples

Validation Set IDs:

DEV

2019-09-30 - dccf5bef-f050-4655-a3df-148ae192605e
2019-10-09 - d81c7729-8bdb-4cac-b1d9-e20b2979af90

INT

2019-10-10 - 4b1517a4-0f4f-4aef-b16d-3e8bede745b2 - GroupDocs.Viewer 19.9.0-alpha-20191010083705

PROD

2019-10-09 - 60f1974a-11e8-4c95-9eea-74119e80a24b - HZH_Controls 1.0.7
2019-10-09 - 495d3ff0-7912-4f1e-a8a7-4c9ace92b8c2 - amgbuild 0.28.0
2019-10-09 - e963448d-64e4-4b00-a3c7-5f722b118dc5 - PeanutButter.TempDb.Runner 1.2.310
2019-10-09 - 804c07a8-6c5c-4c4d-91b2-b1df70aff4b3 - MassTransit.DocumentDb 5.5.6-develop.2194
2019-10-09 - cf2a2dd4-97aa-4e0c-b27c-4bfcb4f6155b - WebVella.TagHelpers 1.0.24
2019-10-09 - bbb4d00c-28a2-41e3-9686-90f613fc8716 - IDTechSDK 1.2.174.3
2019-10-10 - 355ba0f9-1b0e-4e85-bdaf-27e66c21b9ba - RobotRaconteurNET 0.9.1
joelverhagen commented 4 years ago

I have disabled queue-back because of this problem. The improve validation times are not worth the on-call cost to manually revalidate this many. A code fix should be applied before queue-back is turned back on.

joelverhagen commented 4 years ago

Mitigated with https://github.com/NuGet/NuGetGallery/issues/7629. Will re-open if the problem persists.