Closed joelverhagen closed 4 years ago
I have disabled queue-back because of this problem. The improved validation times are not worth the on-call cost to manually revalidate this many. A code fix should be applied before queue-back is turned back on.
Mitigated with https://github.com/NuGet/NuGetGallery/issues/7629. Will re-open if the problem persists.
Bug Hit Count
Explanation
A validation set (dccf5bef-f050-4655-a3df-148ae192605e) became stuck when two parallel threads were processing the completion of the ScanAndSign processor. From the orchestrator's perspective, this was the sequence of events:
1. Thread 1 sees that ScanAndSign is complete
2. Thread 1 copies the ScanAndSign .nupkg to the validation-set location
3. Thread 2 sees that ScanAndSign is complete
4. Thread 2 starts copying the ScanAndSign .nupkg to the validation-set location
5. Thread 1 deletes the ScanAndSign .nupkg
6. PackageSigningValidator2 is started
When the copy on thread 2 fails, it has already set the destination blob to zero bytes.
The copy process uses an empty etag condition so it has no problem clobbering the validation set package with a "duplicate" copy operation. The line of code is here: https://github.com/NuGet/NuGet.Jobs/blob/c3912a04df42ab4146b1cd8069f3baf8a3793d9b/src/NuGet.Services.Validation.Orchestrator/ValidationPackageFileService.cs#L226
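To make the clobbering concrete, here is a minimal in-memory sketch in Python. It is not the orchestrator's code: `BlobStore`, `copy`, `simulate`, and the blob names are all hypothetical, and the two-step copy (reset the destination to zero bytes, then write the data) is an assumption based on the behavior described above. It contrasts a copy with no etag condition against one conditioned on the destination etag that both threads observed.

```python
import itertools


class PreconditionFailed(Exception):
    """The caller's etag condition did not match the destination blob."""


class SourceMissing(Exception):
    """The source blob was deleted before the copy could read it."""


class BlobStore:
    """Hypothetical in-memory stand-in for blob storage with etags."""

    def __init__(self):
        self._blobs = {}  # name -> (etag, data)
        self._counter = itertools.count(1)

    def put(self, name, data, if_match=None):
        # if_match=None models the "empty etag condition": always overwrite.
        if if_match is not None and self.etag(name) != if_match:
            raise PreconditionFailed(name)
        self._blobs[name] = (f"etag-{next(self._counter)}", data)

    def etag(self, name):
        return self._blobs[name][0] if name in self._blobs else None

    def get(self, name):
        return self._blobs[name][1]

    def exists(self, name):
        return name in self._blobs

    def delete(self, name):
        self._blobs.pop(name, None)


def copy(store, src, dst, if_match=None):
    # Assumed behavior: the destination is reset to zero bytes first,
    # and only then is the source read.
    store.put(dst, b"", if_match=if_match)
    if not store.exists(src):
        raise SourceMissing(src)
    store.put(dst, store.get(src))


def simulate(use_etag):
    src, dst = "scan-and-sign/pkg.nupkg", "validation-set/pkg.nupkg"
    store = BlobStore()
    store.put(src, b"signed package")
    store.put(dst, b"original package")

    # Both threads observe the same destination etag before copying.
    seen = store.etag(dst) if use_etag else None

    # Thread 1 copies, then cleans up the ScanAndSign copy.
    copy(store, src, dst, if_match=seen)
    store.delete(src)

    # Thread 2 attempts the duplicate copy.
    try:
        copy(store, src, dst, if_match=seen)
        outcome = "ok"
    except (SourceMissing, PreconditionFailed) as error:
        outcome = type(error).__name__
    return outcome, store.get(dst)


if __name__ == "__main__":
    print(simulate(use_etag=False))
    print(simulate(use_etag=True))
```

In this sketch, the unconditional run ends with a zero-byte destination after thread 2 fails with `SourceMissing`. The conditioned run rejects thread 2 with `PreconditionFailed` before it touches the destination, so the signed package survives.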
When PackageSigningValidator2 wakes up to validate the package, it sees an empty blob and fails.
Mitigation
In my experience, revalidating the package fixes the problem every time.
If this happens a lot, we can do our best to reduce message duplication by turning off queue-back. Note that this will increase validation times.
Fix
I talked to @agr and we came up with three solutions, one of which was ruled out.
1. Use the validation-set package etag so that one of the copies fails with etag mismatch.
2. [...] validation-set package.
3. Don't clean up the ScanAndSign copy and let blob storage cleanup do its thing. We can't do this one because you can still have a slow processor completion clobber a subsequent blob. For example, processor A is still being completed but the validation-set blob has already been updated with work from processor B.
Detecting the problem
In Validation.PackageSigning.ProcessSignature-* logs, you'll see something like this over and over:
Examples
Validation Set IDs:
DEV
INT
PROD