dotnet/dnceng

Source Index storage authentication failure #2974

Open jeffschwMSFT opened 4 months ago

jeffschwMSFT commented 4 months ago

Build

https://dev.azure.com/dnceng/internal/_build/results?buildId=2462759

Status: 403 (This request is not authorized to perform this operation.)
ErrorCode: AuthorizationFailure

Content:
<?xml version="1.0" encoding="utf-8"?><Error><Code>AuthorizationFailure</Code><Message>This request is not authorized to perform this operation.
RequestId:9c862d9a-501e-0093-2a0e-b2337f000000
Time:2024-05-29T21:26:58.8871758Z</Message></Error>

Headers:
Server: Microsoft-HTTPAPI/2.0
x-ms-request-id: 9c862d9a-501e-0093-2a0e-b2337f000000
x-ms-client-request-id: f924d565-53ef-40a6-9d9a-c7b2ecd56976
x-ms-error-code: AuthorizationFailure
Date: Wed, 29 May 2024 21:26:58 GMT
Content-Length: 246
Content-Type: application/xml

   at Azure.Storage.Blobs.BlockBlobRestClient.UploadAsync(Int64 contentLength, Stream body, Nullable`1 timeout, Byte[] transactionalContentMD5, String blobContentType, String blobContentEncoding, String blobContentLanguage, Byte[] blobContentMD5, String blobCacheControl, IDictionary`2 metadata, String leaseId, String blobContentDisposition, String encryptionKey, String encryptionKeySha256, Nullable`1 encryptionAlgorithm, String encryptionScope, Nullable`1 tier, Nullable`1 ifModifiedSince, Nullable`1 ifUnmodifiedSince, String ifMatch, String ifNoneMatch, String ifTags, String blobTagsString, Nullable`1 immutabilityPolicyExpiry, Nullable`1 immutabilityPolicyMode, Nullable`1 legalHold, Byte[] transactionalContentCrc64, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.Specialized.BlockBlobClient.UploadInternal(Stream content, BlobHttpHeaders blobHttpHeaders, IDictionary`2 metadata, IDictionary`2 tags, BlobRequestConditions conditions, Nullable`1 accessTier, BlobImmutabilityPolicy immutabilityPolicy, Nullable`1 legalHold, IProgress`1 progressHandler, UploadTransferValidationOptions transferValidationOverride, String operationName, Boolean async, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.Specialized.BlockBlobClient.<>c__DisplayClass65_0.<<GetPartitionedUploaderBehaviors>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Azure.Storage.PartitionedUploader`2.UploadInternal(Stream content, Nullable`1 expectedContentLength, TServiceSpecificData args, IProgress`1 progressHandler, Boolean async, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.BlobClient.StagedUploadInternal(Stream content, BlobUploadOptions options, Boolean async, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.BlobClient.UploadAsync(Stream content)
   at UploadIndexStage1.Program.Main(String[] args) in D:\a\_work\1\s\src\UploadIndexStage1\Program.cs:line 130
   at UploadIndexStage1.Program.Main(String[] args) in D:\a\_work\1\s\src\UploadIndexStage1\Program.cs:line 131
   at UploadIndexStage1.Program.<Main>(String[] args)
##[error]Cmd.exe exited with code '57005'.

Build leg reported

No response

Pull Request

No response

Known issue core information

Fill out the known issue JSON section by following the step-by-step documentation on how to create a known issue.

{
  "ErrorMessage": "at UploadIndexStage1.Program.Main",
  "BuildRetry": false,
  "ErrorPattern": "",
  "ExcludeConsoleLog": false
}

@dotnet/dnceng

Release Note Category

Additional information about the issue reported

No response

Known issue validation

Build: :mag_right: https://dev.azure.com/dnceng/internal/_build/results?buildId=2462759
Error message validated: [at UploadIndexStage1.Program.Main]
Result validation: :white_check_mark: Known issue matched with the provided build.
Validation performed at: 5/30/2024 4:07:44 PM UTC

Report

Summary

24-Hour Hit Count: 0
7-Day Hit Count: 0
1-Month Count: 0
riarenas commented 4 months ago

@directhex was working on this.

directhex commented 4 months ago

I'm gonna have to give a big ol' shrug emoji here. According to the Azure activity log, no config changes have been made to the storage account in the last week. Subsequent and prior builds were fine, e.g. https://dev.azure.com/dnceng/internal/_build/results?buildId=2463257&view=logs&j=316d5c15-0c50-544e-8051-e6b14a1ab674&t=657dbd0b-e93c-5e59-4ba8-77889f010efc

Looking at it, it seems the common element is that failed builds take longer than 30 seconds to do the upload. Possibly we're only getting an incredibly short lease, and it's expiring?

https://dev.azure.com/dnceng/internal/_build/results?buildId=2462759&view=logs&j=316d5c15-0c50-544e-8051-e6b14a1ab674&t=657dbd0b-e93c-5e59-4ba8-77889f010efc took 36s
https://dev.azure.com/dnceng/internal/_build/results?buildId=2463071&view=logs&j=316d5c15-0c50-544e-8051-e6b14a1ab674&t=657dbd0b-e93c-5e59-4ba8-77889f010efc took 31s

The "good" builds are in the low twenties of seconds.

lewing commented 3 months ago

Should we change BuildRetry to true for this? If a retry is likely to succeed, it would help lessen the impact.
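
If we flip it, the only change to the known-issue record above is that one flag; a sketch, assuming the other fields stay as-is:

```json
{
  "ErrorMessage": "at UploadIndexStage1.Program.Main",
  "BuildRetry": true,
  "ErrorPattern": "",
  "ExcludeConsoleLog": false
}
```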

directhex commented 3 months ago

IMHO, adding either a retry or a "mark failure as success so the build as a whole succeeds" step is fine; the absolute worst case is stale indices if an index upload fails.

ericstj commented 3 months ago

> Looking at it, it seems the common element is that failed builds take longer than 30 seconds to do the upload.

Sounds like a solid theory. @directhex, maybe we can refresh our access token right before we upload, after we've tarred everything? https://learn.microsoft.com/en-us/cli/azure/authenticate-azure-cli#refresh-tokens

https://github.com/dotnet/source-indexer/blob/4207666f77be68741745420ea6576ca28baf4742/src/UploadIndexStage1/Program.cs#L130C37-L130C48
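
A hedged sketch of that idea (illustrative only: the blob URI, file names, and credential type here are assumptions, not the indexer's actual code):

```csharp
using Azure.Identity;
using Azure.Storage.Blobs;

// Build the tarball first (the slow step), then create the credential and
// client immediately before the upload so the token is as fresh as possible.
string tarballPath = "index.tar.gz"; // produced by the tar step above

var blobClient = new BlobClient(
    new Uri("https://exampleaccount.blob.core.windows.net/stage1/index.tar.gz"),
    new AzureCliCredential()); // fetches a token lazily, at request time

await using FileStream content = File.OpenRead(tarballPath);
await blobClient.UploadAsync(content, overwrite: true);
```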

Alternatively, we split out the step that creates the tarball so that we can log in (or refresh) separately, right before we upload.

Alternatively/additionally, we might also be able to speed up the tar'ing phase by replacing SharpZipLib with the APIs @carlossanlop added, which I would expect to be faster since they use our zlib under the hood, whereas SharpZipLib has a non-vectorized managed implementation.
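
For reference, a minimal sketch using the System.Formats.Tar APIs (available since .NET 7; the paths are illustrative, not the indexer's actual layout):

```csharp
using System.Formats.Tar;
using System.IO.Compression;

// Stream the tar entries through a zlib-backed GZipStream as they are
// written, instead of going through SharpZipLib's managed deflate.
await using FileStream output = File.Create("index.tar.gz");
await using var gzip = new GZipStream(output, CompressionLevel.Fastest);
await TarFile.CreateFromDirectoryAsync(
    sourceDirectoryName: "index-staging",
    destination: gzip,
    includeBaseDirectory: false);
```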

ericstj commented 3 months ago

Noting here that @directhex is working on a fix for this, but I'm unable to assign this issue to him.

directhex commented 3 months ago

This might accidentally be fixed by https://github.com/dotnet/arcade/pull/14912

ericstj commented 2 months ago

Seems like the hit count here is high enough to warrant taking steps to mitigate it, @directhex. What do you suggest we do: add more logging, change how we auth, add retries, something else?

riarenas commented 2 months ago

A shot in the dark here: did we add firewall rules to this storage account's network access? There were a lot of issues with the instructions on how to do that, since they didn't account for all possible IPs, and there's an entire scenario where firewall rules don't work for traffic from within the same region, so that could explain why it seems flaky.

directhex commented 2 months ago

We did add the firewall rules that were recommended at one point, and yes, I think it's feasible that it's a firewall issue.

We don't have functional replacement guidance on network isolation for storage containers yet.

riarenas commented 2 months ago

Would it be worth reverting those changes while we wait for the updated guidance and for the NSP features to light up? Whether or not it helps with reliability, we'd still get some data.

directhex commented 2 months ago

OK, I've flipped the switch to "Public network access: Enabled from all networks" on the storage account in Azure.
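
(For the record, a rough CLI equivalent of that portal switch; the account and resource group names are placeholders:)

```
az storage account update \
  --name <storage-account> \
  --resource-group <resource-group> \
  --public-network-access Enabled \
  --default-action Allow
```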

directhex commented 2 months ago

I guess I can't trust the 24h figure from yesterday, due to the Azure outage.

riarenas commented 2 months ago

Well, we're down to 0 hits in the last 7 days, so it seems to have worked?