Azure / azure-sdk-for-net

This repository is for active development of the Azure SDK for .NET. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/dotnet/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-net.
MIT License

[FormRecognizer] Service is throttling #31914

Open kinelski opened 1 year ago

kinelski commented 1 year ago

## Description

This issue collects multiple failures we've seen in the Form Recognizer test pipeline over the last couple of months. All of them are believed to have the same underlying cause: throttling on the service side. They have all been reported to the Form Recognizer service team and are under investigation.

## Failures

None of the failures described below happens deterministically, and each one affects multiple tests.

### Content is not accessible

Content: { "error": { "code": "InvalidRequest", "message": "Invalid request.", "innererror": { "code": "ContentSourceNotAccessible", "message": "Content is not accessible: Could not retrieve build data within 60 seconds." } } }

- **Request ID reference:** 0a177c1a-3ba9-4086-af67-5401153eab4c
- [Example in the test pipeline](https://dev.azure.com/azure-sdk/internal/_build/results?buildId=1928370&view=logs&j=01d5a77e-6e85-5280-0e9f-af5629bd443f&t=1e514895-521a-57d8-4030-b9d8f4f46b20&l=290)

### Generic error during training
- **API version:** 2.1
- **Thrown when:** polling `CreateCustomFormModelOperation` (the LRO GET request).
- **Frequency:** daily
- **Error message example:**

Azure.RequestFailedException : Invalid model created with ID 94ae7f2b-d7c1-4509-9df1-00f7e802d956 Status: 200 (OK) ErrorCode: 3014

Additional Information: error-0: 3014: Generic error during training.

Content:

- **Request ID reference:** 7fd3ef92-58dd-4115-b724-b5a3dfbcac55
- [Example in the test pipeline](https://dev.azure.com/azure-sdk/internal/_build/results?buildId=1928370&view=logs&j=01d5a77e-6e85-5280-0e9f-af5629bd443f&t=1e514895-521a-57d8-4030-b9d8f4f46b20&l=457)

### Could not access Azure blob storage account
- **API version:** 2.1
- **Thrown when:** calling `StartTraining` in `FormTrainingClient` (the LRO POST request).
- **Frequency:** roughly one day every 1.5 weeks, but it affects multiple v2.1 tests on that day. It is always accompanied by the "Managed Identity credential was rejected by the storage service" error described below.
- **Error message example:**

Azure.RequestFailedException : Could not access Azure blob storage account. Status: 400 (Bad Request) ErrorCode: 2011

Content: {"error":{"code":"2011","message":"Could not access Azure blob storage account."}}

- **Request ID reference:** 68896588-02bc-4b7f-869d-01a052f82c8a
- [Example in the test pipeline](https://dev.azure.com/azure-sdk/internal/_build/results?buildId=1887002&view=logs&j=01d5a77e-6e85-5280-0e9f-af5629bd443f&t=1e514895-521a-57d8-4030-b9d8f4f46b20&l=3870)

### Managed Identity credential was rejected by the storage service
- **API version:** 2.1
- **Thrown when:** polling `CreateCustomFormModelOperation` (the LRO GET request).
- **Frequency:** roughly one day every 1.5 weeks, but it affects multiple v2.1 tests on that day. It is always accompanied by the "Could not access Azure blob storage account" error described above.
- **Error message example:**

Azure.RequestFailedException : Invalid model created with ID 5a1a952e-58a6-4d5c-80db-d6e79696f49b Status: 200 (OK) ErrorCode: 2012

Additional Information: error-0: 2012: Managed Identity credential was rejected by the storage service.

Content:

- **Request ID reference:** 690adcf4-c115-4d80-adc4-b483ebb6921d
- [Example in the test pipeline](https://dev.azure.com/azure-sdk/internal/_build/results?buildId=1887002&view=logs&j=01d5a77e-6e85-5280-0e9f-af5629bd443f&t=1e514895-521a-57d8-4030-b9d8f4f46b20&l=5561)

### Operation exceeded maximum processing time
- **API version:** 2.1
- **Thrown when:** polling `CreateCustomFormModelOperation` (the LRO GET request).
- **Frequency:** usually accompanies the "Could not access Azure blob storage account" and "Managed Identity credential was rejected by the storage service" errors described above, but affects only one or two tests.
- **Error message example:**

Azure.RequestFailedException : Invalid model created with ID c97a11b9-b61f-4077-8d07-4cca37e4a254 Status: 200 (OK) ErrorCode: 3013

Additional Information: error-0: 3013: Operation exceeded maximum processing time.

Content:


- **Request ID reference:** 74723f9b-b239-4bcd-ae86-4a4225353070
- [Example in the test pipeline](https://dev.azure.com/azure-sdk/internal/_build/results?buildId=1887002&view=logs&j=01d5a77e-6e85-5280-0e9f-af5629bd443f&t=1e514895-521a-57d8-4030-b9d8f4f46b20&l=371)
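
The failure signatures collected above can be matched mechanically. Below is a minimal, language-neutral sketch in Python (not code from the SDK test project) that assumes the caller already has the `ErrorCode` string and the raw `Content` body from the exception. The function name `classify_failure` and both lookup tables are hypothetical; only the codes and messages come from this issue.

```python
import json

# Codes and messages copied from the v2.1 failures listed in this issue.
KNOWN_V21_CODES = {
    "2011": "Could not access Azure blob storage account.",
    "2012": "Managed Identity credential was rejected by the storage service.",
    "3013": "Operation exceeded maximum processing time.",
    "3014": "Generic error during training.",
}
# The "Content is not accessible" failure reports its specific code
# under error.innererror.code in the response body.
KNOWN_INNER_CODES = {"ContentSourceNotAccessible"}


def classify_failure(error_code, content=""):
    """Return the matching known code, or None if the failure is not listed."""
    if error_code in KNOWN_V21_CODES:
        return error_code
    try:
        inner = json.loads(content)["error"]["innererror"]["code"]
    except (ValueError, KeyError, TypeError):
        # Empty body, malformed JSON, or no nested error: not a known signature.
        return None
    return inner if inner in KNOWN_INNER_CODES else None
```

A failure that returns `None` here is not one of the signatures tracked in this issue and should be investigated separately.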

## Action items

To prevent the `InvalidRequest` and `3014` errors from breaking the pipeline daily, we are suppressing them with the `IgnoreServiceError` attribute in our test project. The attribute is set at the class level (rather than on individual methods) because these errors can occur in any test that builds a model, which includes most of our tests.

Once the service has fixed this issue on their side, we must remove those attributes from the following classes:
- `DocumentModelAdministrationClientLiveTests`
- `DocumentAnalysisClientLiveTests`
- `DocumentAnalysisSamples`
- `FormRecognizerSamples`
- `FormTrainingClientLiveTests`
- `OperationsLiveTests`
- `RecognizeCustomFormsLiveTests`
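
To illustrate the class-level suppression described above, here is a rough Python sketch. It is not the `IgnoreServiceError` attribute itself, which is a C# attribute whose implementation is not shown in this issue. The sketch assumes that "suppressing" means reporting the test as skipped instead of failed. `ServiceError`, `ignore_service_error`, and `ModelBuildingChecks` are hypothetical names.

```python
import functools
import unittest


class ServiceError(Exception):
    """Hypothetical stand-in for Azure.RequestFailedException."""

    def __init__(self, message, error_code):
        super().__init__(message)
        self.error_code = error_code


def ignore_service_error(*error_codes):
    """Class decorator: skip, rather than fail, tests that hit a listed code."""

    def wrap_method(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            try:
                return method(self, *args, **kwargs)
            except ServiceError as ex:
                if ex.error_code in error_codes:
                    raise unittest.SkipTest(f"Known service issue: {ex}") from ex
                raise  # any other service error still fails the test

        return wrapper

    def decorate(cls):
        # Applied once per class so that every test method is covered.
        for name, attr in list(vars(cls).items()):
            if name.startswith("test") and callable(attr):
                setattr(cls, name, wrap_method(attr))
        return cls

    return decorate


# In a real suite this class would derive from unittest.TestCase.
@ignore_service_error("InvalidRequest", "3014")
class ModelBuildingChecks:
    def test_known_throttling_error(self):
        raise ServiceError("Generic error during training.", "3014")

    def test_unlisted_error(self):
        raise ServiceError("Could not access Azure blob storage account.", "2011")
```

Decorating the class keeps the suppression in one place, which also makes it easy to remove once the service-side fix lands.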

v-xuto commented 1 year ago

@kinelski What is the current progress on this issue?

joseharriaga commented 1 year ago

What is the likelihood that a test that encounters one of these issues would pass if retried?

I've been seeing flaky responses from the text analytics service too. Here's what I did:

  1. Reported them to the service team.
  2. Created the `RetryOnErrorAttribute` (based on some code that Jesse shared with me 😊). It's basically a duplicate of the `RetryAttribute` from NUnit, and the only differences are:
     - Instead of retrying on failed asserts, it retries on an error (such as an exception) combined with a configurable condition.
     - If a test continues to fail with the same pattern after a configurable number of tries, the test is marked as inconclusive.
  3. I put this attribute in the core test framework so other libraries can re-use it.
  4. Created the `RetryOnInternalServerErrorAttribute` for the specific use case of text analytics. Notice how I check for three different known patterns as part of the retry condition.

I wonder if something like this would help here?
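
For readers who have not seen the attribute, the retry behavior described in the list above can be sketched roughly as follows. This is a language-neutral Python illustration, not the `RetryOnErrorAttribute` source. `Inconclusive`, `retry_on_error`, `is_known_service_issue`, and the patterns it checks are all hypothetical; the patterns are borrowed from the Form Recognizer errors in this issue, not from the text analytics ones.

```python
import functools


class Inconclusive(Exception):
    """Hypothetical marker for a result that is neither a pass nor a failure."""


def retry_on_error(should_retry, tries=3):
    """Retry a test when it raises an error for which should_retry is true."""

    def decorator(test):
        @functools.wraps(test)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(tries):
                try:
                    return test(*args, **kwargs)
                except AssertionError:
                    raise  # failed asserts are real failures and are not retried
                except Exception as ex:
                    if not should_retry(ex):
                        raise  # unknown error: fail right away
                    last_error = ex
            # Same known pattern on every try: report inconclusive, not failed.
            raise Inconclusive(
                f"still failing after {tries} tries: {last_error}"
            ) from last_error

        return wrapper

    return decorator


# Hypothetical retry condition built from error texts quoted in this issue.
KNOWN_PATTERNS = ("3014: Generic error during training", "ContentSourceNotAccessible")


def is_known_service_issue(ex):
    return any(pattern in str(ex) for pattern in KNOWN_PATTERNS)
```

A test would opt in with `@retry_on_error(is_known_service_issue, tries=3)`. Errors that do not match the condition still fail on the first try, so new regressions are not hidden.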

github-actions[bot] commented 1 year ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @ctstone @vkurpad.