Azure / apiops

APIOps applies the concepts of GitOps and DevOps to API deployment. By using practices from these two methodologies, APIOps can enable everyone involved in the lifecycle of API design, development, and deployment with self-service and automated tools to ensure the quality of the specifications and APIs that they’re building.
https://azure.github.io/apiops
MIT License
[FEATURE REQ] Set retry/timeout options in configuration #254

Open UniperMaster opened 1 year ago

UniperMaster commented 1 year ago

Release version

4.1.0 Extractor (Latest)

Describe the bug

When running the extractor pipeline, the extractor fails with the error message posted in the next comment.

The APIM instance has around 257 APIs, and across them:

Anywhere from 1 to 50 operations per API

Nearly all of them have policies configured at the operation level. No policy is configured at the API level

For the APIM itself:

SKU is developer

APIM is using vnet integration (internal)

Diagnostics are configured

Products are used

Subscriptions are scoped at product level

Named values are a mix of secrets, keyvault references and plain text values

The service principal has all the necessary permissions to do any API call to the APIM

No application gateway present

No self hosted gateway

A single backend configured

API version used to create the APIM was 2021-04-01-preview (pipeline.log attached)

Expected behavior

The extractor exports all APIs

Actual behavior

Times out on the "Exporting API" step; it does not progress further because the pipeline is terminated.

Reproduction Steps

Run run-extractor.yaml with Extract ALL

UniperMaster commented 1 year ago

Error Message

  Writing API operation policy file /home/vsts/work/1/a/artifacts/apis/tableau-management-portal-v21/operations/signin/policy.xml...

crit: Extractor[0] System.AggregateException: Retry failed after 4 tries. Retry settings can be adjusted in ClientOptions.Retry or by configuring a custom retry policy in ClientOptions.RetryPolicy.
 (The operation was cancelled because it exceeded the configured timeout of 0:01:40. Network timeout can be adjusted in ClientOptions.Retry.NetworkTimeout.)
 (The operation was cancelled because it exceeded the configured timeout of 0:01:40. Network timeout can be adjusted in ClientOptions.Retry.NetworkTimeout.)
 (The operation was cancelled because it exceeded the configured timeout of 0:01:40. Network timeout can be adjusted in ClientOptions.Retry.NetworkTimeout.)
 (The operation was cancelled because it exceeded the configured timeout of 0:01:40. Network timeout can be adjusted in ClientOptions.Retry.NetworkTimeout.)
 ---> System.Threading.Tasks.TaskCanceledException: The operation was cancelled because it exceeded the configured timeout of 0:01:40. Network timeout can be adjusted in ClientOptions.Retry.NetworkTimeout.
 ---> System.Threading.Tasks.TaskCanceledException: The operation was canceled.
 ---> System.IO.IOException: Unable to read data from the transport connection: Operation canceled.
 ---> System.Net.Sockets.SocketException (125): Operation canceled

waelkdouh commented 1 year ago

At first sight, this looks like a connection issue, as you seem to be running your APIM instance inside a VNet and your DevOps agent may not be able to connect to it. Did you check that?

UniperMaster commented 1 year ago

Nope, it is not in a VNet, but doesn't the extractor use the Azure REST API to extract the data, and therefore shouldn't need access to the VNet?

waelkdouh commented 1 year ago

@guythetechie any thoughts on this one?

UniperMaster commented 1 year ago

It would seem this is the SKU being overloaded. When I went to Premium with 2 units it was fine. Occasionally I could get it to work with the Developer SKU (1 unit) if I upped MaxRetries and Delay, but I hardcoded these settings and I'm not sure how to apply them in appsettings.json or how I would include them in the pipeline, as that is not covered in the documentation. Any advice would be much appreciated.
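For readers unfamiliar with what raising MaxRetries and Delay does, here is a minimal, language-neutral sketch in Python of retrying with capped exponential backoff. It is not the extractor's code: `call_with_retry`, its parameter names, and the jitter range are all hypothetical and chosen only for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    max_retries: int = 3,
    delay_s: float = 0.8,
    max_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on TimeoutError with capped exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_retries:
                raise  # out of retries: surface the last timeout
            # Double the wait after each failed attempt, up to the cap.
            backoff = min(delay_s * 2 ** attempt, max_delay_s)
            # Jitter keeps parallel callers from retrying in lockstep.
            sleep(backoff * random.uniform(0.8, 1.2))
    raise AssertionError("unreachable")
```

A larger `max_retries` or `delay_s` gives an overloaded service more time to recover between attempts, which would be consistent with the occasional successes described above, though it also lengthens the run.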

guythetechie commented 1 year ago

@UniperMaster - thanks for raising this. We extract all APIs in parallel to maximize performance: https://github.com/Azure/apiops/blob/d4cc9450a77746024bde2ca12decb99122de169b/tools/code/extractor/Api.cs#L28

I can see APIM throttling requests if you're using a low SKU and have a lot of APIs. The easiest way to validate that is by changing ForEachParallel to ForEachAsync in the line above. This will extract the APIs sequentially.

Are you in a position to make that change to the extractor code and retry?
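The trade-off between fully parallel and sequential extraction can be illustrated with a small Python sketch that caps the number of calls in flight. This is not how the extractor is implemented; `for_each_bounded`, `max_parallel`, and `fake_export` are hypothetical names. Setting the cap to 1 gives sequential behaviour, while a larger cap sits between the two extremes.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def for_each_bounded(
    items: Iterable[T],
    action: Callable[[T], Awaitable[R]],
    max_parallel: int,
) -> List[R]:
    """Apply action to every item with at most max_parallel calls in flight."""
    gate = asyncio.Semaphore(max_parallel)

    async def guarded(item: T) -> R:
        async with gate:  # wait here while max_parallel calls are already running
            return await action(item)

    # gather returns results in the same order as the input items
    return await asyncio.gather(*(guarded(item) for item in items))


async def _demo() -> None:
    async def fake_export(name: str) -> str:
        await asyncio.sleep(0.01)  # stands in for a call to the management API
        return f"exported {name}"

    names = [f"api-{i}" for i in range(10)]
    results = await for_each_bounded(names, fake_export, max_parallel=1)
    print(len(results), "APIs exported sequentially")


if __name__ == "__main__":
    asyncio.run(_demo())
```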

guythetechie commented 1 year ago

By the way, these are the default retry options. We can expose some of this in configuration so that it's overridable.
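As a rough worked example of what such defaults imply: the error message earlier in this thread reports 4 tries and a network timeout of 0:01:40 (100 seconds). The base delay of 0.8 s, the 60 s cap, and the doubling backoff in the sketch below are assumptions about the client library's defaults rather than values confirmed in this thread, and jitter is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryDefaults:
    max_retries: int = 3              # 3 retries + first attempt = 4 tries (from the error message)
    network_timeout_s: float = 100.0  # 0:01:40 (from the error message)
    delay_s: float = 0.8              # assumed base delay for exponential backoff
    max_delay_s: float = 60.0         # assumed cap on any single delay


def worst_case_seconds(opts: RetryDefaults) -> float:
    """Upper bound on time spent on one request if every try times out."""
    tries = opts.max_retries + 1
    waits = sum(
        min(opts.delay_s * 2 ** i, opts.max_delay_s) for i in range(opts.max_retries)
    )
    return tries * opts.network_timeout_s + waits


if __name__ == "__main__":
    print(f"{worst_case_seconds(RetryDefaults()):.1f} s")  # prints "405.6 s"
```

Under these assumptions a single request can hold up the run for nearly seven minutes before the exception surfaces, which is why the pipeline appears to stall on one API.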

UniperMaster commented 1 year ago

@guythetechie, thanks, that's how I worked out it was a performance issue. I'd like to keep the repo as vanilla as possible; would we be able to have this as a configurable option in the pipelines?

guythetechie commented 1 year ago

Adding to backlog.

guestdj commented 10 months ago

Are there any updates on this? I am also experiencing the same timeout issue when extracting from a Developer tier APIM instance. It would also be nice to build a retry mechanism into the extractor pipeline, rather than having to upgrade to a Basic, Standard or Premium tier.

waelkdouh commented 10 months ago

We are currently working on implementing other features. Any chance you can submit a PR and we will be more than happy to review and merge.

guestdj commented 10 months ago

Just wanted to update that I have found a solution to the throttling issue.

It turns out that it was due to the platform version set on my APIM instance, not the Developer tier as I thought. The version was set to stv1, which uses a cloud service (classic) architecture; the newer version, stv2, uses VM scale sets.

When I tested my extraction against an APIM stv2 instance with the same APIs on it, the extraction worked with no issues. So now I have to figure out how to migrate the stv1 instance to stv2, but there is some documentation on this too.

More info on platform versions can be found here

Hope this helps anyone stuck with the same issue, maybe not that many going forward as I think stv2 is now the default version for new APIM instances.

waelkdouh commented 10 months ago

@guestdj thank you for the update. This is great information.

@UniperMaster can you confirm if this resolves your issue as well and close the issue accordingly?

vandanchev commented 8 months ago

Hey guys

I am observing the same timeout issues when testing on a Premium tier, stv2 APIM instance (with 1 to 3 units). The migration from stv1 to stv2 did not solve the issue but slightly improved the situation (more items were processed before the timeouts hit). So making the timeout settings configurable would be a great feature here.

Also, I haven't tested it, but a config option for choosing between a parallel or sequential run would also be great.
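To make the request concrete, a configuration surface of this kind might look like the Python sketch below, which reads tuning values from environment variables (as a pipeline could set them) and falls back to defaults. Every name here is hypothetical: `ExtractorTuning`, `load_tuning`, and the `EXTRACTOR_*` variables are not existing apiops settings, and the default values are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ExtractorTuning:
    max_retries: int = 3              # assumed default
    network_timeout_s: float = 100.0  # assumed default (0:01:40)
    max_parallel: int = 0             # 0 = no limit (parallel), 1 = sequential


def load_tuning(env: Mapping[str, str] = os.environ) -> ExtractorTuning:
    """Read hypothetical tuning settings from the environment, with defaults."""
    d = ExtractorTuning()
    tuning = ExtractorTuning(
        max_retries=int(env.get("EXTRACTOR_MAX_RETRIES", d.max_retries)),
        network_timeout_s=float(
            env.get("EXTRACTOR_NETWORK_TIMEOUT_SECONDS", d.network_timeout_s)
        ),
        max_parallel=int(env.get("EXTRACTOR_MAX_PARALLEL", d.max_parallel)),
    )
    # Reject values that would make the run meaningless rather than fail later.
    if tuning.max_retries < 0 or tuning.network_timeout_s <= 0 or tuning.max_parallel < 0:
        raise ValueError(f"invalid tuning values: {tuning}")
    return tuning


if __name__ == "__main__":
    print(load_tuning({"EXTRACTOR_MAX_PARALLEL": "1"}))
```

Keeping the defaults identical to the current behaviour would mean existing pipelines are unaffected unless a variable is set.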

waelkdouh commented 8 months ago

Hi @vandanchev can you submit a PR so we can review it? I believe it's already an item on our backlog.

UniperMaster commented 7 months ago

Our Dev instance is stv1. I can't test this at the moment, but we are planning on migrating to stv2 soon. When I initially tested it, it worked on Premium.

guestdj commented 7 months ago

I have another update on the timeout issue. After doing the platform update on an APIM instance that had a LOT of APIs, I started getting the timeout issue again. Upgrading to the Premium tier has now fixed it, albeit as a very expensive fix. I'm now waiting for the Standard v2 tier that goes GA next April!

waelkdouh commented 7 months ago

@guestdj this is great feedback! Please keep updating this thread, which will serve as a knowledge base for others in the future.