Azure / azure-sdk-tools

Tools repository leveraged by the Azure SDK team.
MIT License

Cross-language, cross-library tests timing out for China cloud #6175

Open heaths opened 1 year ago

heaths commented 1 year ago

Because our test pipeline agents run from somewhere in the US and, perhaps, due to some other issues, many of our tests across libraries and languages are timing out against the China cloud, e.g., Azure/azure-sdk-for-net#34641 (a manual test run, but indicative of tests-weekly runs).

Discussing this offline, we could expose some variable in cloud-specific files, e.g., https://github.com/Azure/azure-sdk-tools/blob/main/eng/common/TestResources/clouds/AzureChinaCloud.json, that gets plumbed through to clients.
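As a minimal sketch of what "expose some variable and plumb it through" could look like: a cloud file carries a per-cloud scaling hint, and test setup multiplies its default timeout by it. The key `TestTimeoutMultiplier`, the helper `effective_timeout`, the default of 100 seconds, and the inline JSON are all hypothetical; nothing here claims the real AzureChinaCloud.json has such a field.

```python
import json

DEFAULT_TIMEOUT_SECONDS = 100.0  # assumed default, for illustration only

# Hypothetical cloud-specific config; "TestTimeoutMultiplier" is an invented key.
CHINA_CLOUD_JSON = '{"Cloud": "AzureChinaCloud", "TestTimeoutMultiplier": 3}'


def effective_timeout(cloud_config_text, default=DEFAULT_TIMEOUT_SECONDS):
    """Scale the default timeout by the cloud's multiplier, if one is present."""
    config = json.loads(cloud_config_text)
    # Clouds without the key keep the default timeout unchanged.
    multiplier = float(config.get("TestTimeoutMultiplier", 1))
    if multiplier <= 0:
        raise ValueError("timeout multiplier must be positive")
    return default * multiplier
```

Keeping the value in the cloud file rather than in each test means a single edit would apply across languages, assuming each language's test setup reads the same file.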

heaths commented 1 year ago

/cc @benbp @joshlove-msft @christothes @jsquire

richardpark-msft commented 1 year ago

@heaths, to "fix" this, would you then do something special if you detected you were using the China cloud (i.e., tune retry timeouts)?

heaths commented 1 year ago

I honestly don't know if that would help. It might mitigate some of the timeouts, but it's an arms race at that point. If we can run agents closer to their cloud, that would be best.

richardpark-msft commented 1 year ago

@heaths, is that the approach you were outlining here? (it mentions an offline discussion)

Discussing this offline, we could expose some variable in cloud-specific files

richardpark-msft commented 1 year ago

(also, agreed that moving our tests to run inside the cloud or nearer is the right option)

heaths commented 1 year ago

@benbp is the mastermind here. I might've been alluding to some way to say "use a longer timeout", but Ben was going to see if we could run agents closer to their cloud to mitigate the high latency that is likely causing this.
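A rough way to check the latency theory from a given agent, sketched under assumptions: time a few TCP connections to a service endpoint and take the median, then compare the number from a US agent against one closer to the cloud. The helper names are hypothetical, and a TCP connect only approximates one network round trip, so this is an indicator rather than a measurement of test duration.

```python
import socket
import statistics
import time


def tcp_connect_seconds(host, port=443, timeout=10.0):
    """Time one TCP connection to host:port, in seconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return time.perf_counter() - start


def median_latency(probe, samples=5):
    """Call probe() `samples` times and return the median of the results."""
    if samples < 1:
        raise ValueError("samples must be at least 1")
    return statistics.median(probe() for _ in range(samples))
```

To use it, pass a zero-argument callable such as `lambda: tcp_connect_seconds(host)`, where `host` is whichever endpoint the failing tests talk to.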

benbp commented 1 year ago

@heaths We could spin up a southeastasia agent pool. I would need to do a bit of YAML plumbing first to get our cloud configs to target agent VM regions. CC @mikeharder

heaths commented 1 year ago

Any way we could test that it would make a meaningful difference before doing all that work? Could you or I (happy to help) make some changes in a PR that would force it and just test those against China's cloud (remove the others, for example)?

benbp commented 1 year ago

@heaths the easiest way to test would be:

  1. Spin up a new agent pool (5-10 mins, manual)
  2. Update the test matrix json to reference the Asia agent pool
  3. Run pipeline with matrix filter for fewer OS jobs (since I'm only going to provision 1 agent for now)
  4. Run pipeline with only the China stage selected (select Stages at the Run widget)
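Step 2 above could be sketched roughly as follows, assuming the test matrix JSON keeps its agent entries under `matrix` > `Agent`, each with a `Pool` key. That shape, the helper `retarget_pools`, and every pool and image name below are placeholders for illustration, not the real values.

```python
import copy
import json

# Assumed matrix shape; the pool and image names are placeholders.
SAMPLE_MATRIX = json.loads("""
{
  "matrix": {
    "Agent": {
      "ubuntu": {"OSVmImage": "ubuntu-image", "Pool": "default-ubuntu-pool"},
      "windows": {"OSVmImage": "windows-image", "Pool": "default-windows-pool"}
    }
  }
}
""")


def retarget_pools(matrix, pool_name):
    """Return a copy of the matrix with every agent entry pointed at pool_name."""
    updated = copy.deepcopy(matrix)  # leave the caller's matrix untouched
    for agent in updated["matrix"]["Agent"].values():
        agent["Pool"] = pool_name
    return updated
```

Because only one agent would be provisioned at first, the matrix filter in step 3 would still be needed to keep the job count within what that single agent can serve.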