Azure / azure-sdk-tools

Tools repository leveraged by the Azure SDK team.

[tsp-client] `tsp-client init` stuck in Pipeline #8368

Open XiaofeiCao opened 4 weeks ago

XiaofeiCao commented 4 weeks ago

Symptom

The Java SDK has its own code generation pipeline that generates SDKs from TypeSpec: https://dev.azure.com/azure-sdk/internal/_apps/hub/ms.vss-build-web.ci-designer-hub?pipelineId=2238&nonce=e6ID6FSUknWJb/xkwTih5Q%3D%3D&branch=main

Recently the pipeline has been hanging at the last output line and timing out after 60 minutes.

Screenshot 2024-06-03 at 14 41 57

Failed Run: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=3839396&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9&t=356bb04c-cb4a-5f04-82ca-d3b102917eba

Last successful run was about two weeks ago: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=3761145&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9&t=356bb04c-cb4a-5f04-82ca-d3b102917eba

Reproducing the bug

Rerun the pipeline and it will hang: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=3887788&view=results

Env

vmImage: ubuntu-20.04
python --version: Python 3.10.14
nodejs: 18.20.3

command: npx tsp-client init --debug --tsp-config https://github.com/Azure/azure-rest-api-specs/blob/9df71d5a717e4ed5e6728e7e6ba2fead60f62243/specification/informatica/Informatica.DataManagement/tspconfig.yaml

When run locally on my machine, it does not hang. SDK Automation seems fine too: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=3839388&view=logs&j=a8a7a537-82b0-583c-7971-bac70b9822ca&t=37e3947b-3cfb-5d36-86ba-0e22bb7dbc33&l=1222
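For readers unfamiliar with the `--tsp-config` argument: the URL in the command above pins a repository, a commit, and a spec directory. The following is a minimal sketch of how such a URL breaks down, assuming it follows the `github.com/<owner>/<repo>/blob/<commit>/<directory>/tspconfig.yaml` layout seen in this issue. `TspConfigLocation` and `parse_tsp_config_url` are hypothetical names for illustration; this is not tsp-client's own parsing code.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class TspConfigLocation(NamedTuple):
    """Hypothetical container for the pieces pinned by a tspconfig.yaml URL."""
    repo: str       # "<owner>/<repo>"
    commit: str     # ref after /blob/ (assumed to be a single path segment, e.g. a SHA)
    directory: str  # spec directory that contains tspconfig.yaml


def parse_tsp_config_url(url: str) -> TspConfigLocation:
    """Split a GitHub blob URL pointing at tspconfig.yaml into repo, commit, directory."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected layout: owner / repo / "blob" / ref / <directory...> / tspconfig.yaml
    if len(parts) < 5 or parts[2] != "blob" or parts[-1] != "tspconfig.yaml":
        raise ValueError(f"not a GitHub blob URL to tspconfig.yaml: {url}")
    owner, repo, _, commit = parts[:4]
    directory = "/".join(parts[4:-1])
    return TspConfigLocation(f"{owner}/{repo}", commit, directory)
```

Because the commit is part of the URL, the same URL should resolve to the same spec content locally and in the pipeline, which points away from the spec itself as the difference between the two environments.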

Expected behavior

We know this is our own pipeline, but we are wondering what could cause the hang from the tsp-client side. Any insight on the issue would be appreciated. Thanks!

weidongxu-microsoft commented 4 weeks ago

Is it intermittent, or does it always fail on certain RPs?

XiaofeiCao commented 4 weeks ago

> Is it intermittent, or does it always fail on certain RPs?

It always fails for informatica. I'll give it a try for other RPs.

XiaofeiCao commented 4 weeks ago

Same for deviceregistry: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=3842824&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9

XiaofeiCao commented 2 weeks ago

It seems to hang at the compile call: https://github.com/Azure/azure-sdk-tools/blob/b7c22df944c532ef622fdf523ccb901f39f53d73/tools/tsp-client/src/typespec.ts#L100-L106

XiaofeiCao commented 1 week ago

Latest finding: when there's an error in the tsp file, the pipeline doesn't hang and reports the error as expected: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=3888615&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9&t=356bb04c-cb4a-5f04-82ca-d3b102917eba&l=118

Screenshot 2024-06-20 at 16 50 57

XiaofeiCao commented 1 week ago

Running tsp-client directly also hangs:

Screenshot 2024-06-20 at 17 09 51

https://dev.azure.com/azure-sdk/internal/_build/results?buildId=3888751&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9

pipeline definition: https://github.com/Azure/azure-sdk-for-java/blob/67f89cb9be55f4fcadc534d5dd6c867750ba8fa8/eng/mgmt/automation/generation.yml#L51-L54

XiaofeiCao commented 1 week ago

Latest finding: using macos-13, the run succeeded without hanging: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=3889023&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9&t=356bb04c-cb4a-5f04-82ca-d3b102917eba

weidongxu-microsoft commented 1 week ago

You may change the VM if you really need to (but maybe Windows instead of Mac).

XiaofeiCao commented 1 week ago

> You may change the VM if you really need to (but maybe Windows instead of Mac).

Yeah, it seems macOS resources are limited: https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/hosted?view=azure-devops&tabs=yaml

I see that Windows uses Git Bash for bash: https://learn.microsoft.com/en-us/azure/devops/pipelines/scripts/cross-platform-scripting?view=azure-devops&tabs=yaml#consider-bash-or-pwsh

I'll try that.

XiaofeiCao commented 1 week ago

It also hangs on Windows bash: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=3892604&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9&t=356bb04c-cb4a-5f04-82ca-d3b102917eba

I tried PowerShell and it also hangs: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=3892700&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9&t=1a5cc010-7735-550a-9d76-c0b745122dab

I don't understand. tsp-client is run directly, without Python: [image]

Let me try the tsp-client command used by sdkautomation.

XiaofeiCao commented 1 week ago

It hangs even with the sdkautomation command (tsp-client init --local-repo): https://dev.azure.com/azure-sdk/internal/_build/results?buildId=3893256&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9&t=ffe5b61a-4918-59e6-2d77-95255068933d

catalinaperalta commented 1 week ago

Does it get stuck with every library or just some? Are there any issues reported when the debug level is set?

weidongxu-microsoft commented 6 days ago

> It hangs even with the sdkautomation command (tsp-client init --local-repo): https://dev.azure.com/azure-sdk/internal/_build/results?buildId=3893256&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9&t=ffe5b61a-4918-59e6-2d77-95255068933d

What's the difference between your experiment branch and the current SDK automation (I assume that is still fine)?

XiaofeiCao commented 6 days ago

> Does it get stuck with every library or just some?

For my pipeline, it's every library.

> Are there any issues reported when the debug level is set?

No, the last output was:

Screenshot 2024-06-26 at 10 17 48

The command was simple, e.g.: https://github.com/Azure/azure-rest-api-specs/blob/7605afe88e3201dc25ce0881c2e49fe1b6bbdd54/specification/mongocluster/DocumentDB.MongoCluster.Management/tspconfig.yaml

XiaofeiCao commented 6 days ago

> It hangs even with the sdkautomation command (tsp-client init --local-repo): https://dev.azure.com/azure-sdk/internal/_build/results?buildId=3893256&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9&t=ffe5b61a-4918-59e6-2d77-95255068933d

> What's the difference between your experiment branch and the current SDK automation (I assume that is still fine)?

Currently I'm not seeing any major differences. We all use node 18.20.x: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=3910442&view=logs&j=a8a7a537-82b0-583c-7971-bac70b9822ca&t=37e3947b-3cfb-5d36-86ba-0e22bb7dbc33&l=181

One thing I noticed is that SDK automation runs in AzureCLI. I also tried AzureCLI, with the same result: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=3910442&view=logs&j=a8a7a537-82b0-583c-7971-bac70b9822ca&t=37e3947b-3cfb-5d36-86ba-0e22bb7dbc33&l=3 https://github.com/Azure/azure-rest-api-specs-pipeline/blob/master/.azure-pipelines/templates/RunSDKAutomation.yml#L10

catalinaperalta commented 5 days ago

So it just hangs at the compile function? Without any errors being returned? Also, I still need to understand, is this always happening in this pipeline? For every library?

XiaofeiCao commented 5 days ago

> So it just hangs at the compile function? Without any errors being returned?

Yes. No error is returned. It just hangs for 60 minutes and then times out.

Screenshot 2024-06-27 at 14 25 55

Like this pipeline: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=3914963&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9&t=356bb04c-cb4a-5f04-82ca-d3b102917eba

This hanging task is just calling tsp-client init:

- bash: |
    npx tsp-client init --tsp-config $(TSP_CONFIG) --debug
  displayName: '[Experiment] run tsp-client directly'
  condition: eq(variables.fromTypeSpec, true)

https://github.com/Azure/azure-sdk-for-java/compare/main...mgmt_directly_call_tsp-client

> is this always happening in this pipeline? For every library?

Yes. It hangs for every library for this pipeline.
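Since the step hangs for the full 60 minutes before the pipeline times out, one option while debugging is to wrap the command in a much shorter timeout and keep whatever output was produced before the hang. The following is a minimal sketch using only the Python standard library; `run_with_timeout` is a hypothetical helper, not part of tsp-client or the pipeline, and the 10-minute limit in the usage comment is an arbitrary assumption.

```python
import subprocess
from typing import List, Optional, Tuple


def run_with_timeout(cmd: List[str], timeout_s: float) -> Tuple[Optional[int], str]:
    """Run cmd and return (exit_code, output); exit_code is None if it timed out."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # interleave stderr so the last line before a hang is kept
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the direct child on timeout; exc.output holds
        # whatever was captured so far (it may be None if nothing was read).
        partial = exc.output or b""
        return None, partial.decode(errors="replace")
    return done.returncode, done.stdout.decode(errors="replace")


# Example usage (not executed here); the tspconfig URL is left as a placeholder:
#   code, out = run_with_timeout(
#       ["npx", "tsp-client", "init", "--tsp-config", "<tspconfig url>", "--debug"],
#       timeout_s=600,
#   )
#   if code is None:
#       print("timed out; output before the hang:\n" + out)
```

Note that only the direct child process is killed on timeout. `npx` starts further child processes, and this sketch makes no attempt to clean those up.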

catalinaperalta commented 4 days ago

I'm adding @timotheeguerin to see if he has any ideas since we're failing in the typespec compile call. Maybe there's something we can do to further debug the compile step. Tim, here is the link to the compile function: https://github.com/Azure/azure-sdk-tools/blob/d471d1f370dcdc696d995eb2b41dd0ac4ef95fb3/tools/tsp-client/src/typespec.ts#L55

Likewise, I tested the init command directly in PowerShell with the sphere library and had no issues. My node version is 20.11.0 and I'm on Windows 11. The command I ran: npx tsp-client init --tsp-config https://github.com/Azure/azure-rest-api-specs/blob/7a41d14c661171b4fffec5863c51fb70529ee1db/specification/sphere/Sphere.Management/tspconfig.yaml --debug

Since tsp-client is running successfully in the automation pipelines, it seems there's some new configuration in this pipeline that's causing the issue, or we might be passing some unexpected data into the tool. It could be worth checking whether anything in the other pipeline configurations could help resolve this issue. By the time we reach the compile call in tsp-client, we've already finished cloning the tsp project, installing the emitter deps, etc., so at that point we're just waiting to get the compiled library back from @typespec/compiler.

catalinaperalta commented 4 days ago

@XiaofeiCao could you also share an example of the tspconfig.yaml url or path you're passing into the command?

XiaofeiCao commented 4 days ago

> @XiaofeiCao could you also share an example of the tspconfig.yaml url or path you're passing into the command?

Sure, here it is: https://github.com/Azure/azure-rest-api-specs/blob/7605afe88e3201dc25ce0881c2e49fe1b6bbdd54/specification/mongocluster/DocumentDB.MongoCluster.Management/tspconfig.yaml

You can also try your own URL in my pipeline: click Run New on my stuck pipeline and, under Variables, replace TSP_CONFIG with your own. [image]

XiaofeiCao commented 4 days ago

@catalinaperalta Another interesting finding is that the pipeline doesn't hang on the macOS vmImage.