getgauge / gauge-dotnet

C# runner for gauge + DotNet standard 6.0-8.0
Apache License 2.0

Failed to start gauge API: Time out connecting to dotnet #241

Closed RemiHoston closed 1 month ago

RemiHoston commented 1 month ago

Is your feature request related to a problem? Please describe.

When I run the command gauge run specs, this error occurs: Failed to start gauge API: Time out connecting to dotnet

Describe the solution you'd like

I have found that this issue occurs very frequently, both on my local machine and on the pipeline agent. When it happens locally I can kill the process from Task Manager, but on the pipeline I have to retry after a few minutes. So I suspect it may be a defect in Gauge's main logic.

My gauge version is:

 Gauge version: 1.6.9
Plugins
-------
 csharp (0.10.6)
 dotnet (0.7.2)
 html-report (4.3.1)
 screenshot (0.3.0)
 xml-report (0.5.1)

Describe alternatives you've considered

It would be better to close all the dotnet processes when the gauge run command finishes.

chadlwilson commented 1 month ago

It sounds like you have diagnosed the problem as a dead or stuck runner.

Do you have a way to reproduce this? Any logs to help debug what might have happened? What environments(s)/OSes are you running from?

RemiHoston commented 1 month ago

Thank you for your attention. In my case the only useful information is this:

2024-10-17T08:56:51.9589391Z MSBuild version 17.9.8+610b4d3b5 for .NET
2024-10-17T08:57:22.5882646Z Error ----------------------------------
2024-10-17T08:57:22.5882913Z
2024-10-17T08:57:22.5883196Z [Gauge]
2024-10-17T08:57:22.5883618Z Failed to start gauge API: Timed out connecting to dotnet
2024-10-17T08:57:22.5886914Z
2024-10-17T08:57:24.1956320Z Get Support ----------------------------
2024-10-17T08:57:24.1962697Z
2024-10-17T08:57:24.2209038Z Docs: https://docs.gauge.org
2024-10-17T08:57:24.2214110Z Bugs: https://github.com/getgauge/gauge/issues
2024-10-17T08:57:24.2218619Z Chat: https://github.com/getgauge/gauge/discussions
2024-10-17T08:57:24.2219847Z
2024-10-17T08:57:24.2221192Z Your Environment Information -----------
2024-10-17T08:57:24.2222946Z windows, 1.6.9, aff43ef
2024-10-17T08:57:24.2224027Z csharp (0.10.6), dotnet (0.7.2), html-report (4.3.1), screenshot (0.3.0), xml-report (0.5.1)

It happens when a large number of cases have finished running and a new task immediately starts another Gauge run. It can be worked around by retrying several times on the pipeline.
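The retry workaround mentioned here could be scripted roughly as follows. This is only a sketch under stated assumptions: `run_with_retry`, `run_gauge_once`, the attempt count and the delay are hypothetical names and values, not part of Gauge, and the marker string is simply the error text from the log above. It retries only when that error appears, so genuine spec failures are not hidden.

```python
import subprocess
import time

# Error text copied from the log above; matches both "Time out" and "Timed out" variants.
TIMEOUT_MARKER = "Failed to start gauge API"


def run_gauge_once(cmd):
    """Run the command and return (exit_code, combined stdout and stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def run_with_retry(cmd, attempts=3, delay_seconds=60,
                   run=run_gauge_once, sleep=time.sleep):
    """Re-run cmd only when it fails with the runner connection error."""
    code, output = 1, ""
    for attempt in range(1, attempts + 1):
        code, output = run(cmd)
        if code == 0 or TIMEOUT_MARKER not in output:
            # Success, or a failure unrelated to the connection timeout: stop here.
            return code, output
        if attempt < attempts:
            # Assumed pause to let leftover runner processes exit before retrying.
            sleep(delay_seconds)
    return code, output
```

A pipeline step might call `run_with_retry(["gauge", "run", "specs"])` and exit with the returned code. The `run` and `sleep` parameters exist only so the logic can be exercised without launching Gauge.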

Agent info:

| Agent.OS | Windows_NT |
| Agent.OSArchitecture | X64 |
| Agent.OSVersion | 10.0.17763 |

Hope the Gauge tool can be better and better!

AlbertZhang6 commented 1 month ago

I've hit the same issue as RemiHoston. Do you have any idea, @chadlwilson?

chadlwilson commented 1 month ago

No, unfortunately not - as I don't have any information to replicate it, proper logs from the Gauge itself (logs/gauge.log or similar), or any details from people about what changed when this started happening.

It seems similar to https://github.com/getgauge/gauge-dotnet/issues/196 (which supposedly was fixed in https://github.com/getgauge/gauge-dotnet/issues/197) but may have resurfaced as noted in https://github.com/getgauge/gauge-dotnet/issues/204 and https://github.com/getgauge/gauge-dotnet/issues/199

You could try rolling back to gauge-dotnet 0.5.8. If the problem goes away or you're using async methods, it's likely related to those I guess?

RemiHoston commented 1 month ago

Yes, this depends on the latest changes to Gauge.CSharp.Lib. We currently reference the latest version, 0.11.3, which requires dotnet plugin version 0.7.1 or later. It contains a breaking change.

chadlwilson commented 1 month ago

@RemiHoston Sorry, I don't understand what you are saying. You can roll back the CSharp and/or Gauge version too, right?

RemiHoston commented 1 month ago

Sorry, my bad. I mean that the component referenced in my project is Gauge.CSharp.Lib 0.11.3, which contains breaking changes in how SuiteDataStore, ScenarioDataStore and SpecDataStore are used (these three have become static classes, whereas before they were obtained through DataStoreFactory). Gauge.CSharp.Lib 0.11.3 also requires the dotnet plugin to be at least 0.7.1 (if I am correct), so dotnet 0.5.8 could not run successfully. By the way, what is special about the latest dotnet version, 0.7.2? Could we optimize it?

[image]

And if the dotnet plugin is 0.7.2, it runs smoothly:

[image]

Since upgrading Gauge.CSharp.Lib from 0.10.3 to 0.11.3 we have made a lot of changes in our gauge test project, and we hope we can keep using the latest Gauge version.

chadlwilson commented 1 month ago

Let me try asking another way.

What was the last reliable combination of gauge/gauge-dotnet/Gauge.CSharp.Lib versions in your environment, that did not have this problem?

We need to narrow down when a problem started or we will be going around in circles forever guessing. Try to make it easy for maintainers, rather than require them to guess what you have done or changed in your environment or how you might use a piece of software.

RemiHoston commented 1 month ago

Thank you. I have checked our run history; this is the combination we had been using for a very long time:

Gauge.CSharp.Lib.0.7.2

Gauge version: 1.0.8
Commit Hash: 28617ea

Plugins

csharp (0.10.6)
dotnet (0.1.7)
html-report (4.0.12)
screenshot (0.0.1)
xml-report (0.2.3)

chadlwilson commented 1 month ago

Unfortunately it can happen for different reasons depending on what type of specs impls you are writing. So what caused the problem earlier may not be what is causing the problem now. If there was a 'good' version somewhere in between that'd be useful to know.

Do you have logs from gauge itself (not from your spec run) you can share?

jensakejohansson commented 1 month ago

If I understand you correctly this happens occasionally? I can just mention that I have seen this problem from time to time for a long time (years). On my own development machine not very often, but when working on a client's laptop I get it a couple of times every day I believe (should maybe start to track this more).

I believed it to be a performance issue. When running larger projects on machines that are a little slow (for lack of a better description), Gauge tends to be more unstable. I've also seen the problem in VS Code where the run/debug options do not appear because the project does not load correctly. Changing to a more powerful machine solved that issue.

I have had the timeout problem in pipelines too (Azure Linux agents), I "solved" it there by increasing the timeouts to some insane amount, since it only happened every now and then and a bigger timeout "solved" it.

gauge config --list (to see timeout options).

Unfortunately I'm too busy with a customer project (I'm a consultant) to investigate, so I can only give vague statements, but in my opinion there is some flakiness going on that should be addressed.

mpekurny commented 1 month ago

I have also seen this issue for a few years, but exclusively when running through VSCode. I had always assumed that it may have been a defect in the VSCode plugin. The fix (for me) was to shut down VSCode and then kill all running processes of .NET Host. Reading through this thread, maybe the problem is that the dotnet plugin isn't getting shut down correctly, leaving running processes of .NET Host hanging around until the machine's limitations (memory/CPU) mean it can no longer launch the next .NET Host in a timely manner. It's also possible the async changes have somehow made the existing issue worse, such as by consuming more resources per run or making it more likely that the plugin fails to shut down correctly.

sriv commented 1 month ago

Like Chad said, this can happen for a variety of reasons, but ultimately the error occurs when gauge is not able to connect to the dotnet plugin. The most likely cause is that the dotnet runner crashed for some reason.

Some options to get more details:

  1. run with --log-level=debug, e.g. gauge run specs --log-level debug. This shows more information than normal and can shed some light on what's happening.
  2. look for the details in the logs directory after the error.
  3. look at your Windows event log under the "Application" category.
  4. if none of the above give enough hints, try invoking the dotnet plugin manually:
     a. locate the gauge plugins directory (by default these are in %APPDATA%/gauge/plugins/dotnet/<version> IIRC)
     b. now cd into your project directory and invoke the dotnet plugin standalone

ex:

cd c:\path\to\project\root
%APPDATA%\gauge\plugins\dotnet\<version>\bin\dotnet.bat --start

This is the command that gauge invokes, so by manually invoking this you can see the stdout of gauge-dotnet when running your project. My suspicion is that something is going wrong here, and we will know more once we simulate the error.

As for the orphan processes, unfortunately on Windows the child process does not die when gauge is killed forcibly. I suspect that is causing this behaviour, but it's hard to comment without having a way to replicate the issue.
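To check for such leftover processes between runs, one option is to parse the output of the Windows `tasklist` command. The sketch below is an illustration, not something Gauge provides: `find_pids` is a hypothetical helper, and it assumes the runner shows up under the image name `dotnet.exe`, which will also match unrelated .NET processes on the same machine.

```python
import csv
import io


def find_pids(tasklist_csv, image_name="dotnet.exe"):
    """Return PIDs whose image name matches, given `tasklist /FO CSV /NH` output."""
    pids = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # Columns: Image Name, PID, Session Name, Session#, Mem Usage
        if len(row) >= 2 and row[0].lower() == image_name.lower():
            pids.append(int(row[1]))
    return pids


# Example wiring on a Windows agent (left commented out because it kills processes):
# import subprocess
# listing = subprocess.run(["tasklist", "/FO", "CSV", "/NH"],
#                          capture_output=True, text=True).stdout
# for pid in find_pids(listing):
#     subprocess.run(["taskkill", "/PID", str(pid), "/T", "/F"])
```

Killing every `dotnet.exe` is a blunt instrument, so on a shared agent it is probably safer to only log the PIDs first and confirm they belong to a finished Gauge run.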

RemiHoston commented 1 month ago

Thank you guys! I ran it like this:

gauge run specs -l debug

But it output nothing useful. On the other hand, I found that the failure happens more often in the afternoon than in the morning, and the total time taken by the Gauge-Test-Task is about 48 sec. It seems to be related to the agent's resources (CPU usage or disk). So I have raised runner_connection_timeout (gauge.properties) from 30 sec to 180 sec and will keep monitoring for a while. Does dotnet8 make it slower?
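For pipelines that provision agents automatically, the timeout change could be applied by a small script instead of by hand. The following is a minimal sketch: `set_property` is a hypothetical helper, and the value 180000 assumes the setting is expressed in milliseconds (Gauge's shipped default for runner_connection_timeout is 30000), which is worth confirming with gauge config --list.

```python
import re


def set_property(text, key, value):
    """Return properties text with `key = value`, replacing an existing entry or appending one."""
    # Match an uncommented "key = ..." line; [ \t]* avoids swallowing preceding blank lines.
    pattern = re.compile(r"^[ \t]*" + re.escape(key) + r"[ \t]*=.*$", re.MULTILINE)
    line = f"{key} = {value}"
    if pattern.search(text):
        return pattern.sub(lambda match: line, text, count=1)
    separator = "" if text == "" or text.endswith("\n") else "\n"
    return text + separator + line + "\n"


# Example wiring (the path is whatever gauge.properties file the agent actually uses):
# from pathlib import Path
# props = Path(r"path\to\gauge.properties")
# props.write_text(set_property(props.read_text(), "runner_connection_timeout", 180000))
```

Commented-out entries (lines starting with `#`) are left alone, so a disabled default stays as documentation and the active value is appended after it.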

RemiHoston commented 1 month ago

I kept monitoring for these two days. The longest the gauge test runner took to connect to the dotnet API was 46s, and so far no more timeouts have happened. I think my issue has been fixed.