Should `WaitFor` have a default timeout?

afscrome commented 1 week ago

Is there an existing issue for this?

[X] I have searched the existing issues

Is your feature request related to a problem? Please describe the problem.

I left a broken app running for over 20 minutes, and the resource state remained Waiting. Should this have timed out by now and gone into FailedToStart?

Even worse is when the same thing happens during an integration test - the DistributedApplicationTestingBuilder.BuildAsync().StartAsync() method just hangs, with little other information available through the debugger (and even less if the test is running in CI).

Describe the solution you'd like

Whilst I'm less fussed about the hang when the dashboard is accessible as you can find the information you need there (somewhat related to #5544), hanging tests with no diagnostic data is far more painful.

For the test scenario, I'd like to see a default (but overridable) timeout either on individual WaitFor calls, or a timeout on the whole DistributedApplicationTestingBuilder.BuildAsync().StartAsync(). Whilst a user could apply their own timeout by adding WaitAsync(), the test will then fail with a generic The operation has timed out error, which doesn't give you much to go and troubleshoot. Tiemout support built in to aspire could fail with a better error message, that will be captured by test runners & CI systems e.g.

Application failed to start as 'Foo', 'Bar' and 'Baz' failed to reach the Starting state.

Additional context

There is some overlap with #5632, however this issue is a more specifically related to WaitFor .

Also want to stress that despite the deluge of issues I've been logging around WaitFor, I do love what has been implemented so far. Just logging things I'm discovering as I play around this evening.

mitchdenny commented 1 week ago

I'm sympathetic to this request as it caught me out as I was writing the first set of unit tests. The solution is actually to use a cancellation token on the StartAsync method.

https://github.com/dotnet/aspire/blob/1b2e640a366afd68c99b915762416774b8a9be91/tests/Aspire.Hosting.Redis.Tests/RedisFunctionalTests.cs#L21-L68

davidfowl commented 1 week ago

I don't think it should. We're adding start/stop to the dashboard to help mitigate forever running issues. IMO there shouldn't be a default timeout

davidfowl commented 6 days ago

Maybe we need the ability to set a default timeout, but there's none by defuault.

afscrome commented 5 days ago

I'm happy with the for the App Host project itself - as you said there you have the dashboard to investigate what is hanging as well as manually start/stop/restart. (+ any other UI that comes with WaitFor and health checks)

I still think there should be a sanity timeout with DistributedApplicationTestingBuilder (e.g. 5 mins). Whilst it is possible to set a timeout through a custom cancellation token on DistributedApplicationTestingBuilder.StartAsync, that does require you to first realise that you're test is going to timeout. And I'd expect if your application hangs due to something with WaitOn in a test, it's most likely accidental. Even worse is if there's something different with CI so everything passes locally, but fails in Github Actions / Azure Pipelines / whatever tool where you have limited debug options and can't reproduce locally.

That said, if there is an opt-in default timeout, DistributedApplicationTestingBuilder could make use of that to implement such a sanity timeout.

The sanity timeout would need to be set high enough that it gives images enough time to pull over bad network connections, although if it's only configured by default on DistributedApplicationTestingBuilder, then it's may not be the end of the world if it occasionally fails on the edge case you have a poor internet connection, no cached image and are pulling several large new images

dotnet / aspire