dotnet / aspire

An opinionated, cloud ready stack for building observable, production ready, distributed applications in .NET
https://learn.microsoft.com/dotnet/aspire
MIT License
3.41k stars 357 forks source link

Unhelpful error message if DCP dependency check fails #4691

Closed karolz-ms closed 1 week ago

karolz-ms commented 2 weeks ago

If the dependency check that uses dcp info command fails, with dcp returning non-zero code, an exception is thrown. The way we process this exception here https://github.com/dotnet/aspire/blob/main/src/Aspire.Hosting/Dcp/DcpDependencyCheck.cs#L105 cause all output of the dcp command to be discarded.

The result is the user only sees a message that says "dcp info returned non-zero exit code", which is not helpful:

image

We should instead include all the output from dcp in the log, so that the user is informed about why the dependency check failed.

JamesNK commented 2 weeks ago

Have you considered having dcp return specific error codes for certain situations? For example, return exit code X if Docker isn't healthy and then the host can check for known error codes and print a friendly error message to the host console. It would look better than an error type + message + stack trace.

The fallback for unknown exit codes would be what you suggested and include dcp output in the host console.

karolz-ms commented 2 weeks ago

Agreed, I do not think we want the stack trace here.

We could consider having specific error codes for common issues, but ultimately DCP should be able to tell the user what part of dependency check failed, and what the user might want to try to get themselves unblocked.

radical commented 2 weeks ago

Some hosting tests fail randomly on CI with Application orchestrator dependency check had an unexpected error System.TimeoutException: The operation has timed out. but no other information in the log. This issue - https://github.com/dotnet/aspire/issues/4640 - would also benefit from getting the details in the log.

davidfowl commented 2 weeks ago

Wait is this a dupe of https://github.com/dotnet/aspire/issues/4089?

cc @danegsta

JamesNK commented 2 weeks ago

It was thought to be fixed. But I got an unhelpful error message when testing locally with latest main source code.

JamesNK commented 2 weeks ago

FYI, this is the result from dcp info:

{"version":"0.5.4","commitHash":"a2453505df8d60a288a1fdb30004e11ca8854b22","buildTimestamp":"2024-06-20T22:49:08Z","containers":{"runtime":"docker","hostName":"host.docker.internal","installed":true,"running":false,"error":"Docker CLI timed out while checking status. Ensure Docker CLI is functioning correctly and try again."}}

Could it have been fixed in dcp but dcp wasn't yet updated on dotnet/aspire main?

danegsta commented 2 weeks ago

@JamesNK was your Docker stuck in resource saver mode when you originally hit the exit code scenario? I’ve been trying to reproduce with various states of Docker installation/running states, but haven’t been able to get an exit code response yet.

JamesNK commented 2 weeks ago

Yes, it was in resource saver mode. I had to restart my machine to make it healthy.

davidfowl commented 2 weeks ago

I just did that!

danegsta commented 2 weeks ago

Okay, my money is on us not explicitly handling a cancellation error somewhere and letting it bubble all the way up.

danegsta commented 2 weeks ago

This looks to be a simple mismatch between timeouts; DCP has a 30 second timeout configured for Docker commands, while Aspire has a 25 second timeout on the info command. We're returning a non-zero exit code because Aspire kills the DCP info check before it completes.

danegsta commented 2 weeks ago

I've got a DCP PR in progress to reduce our default timeout on the info command + add a config flag that Aspire can use to override the timeout. Implementing that flag in Aspire would have to wait for a future release though, since it won't be supported in the 8.0 DCP release.

danegsta commented 2 weeks ago

An updated DCP build has been merged with info timeout changes that should fix the cause of this particular non-zero exit code, but it's also worth improving the timeout logic in Aspire to provide a better error message (we probably shouldn't be throwing an exception reporting that a process exited unexpectedly when we're the ones killing the process).