Azure / azure-dev

A developer CLI that reduces the time it takes for you to get started on Azure. The Azure Developer CLI (azd) provides a set of developer-friendly commands that map to key stages in your workflow - code, build, deploy, monitor, repeat.
https://aka.ms/azd
MIT License

Crash / Hang report generation strategy #281

Open vhvb1989 opened 2 years ago

vhvb1989 commented 2 years ago

Report a crash or hang

Design and implement a solution for generating logs and diagnostic info when azd hangs or crashes while running any command.

Hang scenarios

  1. The user runs azd up -t templateName and it never finishes, blocking the terminal indefinitely. The user goes to the Azure Portal and sees all the services deployed and the applications working as expected, but azd is still waiting on something (stuck at the end).
  2. The user runs azd <command> <args> and it hangs, blocking the console until the user kills the process or closes the terminal. The user inspects the expected output and finds it incomplete: missing resources, a missing deployment, missing settings, etc. (stuck in the middle).
  3. The user runs some command and azd hangs without making any progress (stuck from the beginning).

Crash scenarios

Similar to the hang scenarios, azd might crash or panic at the end, in the middle, or at the start of a command.

General concepts and Open Questions

Proposals

Timeout + file logs

For handling panic scenarios:

karolz-ms commented 2 years ago

@vhvb1989 thank you for writing this proposal and starting the discussion. I think we are moving in the right direction here. Some thoughts, in no particular order:

I support having timeouts apply systematically to all external calls. Note, though, that choosing the right timeout value for a given operation is sometimes very difficult. For example, Azure deployments can take an essentially arbitrary amount of time (ask me how I know). Perhaps the best compromise is a combination of relatively short timeouts for selected operations that we know should complete quickly, plus an overall "this is how long the whole azd invocation can take" timeout (overridable by the user).

Speaking of timeouts, the UX rule of thumb is that any command taking more than a couple of seconds should have some way of providing progress information. Our current UX has some room for improvement there.

I also think creating log files is a good idea, although I am not sure that letting the user change whether they are created per-environment is a good idea. I would rather have this as a per-user setting, with a command-line parameter to override the default. I would also consider having log creation on by default for preview builds of azd and off by default for "production" builds, and keeping log files only for failed azd invocations. That is, upon successful invocation, the log file would be deleted immediately. And I would not keep them in case of a "user error" (e.g. parameter validation failed), assuming that our error messages are reasonably clear and actionable.

Printing the path to the log file upon failure is a great idea. Giving the user a link to create an issue on GH when azd fails is something I would advise against. VS Code has that facility for extensions, and we used to use it, but we abandoned it for most of our extensions after a while. The reason is that the vast majority of the resulting bug reports were very low-quality, with users hitting the "create issue" link without bothering to read the error message, even when it was clear and actionable. Nowadays we are getting much better, if less voluminous, information from users who know how to find our repo via the VS Code Marketplace, and for error elimination work we augment that with information from telemetry.

Hope this helps!