Azure / azure-dev

A developer CLI that reduces the time it takes for you to get started on Azure. The Azure Developer CLI (azd) provides a set of developer-friendly commands that map to key stages in your workflow - code, build, deploy, monitor, repeat.

MIT License


Report a crash or hang

Design and implement a way to generate logs and diagnostic information when azd hangs or crashes while running a command.

Hang scenarios

The user runs azd up -t templateName and it never finishes, blocking the terminal indefinitely. In the Azure Portal, the user sees all the services deployed and the applications working as expected, but azd is still waiting on something (stuck at the end).
The user runs azd <command> <args> and it hangs, blocking the console until the user kills the process or closes the terminal. The user inspects the expected output and finds it incomplete: missing resources, a missing deployment, missing settings, etc. (stuck in the middle).
The user runs a command and azd hangs without making any progress (stuck from the beginning).

Crash scenarios

As with hangs, azd might crash or panic at the start, in the middle, or at the end of a command.

General concepts and Open Questions

azd can be run in parallel from multiple consoles
azd can create multiple environments and run commands for each environment.
How should each scenario be handled?
How can users report and share data with the azd team for investigation?

Proposals

Timeout + file logs

Review all I/O calls made by azd and make sure each one has a default timeout, so that no single call can hang the app.
Create log files inside the environment (by default for public preview)
- Create a file with an auto-generated name like .log
- Write logs to the file while commands are running
Add an option to the environment command to turn file logging on or off (so people can opt out of it during public preview)
When a context-timeout error occurs, print the path to the log file in the console, along with a link to create a new issue in the repo. Ask users to attach the log to the issue.
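
A minimal Go sketch of this proposal, under stated assumptions: `runWithTimeout`, `slowOp`, and `demo` are hypothetical names, not existing azd code, and the `azd-*.log` pattern is only a placeholder for whatever auto-generated name is chosen. It shows a command wrapped in a context deadline, a log file created inside the environment directory, and the log path printed when the deadline is exceeded. The issue link is left out here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"os"
	"time"
)

// runWithTimeout (hypothetical helper) runs op under a deadline while writing
// logs to a new file inside envDir. It returns the log file path and op's error.
func runWithTimeout(envDir string, timeout time.Duration, op func(ctx context.Context) error) (string, error) {
	// The file name pattern is an assumption made for this sketch.
	logFile, err := os.CreateTemp(envDir, "azd-*.log")
	if err != nil {
		return "", err
	}
	defer logFile.Close()
	logger := log.New(logFile, "", log.LstdFlags)

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	logger.Println("command started")
	err = op(ctx)
	logger.Printf("command finished: err=%v", err)

	// On a context timeout, tell the user where the logs are.
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Fprintf(os.Stderr, "azd timed out. Logs were written to %s\n", logFile.Name())
	}
	return logFile.Name(), err
}

// slowOp stands in for an I/O call that honors context cancellation.
func slowOp(ctx context.Context) error {
	select {
	case <-time.After(time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demo runs slowOp with a 50ms deadline in a throwaway environment directory.
func demo() (logPath string, timedOut bool) {
	dir, err := os.MkdirTemp("", "azd-env-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	logPath, err = runWithTimeout(dir, 50*time.Millisecond, slowOp)
	return logPath, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	path, timedOut := demo()
	fmt.Println("log:", path, "timed out:", timedOut)
}
```

The deadline only helps if every I/O call underneath actually observes `ctx`, which is why the review of all I/O calls comes first in the list.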

For handling panic scenarios:

Use defer to register a function that prints the name of the log file if the command did not finish.
The function is invoked before the app terminates, so the user knows which file to share when asking for help.
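
The defer approach could look roughly like the sketch below. `runCommand`, `demoCrash`, and the log path are hypothetical and only illustrate the idea; this is not how azd is actually structured.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// runCommand (hypothetical) runs cmd and, if cmd panics or otherwise does not
// reach its end, writes the log file path to out before returning.
func runCommand(out io.Writer, logPath string, cmd func()) (finished bool) {
	defer func() {
		// recover stops the panic so the message below is printed cleanly.
		if r := recover(); r != nil {
			fmt.Fprintf(out, "azd crashed: %v\n", r)
		}
		if !finished {
			fmt.Fprintf(out, "Logs for this run: %s\n", logPath)
		}
	}()
	cmd()
	finished = true
	return finished
}

// demoCrash runs a command that panics and returns what the user would see.
func demoCrash() string {
	var out bytes.Buffer
	runCommand(&out, "/tmp/azd-demo.log", func() { panic("boom") })
	return out.String()
}

func main() {
	fmt.Print(demoCrash())
}
```

Two limits are worth keeping in mind: deferred functions do not run when the process exits through `os.Exit`, and `recover` only catches panics raised on the same goroutine, so a panic in a background goroutine would bypass this handler.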

@vhvb1989 thank you for writing this proposal and starting the discussion. I think we are moving in the right direction here. Some thoughts, in no particular order:

I support applying timeouts systematically to all external calls. Note, though, that choosing the right timeout value for a given operation is sometimes very difficult. For example, Azure deployments can take an essentially arbitrarily long time (ask me how I know). Perhaps the best compromise is a combination of relatively short timeouts for selected operations that we know should complete quickly, plus an overall "this is how long the whole azd invocation can take" timeout (overridable by the user).
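
A rough sketch of that combination in Go, assuming a hypothetical `step` type and `runSteps` helper (neither exists in azd). Steps with a known short duration get their own budget; a zero budget means the step is bounded only by the overall deadline, which is the case for something like a deployment.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// step is one external call. A zero budget means "no per-step limit", so only
// the overall deadline applies.
type step struct {
	name   string
	budget time.Duration
	run    func(ctx context.Context) error
}

// runSteps runs steps in order under one overall deadline, adding a shorter
// per-step deadline where a budget is set.
func runSteps(overall time.Duration, steps []step) error {
	ctx, cancel := context.WithTimeout(context.Background(), overall)
	defer cancel()
	for _, s := range steps {
		stepCtx, stepCancel := ctx, context.CancelFunc(func() {})
		if s.budget > 0 {
			stepCtx, stepCancel = context.WithTimeout(ctx, s.budget)
		}
		err := s.run(stepCtx)
		stepCancel()
		if err != nil {
			return fmt.Errorf("%s: %w", s.name, err)
		}
	}
	return nil
}

// sleepFor simulates a call that takes d to complete and honors cancellation.
func sleepFor(d time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// demo exercises a per-step timeout, an overall timeout, and a normal run.
func demo() (stepTimedOut, overallTimedOut, completed bool) {
	errStep := runSteps(time.Second, []step{{"login", 20 * time.Millisecond, sleepFor(200 * time.Millisecond)}})
	errAll := runSteps(50*time.Millisecond, []step{{"deploy", 0, sleepFor(200 * time.Millisecond)}})
	errOK := runSteps(time.Second, []step{{"deploy", 0, sleepFor(30 * time.Millisecond)}})
	return errors.Is(errStep, context.DeadlineExceeded),
		errors.Is(errAll, context.DeadlineExceeded),
		errOK == nil
}

func main() {
	fmt.Println(demo())
}
```

Because the per-step context is derived from the overall one, a step can never outlive the whole-invocation deadline, and the user-overridable value would simply be the `overall` argument.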

Speaking of timeouts, the UX rule of thumb is that any command taking more than a couple of seconds should have some way of providing progress information. Our current UX has some room for improvement there.

I also think creating log files is a good idea, although I am not sure that letting the user control whether they are created on a per-environment basis is the right approach. I would rather make it a per-user setting, with a command-line parameter to override the default. I would also consider having log creation on by default for preview builds of azd and off by default for "production" builds, and keeping log files only for failed azd invocations. That is, upon a successful invocation, the log file would be deleted immediately. I would also not keep them in the case of a "user error" (e.g. parameter validation failed), assuming that our error messages are reasonably clear and actionable.
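
To make the retention rule concrete, here is a small Go sketch. The `errUser` sentinel and the `keepLog` and `finishLog` helpers are hypothetical; how azd would actually classify user errors is an open question.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errUser marks failures caused by invalid user input (hypothetical sentinel).
var errUser = errors.New("user error")

// keepLog reports whether the log of an invocation that ended with runErr
// should be kept: only for failures that are not user errors.
func keepLog(runErr error) bool {
	return runErr != nil && !errors.Is(runErr, errUser)
}

// finishLog deletes the log file unless keepLog says it should be kept.
func finishLog(logPath string, runErr error) error {
	if keepLog(runErr) {
		return nil
	}
	return os.Remove(logPath)
}

// demo reports whether the log file still exists after a successful run,
// a user error, and an unexpected failure.
func demo() (afterSuccess, afterUserErr, afterFailure bool) {
	exists := func(runErr error) bool {
		f, err := os.CreateTemp("", "azd-*.log")
		if err != nil {
			panic(err)
		}
		f.Close()
		defer os.Remove(f.Name()) // clean up whatever the policy kept
		if err := finishLog(f.Name(), runErr); err != nil {
			panic(err)
		}
		_, statErr := os.Stat(f.Name())
		return statErr == nil
	}
	return exists(nil),
		exists(fmt.Errorf("invalid flag: %w", errUser)),
		exists(errors.New("deployment failed"))
}

func main() {
	fmt.Println(demo())
}
```

Wrapping user errors with `%w` keeps the original message intact while still letting the policy recognize them through `errors.Is`.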

Printing the path to the log file upon failure is a great idea. Giving the user a link to create an issue on GH when azd fails is something I would advise against. VS Code has that facility for extensions, and we used to use it, but we abandoned it for most of our extensions after a while. The reason is that the great majority of the resulting bug reports were very low quality, with users hitting the "create issue" link without bothering to read the error message, even when it was clear and actionable. Nowadays we are getting much better, if less voluminous, information from users who know how to find our repo via the VS Code Marketplace, and, for error elimination work, we augment that with information from telemetry.

Hope this helps!