Closed by vjeffrey 4 years ago
How do we rank the errors? Two ways: by machine count (how many machines incurred the error) and by error count (how many times the error occurred).
Let's keep this a bit open-ended and start to break off little bits. Designs are still in progress.
Consider creating a beta endpoint, or not exposing the endpoint in the gateway yet.
We may need to break this down into subtasks/other cards; whoever grabs this card should do that.
General idea of the request and response needs:

request/
  days ago: 1
  error_count: 2
  type: (machine_count || error_count)

response/
  [ {"error": "error msg", "count": 50}, {"error": "error msg", "count": 30} ]
Question: should we do a "top errors" API and a separate "error details" API, or put it all in one? E.g. "error details API: given an error and a time frame, return the list of machines that incurred the error. The list should be sortable." What info do we want about the nodes? Platform, environment, ...?
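If we did split out an error details endpoint, its shape might look something like this (purely illustrative Go; all field names are placeholders, and the node fields are exactly the open question above):

```go
package cfgmgmt

// ErrorDetailsRequest: given an error and a time frame, return the machines
// that incurred it. Every field name here is a placeholder for discussion.
type ErrorDetailsRequest struct {
	ErrorClass   string `json:"error_class"`
	ErrorMessage string `json:"error_message"`
	DaysAgo      int    `json:"days_ago"`
	SortField    string `json:"sort_field"` // the list should be sortable
}

// AffectedNode is a guess at the per-node fields we might want to return.
type AffectedNode struct {
	NodeID      string `json:"node_id"`
	Name        string `json:"name"`
	Platform    string `json:"platform"`    // candidate field from the question above
	Environment string `json:"environment"` // candidate field from the question above
}
```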
From standup conversations: error == error message. What happens if one error affected a whole ton of machines because of a bad code change? How does the error ever go away?
How do we expect the customer to use this? Here are some hypothetical scenarios that I believe illustrate the expected usage:
One: Code Defect
The customer pushes a code change with a defect. On Windows systems that haven't yet installed last week's OS patches, the business application patches will not install. The customer sees an elevated rate of errors (from a notification, maybe?), clicks the top error, sees the list of nodes with the error, and sees they are all trying to apply revision_id "abc123def456" of the Chef Infra policy and are running Windows 10. The customer reproduces the error in the test lab, updates the code, and pushes a new change. The customer sees the count of patch install errors decrease as systems run the fixed Chef Infra code.
Two: Upstream Service Failure
A software vendor's package repo goes offline. The customer sees an elevated error rate and clicks the top error. From the message, the customer can see that the desktop systems are unable to connect to the vendor's package repo. Solution A: the customer visits the vendor's status page, is reassured the problem will be fixed, and in a little while sees the error rate drop as the package repo service is restored. Solution B: the customer decides to host the software on their company's internal content system, and pushes a code change to install the software from the internal mirror. As the desktops pick up the code change, the error rate decreases.
Three: "Background Radiation" of Errors
The remote office in Fiji has a flaky network and software downloads fail about 10% of the time. The customer determines (how?) that this rate has increased to 80%. The customer decides to install a local mirror in the Fiji office and updates the Chef Infra code so that Fiji office desktops use the mirror.
The first two cases are (IMO) ideally served by reporting only "unresolved" or "active" errors. That is, when a Chef Client run is successful on a system that was previously affected by the error condition, we should no longer report that system as impacted by the error.
I am unsure if "top errors" is the right conceptual framework for exploring/addressing case 3.
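One way to express the "active errors" idea above: only count a node toward an error if its most recent run ended in that error. A rough sketch in Go, with invented types just to illustrate the rule (not the real ingest data model):

```go
package cfgmgmt

// NodeRun is an invented record of a node's most recent Chef Client run.
type NodeRun struct {
	NodeID       string
	Status       string // "success" or "failure"
	ErrorClass   string
	ErrorMessage string
}

// activeErrors keeps only nodes whose latest run failed. A node that has
// since converged successfully no longer counts toward any error.
func activeErrors(latestRuns []NodeRun) []NodeRun {
	var stillFailing []NodeRun
	for _, run := range latestRuns {
		if run.Status == "failure" {
			stillFailing = append(stillFailing, run)
		}
	}
	return stillFailing
}
```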
Other considerations:
Another thing I want to discuss is how similar errors should be, and in what ways, in order for us to group them as a single top error. An example of the information we presently collect is in the Automate sample data: https://github.com/chef/automate/blob/d85b9925d4a854807c353628d620828e88ded34b/components/ingest-service/examples/converge-failure-report.json#L3284
This includes:
- the error class, Chef::Exceptions::EnclosingDirectoryDoesNotExist in the linked example. Contrarily, if the errors are coming because the customer put a bare raise statement in the code, these will all be RuntimeError.
- the error message, "Parent directory /failed/file does not exist.", which is specific to this one instance of the EnclosingDirectoryDoesNotExist condition.
We could also collect more information, with varying degrees of effort.
Given the way that automate works today, the "default" implementation (read: least engineering effort) would be to aggregate the errors based on error class and message being exactly equal.
In the case of a defect in the customer's Chef Infra code, the most likely behavior is that all Chef runs across all affected machines would fail in an identical way, resulting in identical error class and message.
In the case of an upstream service dependency failure, the error class and message could vary widely. For example, a misbehaving HTTP service can result in a range of low level network errors, timeouts, and HTTP-level errors (4XX and 5XX responses) which all would have distinct error text. It would likely be useful to the end user if these could be attributed to a single cause, but that is a significant engineering challenge. We could attempt to aggregate the errors based on the Chef Infra resource name, but this would not work if many different resources were failing because of a single upstream service.
Given the above, I'd recommend that we start by aggregating errors based on class and message. We should invest in verifying that this works acceptably on real customer data.
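As a sketch of that default aggregation: group failed runs by exact (error class, error message) equality, counting both distinct machines and total occurrences. Illustrative Go only, reusing the invented NodeRun type from the sketch above; this is not the actual ingest/config-mgmt implementation:

```go
package cfgmgmt

// errorKey is the exact-match grouping key: error class + message.
type errorKey struct {
	Class   string
	Message string
}

// errorStats tracks both ways of counting a grouped error.
type errorStats struct {
	MachineCount int // distinct nodes that hit this error
	ErrorCount   int // total occurrences across all failed runs
}

// aggregate groups failed runs by exact (class, message) equality.
func aggregate(failedRuns []NodeRun) map[errorKey]*errorStats {
	stats := map[errorKey]*errorStats{}
	seen := map[errorKey]map[string]bool{}
	for _, run := range failedRuns {
		key := errorKey{Class: run.ErrorClass, Message: run.ErrorMessage}
		if stats[key] == nil {
			stats[key] = &errorStats{}
			seen[key] = map[string]bool{}
		}
		stats[key].ErrorCount++
		if !seen[key][run.NodeID] {
			seen[key][run.NodeID] = true
			stats[key].MachineCount++
		}
	}
	return stats
}
```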
curl --silent --insecure -H "api-token: token" https://a2-dev.test/api/v0/cfgmgmt/errors
User Story
In order to support the desktop offering view, we need to provide data to support a "top errors" view.
Note: we should strive to include both top errors by machine count (how many machines incurred this error) and top errors by error count (how many times this error occurred).
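Given an aggregation like the one sketched earlier in this issue, the two rankings are just two sort orders over the same data. A minimal illustration, reusing the invented errorKey/errorStats types from that sketch (not the real service code):

```go
package cfgmgmt

import "sort"

// rankedError is one row of a "top errors" response, carrying both counts.
type rankedError struct {
	Class, Message string
	MachineCount   int
	ErrorCount     int
}

// topErrors sorts aggregated stats by machine count or by occurrence count,
// depending on the requested type, and returns the first n entries.
func topErrors(stats map[errorKey]*errorStats, byMachines bool, n int) []rankedError {
	ranked := make([]rankedError, 0, len(stats))
	for k, s := range stats {
		ranked = append(ranked, rankedError{k.Class, k.Message, s.MachineCount, s.ErrorCount})
	}
	sort.Slice(ranked, func(i, j int) bool {
		if byMachines {
			return ranked[i].MachineCount > ranked[j].MachineCount
		}
		return ranked[i].ErrorCount > ranked[j].ErrorCount
	})
	if n < len(ranked) {
		ranked = ranked[:n]
	}
	return ranked
}
```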
Considerations
Since this is useful information that can be used in views that are not desktop-only, we should make sure we design this API in a generic way. Please refer to the epic description, Q&A doc, and designs to validate requirements. Chat with @apriofrost @chef/superteam about questions. The API endpoint should be discussed with the team and agreed upon with UX and Product. Let's try to keep most discussion/conclusions documented in this issue so we can reference them later.
https://docs.google.com/document/d/1OK9glSYfuWdqQwJWY_-neGuREJRS462mH8HH0dUEbT0/edit
https://chef.invisionapp.com/share/E9W3TEFXV87#/screens/406241157_Desktop_Dashboard-V2-Changes_Selected
Aha! Link: https://chef.aha.io/features/SH-530