Closed by vjeffrey 4 years ago
How do we rank the errors? Two ways: by machine count (how many machines incurred the error) and by error count (how many times the error occurred).
Let's keep this a bit open-ended and start to break off little bits. Designs are still in progress.
Consider creating a beta endpoint, or not exposing the endpoint in the gateway yet.
We may need to break this down into subtasks/other cards; whoever grabs this card should do that.
General idea of the request and response needs:

request/
  days ago: 1
  error_count: 2
  type: (machine_count || error_count)

response/
  [ {"error": "error msg", "count": 50}, {"error": "error msg", "count": 30} ]
Question: should we do a "top errors" API and a separate "error details" API, or put it all in one? E.g. "error details API: given an error and a time frame, return the list of machines that incurred the error. The list should be sortable." What info do we want about the nodes? Platform, environment, ...?
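If we did split out an error details endpoint, its shape might look something like this (purely illustrative Go; all field names are placeholders, and the node fields are exactly the open question above):

```go
package cfgmgmt

// ErrorDetailsRequest: given an error and a time frame, return the machines
// that incurred it. Every field name here is a placeholder for discussion.
type ErrorDetailsRequest struct {
	ErrorClass   string `json:"error_class"`
	ErrorMessage string `json:"error_message"`
	DaysAgo      int    `json:"days_ago"`
	SortField    string `json:"sort_field"` // the list should be sortable
}

// AffectedNode is a guess at the per-node fields we might want to return.
type AffectedNode struct {
	NodeID      string `json:"node_id"`
	Name        string `json:"name"`
	Platform    string `json:"platform"`    // candidate field from the question above
	Environment string `json:"environment"` // candidate field from the question above
}
```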
From standup conversations: error == error message. What happens if one error affected a whole ton of machines because of a bad code change? How does the error ever go away?
How do we expect the customer to use this? Here are some hypothetical scenarios that I believe illustrate the expected usage:
One: Code Defect
The customer pushes a code change with a defect. On Windows systems that haven't yet installed last week's OS patches, the business application patches will not install. The customer sees an elevated rate of errors (from a notification, maybe?), clicks the top error, sees the list of nodes with the error, and sees they are all trying to apply revision_id "abc123def456" of the Chef Infra policy and are running Windows 10. The customer reproduces the error in the test lab, updates the code, and pushes a new change. The customer sees the count of patch install errors decrease as systems run the fixed Chef Infra code.
Two: Upstream Service Failure
A software vendor's package repo goes offline. The customer sees an elevated error rate and clicks the top error. From the message, the customer can see that the desktop systems are unable to connect to the vendor's package repo. Solution A: the customer visits the vendor's status page, is reassured the problem will be fixed, and in a little while sees the error rate drop as the package repo service is restored. Solution B: the customer decides to host the software on their company's internal content system, and pushes a code change to install the software from the internal mirror. As the desktops pick up the code change, the error rate decreases.
Three: "Background Radiation" of Errors
The remote office in Fiji has a flaky network and software downloads fail about 10% of the time. The customer determines (how?) that this rate has increased to 80%. The customer decides to install a local mirror in the Fiji office and updates the Chef Infra code so that Fiji office desktops use the mirror.
The first two cases are (IMO) ideally served by reporting only "unresolved" or "active" errors. That is, when a Chef Client run is successful on a system that was previously affected by the error condition, we should no longer report that system as impacted by the error.
I am unsure if "top errors" is the right conceptual framework for exploring/addressing case 3.
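One way to express the "active errors" idea above: only count a node toward an error if its most recent run ended in that error. A rough sketch in Go, with invented types just to illustrate the rule (not the real ingest data model):

```go
package cfgmgmt

// NodeRun is an invented record of a node's most recent Chef Client run.
type NodeRun struct {
	NodeID       string
	Status       string // "success" or "failure"
	ErrorClass   string
	ErrorMessage string
}

// activeErrors keeps only nodes whose latest run failed. A node that has
// since converged successfully no longer counts toward any error.
func activeErrors(latestRuns []NodeRun) []NodeRun {
	var stillFailing []NodeRun
	for _, run := range latestRuns {
		if run.Status == "failure" {
			stillFailing = append(stillFailing, run)
		}
	}
	return stillFailing
}
```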
Other considerations:
Another thing I want to discuss is how similar errors should be, and in what ways, in order for us to group them as a single top error. An example of the information we presently collect is in the Automate sample data: https://github.com/chef/automate/blob/d85b9925d4a854807c353628d620828e88ded34b/components/ingest-service/examples/converge-failure-report.json#L3284
This includes:
- the error class, Chef::Exceptions::EnclosingDirectoryDoesNotExist in the linked example. Contrarily, if the errors are coming because the customer put a bare raise statement in the code, these will all be RuntimeError.
- the error message, "Parent directory /failed/file does not exist.", which is specific to this one instance of the EnclosingDirectoryDoesNotExist condition.
We could also collect more information, with varying degrees of effort.
Given the way that automate works today, the "default" implementation (read: least engineering effort) would be to aggregate the errors based on error class and message being exactly equal.
In the case of a defect in the customer's Chef Infra code, the most likely behavior is that all Chef runs across all affected machines would fail in an identical way, resulting in identical error class and message.
In the case of an upstream service dependency failure, the error class and message could vary widely. For example, a misbehaving HTTP service can result in a range of low level network errors, timeouts, and HTTP-level errors (4XX and 5XX responses) which all would have distinct error text. It would likely be useful to the end user if these could be attributed to a single cause, but that is a significant engineering challenge. We could attempt to aggregate the errors based on the Chef Infra resource name, but this would not work if many different resources were failing because of a single upstream service.
Given the above, I'd recommend that we start by aggregating errors based on class and message. We should invest in verifying that this works acceptably on real customer data.
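As a sketch of that default aggregation: group failed runs by exact (error class, error message) equality, counting both distinct machines and total occurrences. Illustrative Go only, reusing the invented NodeRun type from the sketch above; this is not the actual ingest/config-mgmt implementation:

```go
package cfgmgmt

// errorKey is the exact-match grouping key: error class + message.
type errorKey struct {
	Class   string
	Message string
}

// errorStats tracks both ways of counting a grouped error.
type errorStats struct {
	MachineCount int // distinct nodes that hit this error
	ErrorCount   int // total occurrences across all failed runs
}

// aggregate groups failed runs by exact (class, message) equality.
func aggregate(failedRuns []NodeRun) map[errorKey]*errorStats {
	stats := map[errorKey]*errorStats{}
	seen := map[errorKey]map[string]bool{}
	for _, run := range failedRuns {
		key := errorKey{Class: run.ErrorClass, Message: run.ErrorMessage}
		if stats[key] == nil {
			stats[key] = &errorStats{}
			seen[key] = map[string]bool{}
		}
		stats[key].ErrorCount++
		if !seen[key][run.NodeID] {
			seen[key][run.NodeID] = true
			stats[key].MachineCount++
		}
	}
	return stats
}
```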
curl --silent --insecure -H "api-token: token" https://a2-dev.test/api/v0/cfgmgmt/errors
User Story
In order to support the desktop offering view, we need to provide data to support a "top errors" view.
Note: we should strive to include both top errors by machine count (how many machines incurred this error) and top errors by error count (how many times this error occurred).
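Given an aggregation like the one sketched earlier in this issue, the two rankings are just two sort orders over the same data. A minimal illustration, reusing the invented errorKey/errorStats types from that sketch (not the real service code):

```go
package cfgmgmt

import "sort"

// rankedError is one row of a "top errors" response, carrying both counts.
type rankedError struct {
	Class, Message string
	MachineCount   int
	ErrorCount     int
}

// topErrors sorts aggregated stats by machine count or by occurrence count,
// depending on the requested type, and returns the first n entries.
func topErrors(stats map[errorKey]*errorStats, byMachines bool, n int) []rankedError {
	ranked := make([]rankedError, 0, len(stats))
	for k, s := range stats {
		ranked = append(ranked, rankedError{k.Class, k.Message, s.MachineCount, s.ErrorCount})
	}
	sort.Slice(ranked, func(i, j int) bool {
		if byMachines {
			return ranked[i].MachineCount > ranked[j].MachineCount
		}
		return ranked[i].ErrorCount > ranked[j].ErrorCount
	})
	if n < len(ranked) {
		ranked = ranked[:n]
	}
	return ranked
}
```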
Considerations
Since this is useful information that can be used in views that are not desktop-only, we should make sure we design this API in a generic way. Please refer to the epic description, Q&A doc, and designs to validate requirements. Chat with @apriofrost @chef/superteam about questions. The API endpoint should be discussed with the team and agreed upon with UX and Product. Let's try to keep most discussion/conclusions documented in this issue so we can reference them later.
https://docs.google.com/document/d/1OK9glSYfuWdqQwJWY_-neGuREJRS462mH8HH0dUEbT0/edit
https://chef.invisionapp.com/share/E9W3TEFXV87#/screens/406241157_Desktop_Dashboard-V2-Changes_Selected
Aha! Link: https://chef.aha.io/features/SH-530