elastic / uptime

This project includes resources and general issue tracking for the Elastic Uptime solution

[META] Severe error reporting UI / Backend #431

Closed andrewvc closed 2 years ago

andrewvc commented 2 years ago

This picks up where https://github.com/elastic/uptime/issues/425 left off.

We need a better UX around severe failures in heartbeat that cannot be correlated to a specific monitor. This was specifically inspired by https://github.com/elastic/uptime/issues/425 , where a failure to download a zip_url, or to load the files inside it (perhaps due to a syntax error), should create an actionable error in the Uptime UI rather than force the user to look through logs. That is something we're addressing in beats in https://github.com/elastic/beats/issues/29692.

To accomplish this we'll need to:

  1. Have heartbeat send error documents to ES rather than logging wherever possible
  2. Design a UI to display these error documents.

Proposed schema for error documents, using the ECS fields for Error and Event:

// datastream: synthetics-errors-default , could be synthetics-events-default if we want this index to be more generic for future uses
{
  "@timestamp": "2022-01-01T00:00:00",
  timespan: { ... }, // timespan field indicates how long this error is relevant for
  event: {
    action: "run-suite",
  },
  error: {
    id: "unique-id-generated-by-application-or-service",
    message: "syntax error",
    type: "synthetics.suite.syntax-error",
    stack_trace: [ ... ], // optional
  },
  monitor: { // all fields here are optional, only necessary if it can be linked to a monitor
    id: "my-monitor",
    name: "My Monitor",
  },
}
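For illustration, the proposed document could be typed roughly as follows. This is only a sketch: `SevereErrorDoc` and `buildErrorDoc` are hypothetical names, not existing heartbeat or Kibana code, and the `timespan` field is left out because its shape is not spelled out above.

```typescript
// Hypothetical types mirroring the proposed schema; names are illustrative only.
interface SevereErrorDoc {
  "@timestamp": string;
  event: { action: string };
  error: {
    id: string;
    message: string;
    type: string;
    stack_trace?: string[]; // optional, shown as an array in the proposal
  };
  // Optional: only present when the error can be linked to a monitor.
  monitor?: { id: string; name?: string };
}

interface BuildArgs {
  id: string;
  action: string;
  type: string;
  message: string;
  monitor?: { id: string; name?: string };
}

// Builds an error document; `now` is injectable so the result is deterministic in tests.
function buildErrorDoc(args: BuildArgs, now: Date = new Date()): SevereErrorDoc {
  const doc: SevereErrorDoc = {
    "@timestamp": now.toISOString(),
    event: { action: args.action },
    error: { id: args.id, message: args.message, type: args.type },
  };
  if (args.monitor) {
    doc.monitor = args.monitor;
  }
  return doc;
}
```

A failure that cannot be tied to a monitor simply omits `monitor`, which is the case this issue is mainly about.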

CC @liciavale

As a side note, we could consider having this new index be more generic, an events index that could be used for other non-monitor result data than just these errors.


andrewvc commented 2 years ago

Having spoken to @liciavale about this, I think we should revisit the idea discussed here: https://github.com/elastic/beats/issues/27924#issuecomment-992861432 . While a generic error facility may be useful, we're straying a good ways from the initial use case. In our conversation we discussed that it makes more sense for the entities on the overview page to mirror those on the monitor management page.

In other words, if the monitor management page shows:

and the overview page shows (as it would with the plan in this issue)

That's just confusing, even if we add a toast somewhere. It would be better just to display the error next to My Suite on both the monitor management page and the overview page.

WRT the detail page for the My Suite suite, we can work out those details later, but the critical part would be that the ping history links to individual journey runs, which should be pretty easy.

If we do add My Suite to the overview page, a question we have to ask is what does its status represent? It could represent simply whether the suite started, or, in addition to that, whether the suite's journeys all completed. The latter seems preferable, but we may have to default to the former for performance reasons.
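To make the two candidate meanings concrete, here is a rough sketch. The `SuiteRun` shape, the function names, and the "up"/"down" labels are assumptions made for illustration, not an existing API.

```typescript
// Hypothetical shapes; nothing here reflects the actual Uptime data model.
interface JourneyResult {
  id: string;
  completed: boolean;
}

interface SuiteRun {
  started: boolean;
  journeys: JourneyResult[];
}

type Status = "up" | "down";

// Option 1: status only reflects whether the suite started.
function statusByStart(run: SuiteRun): Status {
  return run.started ? "up" : "down";
}

// Option 2: the suite started AND every journey completed.
// Note: a started suite with zero journeys counts as "up" here,
// because Array.prototype.every returns true for an empty array.
function statusByCompletion(run: SuiteRun): Status {
  return run.started && run.journeys.every((j) => j.completed) ? "up" : "down";
}
```

The second option needs every journey result for the run, which is where the performance concern comes from; the first only needs a single signal per suite.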

dominiqueclarke commented 2 years ago

@andrewvc I'm on board with keeping monitor management and overview in sync. However, part of the original goal was to find a fix for this issue that also worked in the short term, and ideally also for heartbeat.yml-based monitors.

It sounds like we should continue to pursue the generic error facility, and plan the proposed UX into an upcoming suite feature for Monitor Management.

A generic error facility would be useful in the short term, until we release the discovery phase of heartbeat, which will allow us to enable suites in Monitor Management in a future beta release. Suites are currently disabled for tech preview. A generic error facility would also be useful for YAML-based users, as well as for surfacing generic service errors not specific to an individual monitor.

vigneshshanmugam commented 2 years ago

Agree with the confusion part.

Just a thought: what if we establish a relationship between the suite and the final monitors that are created from it, and nest them inside the My Suite monitor?

The relationship already exists, but we don't use it anywhere in the UI. I thought this would be a nice opportunity to explore that idea.
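A minimal sketch of what that nesting could look like on the UI side, assuming each monitor summary carries an optional `suiteId` (a hypothetical field name; how the relationship is actually stored is not shown here):

```typescript
// Hypothetical summary shape; `suiteId` stands in for however the
// suite-to-monitor relationship is actually recorded.
interface MonitorSummary {
  id: string;
  name: string;
  suiteId?: string;
}

interface NestedMonitors {
  suites: Map<string, MonitorSummary[]>; // suite id -> monitors created from it
  standalone: MonitorSummary[]; // monitors with no parent suite
}

// Groups monitors under their parent suite, preserving input order within each group.
function nestBySuite(monitors: MonitorSummary[]): NestedMonitors {
  const suites = new Map<string, MonitorSummary[]>();
  const standalone: MonitorSummary[] = [];
  for (const m of monitors) {
    if (m.suiteId === undefined) {
      standalone.push(m);
      continue;
    }
    const group = suites.get(m.suiteId) ?? [];
    group.push(m);
    suites.set(m.suiteId, group);
  }
  return { suites, standalone };
}
```

With this shape, a suite-level error could be rendered on the suite row itself, with the individual journeys listed underneath.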

andrewvc commented 2 years ago

After speaking with @liciavale I'm closing this in favor of https://github.com/elastic/uptime/issues/435 . Still open to any good reasons not to go in the new direction, but let's move the convo there for now.