MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com

Surface errors in sources #24676

Open teskje opened 7 months ago

teskje commented 7 months ago

Feature request

Errors in sources (e.g. Avro decoding errors) can have a serious negative impact on the UX of Materialize. In particular:

The latter issue is the more problematic one because it is not obviously caused by errors. Users might just see their cluster starting to OOM, without any apparent reason.

Errors in sources are almost certainly caused by some usage error, so we can't prevent them. But we should do a better job of surfacing their existence to users. Today, the only way users can find out about errors in their sources is by querying the affected sources. Instead, we should proactively warn them about the existence of errors in the console.

The database should make error counts for all sources conveniently available, e.g., in a SQL relation. The console should show these error counts in an appropriate place, e.g., together with the overall source status.
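To make the idea concrete, here is a minimal sketch of how a console might combine per-source status with per-source error counts once such a relation exists. The relation names `source_statuses` and `source_error_counts` and their columns are hypothetical and are not existing Materialize catalog objects; SQLite stands in for the database only so that the example runs on its own.

```python
import sqlite3


def sources_with_errors(conn):
    """Return (name, status, error_count) per source, highest error count first."""
    return conn.execute(
        """
        SELECT s.name, s.status, COALESCE(e.error_count, 0) AS error_count
        FROM source_statuses AS s
        LEFT JOIN source_error_counts AS e ON e.source_id = s.id
        ORDER BY error_count DESC, s.name
        """
    ).fetchall()


# Stand-in data. Both relations are hypothetical illustrations.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_statuses (id TEXT PRIMARY KEY, name TEXT, status TEXT);
    CREATE TABLE source_error_counts (source_id TEXT, error_count INTEGER);
    INSERT INTO source_statuses VALUES
        ('u1', 'orders', 'running'),
        ('u2', 'users', 'running');
    INSERT INTO source_error_counts VALUES ('u1', 3);
    """
)

rows = sources_with_errors(conn)
for name, status, error_count in rows:
    print(f"{name}: status={status}, errors={error_count}")
```

The `LEFT JOIN` keeps sources that have no row in the error-count relation and reports them with a count of zero, so the overview always lists every source.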

Related Issues

#22430 is about providing a way to efficiently extract errors from any given SQL object, but still requires objects to be queried separately. This issue is about providing an overview of all errors in sources specifically, without also providing the exact error messages. The latter would allow users to become aware of errors, which would prompt them to use the former to dig deeper.

teskje commented 7 months ago

Related discussion in Slack.

guswynn commented 7 months ago

I also want this information exposed! Unfortunately, I'm not sure it's trivial!

In the Slack discussion, we noted that because storage does not hold the entire dataset for each source in memory/disk/etc. (except for UPSERT/DEBEZIUM), the count will be either:

We could transact the source data with the shard that records the error count (mz_source_statistics or otherwise), but I don't think that scales well.

I suspect the correct solution here is two-fold:

teskje commented 7 months ago

> using MFP pushdown and some kind of error-selection syntax to use the persist fast-path to get very fast error values out of a source

This is tracked by #22430.

Prometheus metrics would be nice for our monitoring, but probably not particularly useful for users (unless we want to start exposing our metrics).

Once we have #22430, one approach to report sources that have errors could be to have some periodic task that loops over all sources and queries each for errors. If querying sources for errors is cheap enough, that might be workable.
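A rough sketch of such a periodic task follows. It assumes some cheap per-source error query exists, which is what #22430 would provide; the `count_errors` callable, the function names, and the scheduling loop are all hypothetical placeholders rather than Materialize APIs.

```python
import time
from typing import Callable, Dict, Iterable, List


def scan_sources(
    source_names: Iterable[str],
    count_errors: Callable[[str], int],
) -> Dict[str, int]:
    """Query each source once and return only the sources that have errors."""
    report: Dict[str, int] = {}
    for name in source_names:
        try:
            n = count_errors(name)  # hypothetical cheap error-count query
        except Exception:
            # A source that cannot be queried this round is skipped,
            # so one failure does not block the rest of the scan.
            continue
        if n > 0:
            report[name] = n
    return report


def run_periodically(
    scan: Callable[[], Dict[str, int]],
    interval_seconds: float,
    rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, int]]:
    """Run `scan` a fixed number of times, sleeping between rounds."""
    results = []
    for i in range(rounds):
        results.append(scan())
        if i + 1 < rounds:
            sleep(interval_seconds)
    return results


# Usage with fake counts standing in for real queries.
fake_counts = {"orders": 0, "users": 2, "payments": 5}
print(scan_sources(fake_counts, fake_counts.__getitem__))
```

The cost of this approach grows with the number of sources times the cost of one error query, so it is only workable if the per-source query is cheap.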

guswynn commented 7 months ago

> Once we have https://github.com/MaterializeInc/materialize/issues/22430, one approach to report sources that have errors could be to have some periodic task that loops over all sources and queries each for errors. If querying sources for errors is cheap enough, that might be workable.

I agree!

guswynn commented 7 months ago

> Prometheus metrics would be nice for our monitoring, but probably not particularly useful for users (unless we want to start exposing our metrics).

I'll ensure this issue is filed correctly.