teskje commented 7 months ago

Feature request

Errors in sources (e.g. Avro decoding errors) can have a serious negative impact on the UX of Materialize. In particular:

They usually make the affected sources and dataflows built upon them un-queryable.
They can increase memory usage when the number of errors is high.

The latter issue is the more problematic one because it is not obviously caused by errors. Users might just see their cluster starting to OOM, without any apparent reason.

Errors in sources are almost certainly caused by some usage error, so we can't prevent them. But we should do a better job at surfacing their existence to users. Today, the only way users can find out about errors in their sources is by querying the affected sources. Instead we should proactively warn them about the existence of errors in the console.

The database should make error counts for all sources conveniently available, e.g., in a SQL relation. The console should show these error counts in an appropriate place, e.g., together with the overall source status.

Related Issues

#22430 is about providing a way to efficiently extract errors from any given SQL object, but still requires objects to be queried separately. This issue is about providing an overview of all errors in sources specifically, without also providing the exact error messages. The latter would allow users to become aware of errors, which would prompt them to use the former to dig deeper.

teskje commented 7 months ago

Related discussion in Slack.

guswynn commented 7 months ago

I also want this information exposed! Unfortunately, I'm not sure it's trivial!

In Slack we discussed that because storage does not hold the entire dataset for each source in memory/disk/etc (except for UPSERT/DEBEZIUM), the count will be either:

undercounting, where we reset the counts to 0 on restart/crash/etc
overcounting, where we retain the count on restart, and may record an error multiple times as we try to commit it to the data shard
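
To make the two failure modes concrete, here is a toy model (not how storage actually works). It assumes the counter is bumped as messages are processed, that only a prefix of them was durably committed before the crash, and that ingestion resumes from the last committed offset. The function name and parameters are hypothetical.

```python
def count_errors(messages, committed, crash_at, retain_on_restart):
    """Toy model of an error counter across a single crash.

    messages: list of bool, True means the message failed to decode.
    committed: number of messages durably committed before the crash.
    crash_at: number of messages processed (and counted) before the crash;
        assumed to be >= committed.
    retain_on_restart: whether the counter survives the restart.
    """
    # Errors counted before the crash, including uncommitted messages.
    count = sum(messages[:crash_at])
    if not retain_on_restart:
        # Counter is reset: everything counted so far is forgotten.
        count = 0
    # After the restart, ingestion resumes from the last committed offset,
    # so uncommitted messages are processed (and counted) a second time.
    count += sum(messages[committed:])
    return count


messages = [True, False, True, True, False, True]
true_errors = sum(messages)  # 4

reset = count_errors(messages, committed=2, crash_at=4, retain_on_restart=False)
kept = count_errors(messages, committed=2, crash_at=4, retain_on_restart=True)
print(true_errors, reset, kept)  # 4 3 6
```

With the counter reset we report 3 of the 4 errors; with the counter retained we report 6, because every error past the committed offset is counted again on replay.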

We could transact the source data with the shard that records the error count (mz_source_statistics or otherwise), but I don't think that scales well.

I suspect the correct solution here is two-fold:

Prometheus metrics on error counts; this seems like a no-brainer for debugging
using MFP pushdown and some kind of error-selection syntax to use the persist fast-path to get very fast error values out of a source

teskje commented 7 months ago

using MFP pushdown and some kind of error-selection syntax to use the persist fast-path to get very fast error values out of a source

This is tracked by #22430.

Prometheus metrics would be nice for our monitoring, but probably not particularly useful for users (unless we want to start exposing our metrics).

Once we have #22430, one approach to report sources that have errors could be to have some periodic task that loops over all sources and queries each for errors. If querying sources for errors is cheap enough, that might be workable.
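
A rough sketch of what such a periodic task could look like, just to illustrate the shape. `has_errors` is a hypothetical stand-in for whatever cheap per-source error query #22430 ends up providing, and `report` stands in for wherever the result gets surfaced (a SQL relation, the console, etc.); none of these names exist today.

```python
import time
from typing import Callable, Iterable, List


def find_sources_with_errors(
    sources: Iterable[str], has_errors: Callable[[str], bool]
) -> List[str]:
    """Return the sources for which the (hypothetical) error query hits."""
    return [source for source in sources if has_errors(source)]


def run_error_check_loop(
    sources: Iterable[str],
    has_errors: Callable[[str], bool],
    report: Callable[[List[str]], None],
    interval_secs: float,
    iterations: int,
) -> None:
    """Check all sources `iterations` times, sleeping between rounds."""
    sources = list(sources)  # allow re-iterating on every round
    for _ in range(iterations):
        report(find_sources_with_errors(sources, has_errors))
        time.sleep(interval_secs)
```

The cost is one error query per source per round, so whether this is workable depends entirely on how cheap that query turns out to be.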

guswynn commented 7 months ago

Once we have https://github.com/MaterializeInc/materialize/issues/22430, one approach to report sources that have errors could be to have some periodic task that loops over all sources and queries each for errors. If querying sources for errors is cheap enough, that might be workable.

I agree!

guswynn commented 7 months ago

Prometheus metrics would be nice for our monitoring, but probably not particularly useful for users (unless we want to start exposing our metrics).

I'll ensure this issue is filed correctly.

Surface errors in sources #24676
