MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com

Easier self-service debugging of issues caused by bad input data (negative multiplicities, duplicated primary keys, etc.) #3027

Open · umanwizard opened this issue 4 years ago

umanwizard commented 4 years ago

When working with a user who observed incorrectly duplicated records, as well as crashes in Materialize, we ultimately discovered that they were suffering from #3026. Reaching that diagnosis took a few hours of confusion and manual debugging, which is a poor user experience.

Ideally, we would come up with a way to make this process much smoother.

quodlibetor commented 4 years ago

Could we do something like keeping a Bloom filter of hashes of all messages received, and warn or increment a metric every time we get a hit? Could we get a reasonable signal-to-noise ratio by taking the various unique indexes into account?

umanwizard commented 4 years ago

You could, but you'd still be consuming memory proportional to the number of elements. In that case, I think it'd be simpler to just keep a set of all the primary keys you've seen.

quodlibetor commented 4 years ago

> you'd still be consuming memory proportional to the number of elements

Isn't the proportion logarithmic instead of linear, though?

umanwizard commented 4 years ago

I'm not an expert on Bloom filters, but my understanding is that they use memory (linearly) proportional to the number of elements in the set.

quodlibetor commented 4 years ago

Yeah, in retrospect I don't actually know how that could work out mathematically. Looking it up, they take a pretty small amount of space, but possibly several bytes per element to get the false-positive rate down to a level of certainty that is worth having.
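
For reference, a minimal worked sizing example using the standard optimally-sized Bloom filter formula (a textbook result, not specific to anything in Materialize):

```latex
% Bits per element for an optimally sized Bloom filter holding n elements
% at false-positive rate p (assumes the optimal hash count k = (m/n) ln 2):
\frac{m}{n} \;=\; \frac{-\ln p}{(\ln 2)^2} \;\approx\; 1.44\,\log_2 \frac{1}{p}
% Worked values:
%   p = 10^{-2}:  about  9.6 bits (~1.2 bytes) per element
%   p = 10^{-6}:  about 28.8 bits (~3.6 bytes) per element
```

So memory is indeed linear in the number of elements, just with a much smaller constant than an exact set (e.g., at least 8 bytes per element for a set of 64-bit key hashes, before any overhead).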

frankmcsherry commented 4 years ago

I'm +1 on the idea that we make Materialize good for this. Not that we would necessarily do it automatically for folks, but we did eventually largely use MZ itself to diagnose the problem. There were some false starts, where we used SELECT and so only got point-in-time answers, and where we had to dance around MZ's optimizer with respect to primary keys, but in principle many of the issues could have been tracked down by someone with a recipe book of MZ queries to issue.
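
As a concrete sketch of what one such recipe might look like (the source name `orders` and key column `order_id` are illustrative, not from this issue), the duplicate check can be maintained as a materialized view, so it gives continuously updated answers rather than the point-in-time results of a plain SELECT:

```sql
-- Hypothetical recipe: continuously surface primary key violations.
-- "orders" and "order_id" are illustrative names.
CREATE MATERIALIZED VIEW order_id_duplicates AS
SELECT order_id, count(*) AS copies
FROM orders
GROUP BY order_id
HAVING count(*) > 1;

-- Any row appearing here means a key is duplicated upstream:
SELECT * FROM order_id_duplicates;
```

Note the "dance around the optimizer" caveat above: if `order_id` is declared as a key on the source, the optimizer may simplify the aggregation away (a declared key provably has count 1), so a recipe like this would need to run against a stage where the key is not assumed.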

umanwizard commented 4 years ago

@frankmcsherry IMO this will be much easier to debug with Materialize if/when nested records ever land, because then your downstream Debezium-flavored view will just be some relational transform over your "avro decode" and "kafka ingest" views, which the user here would have been able to write queries against directly, had they existed.
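
To illustrate (purely hypothetical, since neither nested records nor these exposed decode stages existed at the time): with record-typed columns, the raw Debezium envelope could be inspected directly, before envelope processing collapses it:

```sql
-- Purely hypothetical: assumes nested-record support plus an exposed
-- "orders_decoded" view of raw Debezium envelopes with record-typed
-- "before"/"after" columns. All names here are illustrative.
SELECT (after).order_id AS order_id, count(*) AS envelope_count
FROM orders_decoded
GROUP BY (after).order_id
HAVING count(*) > 1;
```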

cuongdo commented 4 years ago

Can we introduce something like MySQL's CHECK TABLE, which is a manually triggered way to validate a table? Our equivalent would be a new command like CHECK SOURCE, which reads all data from a source and checks for invalid multiplicities and other issues that would impede view maintenance by Materialize.
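
A sketch of how that might look (`CHECK SOURCE` is hypothetical syntax from this proposal, and `kafka_orders` is an illustrative name; only MySQL's `CHECK TABLE` actually exists):

```sql
-- Real MySQL command the proposal is modeled on:
CHECK TABLE orders;

-- Hypothetical Materialize analogue: scan the source once and report
-- problems such as negative multiplicities or duplicated primary keys.
CHECK SOURCE kafka_orders;
```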