electric-sql / electric

Sync little subsets of your Postgres data into local apps and services.
https://electric-sql.com

Bring down the compute cost of evaluating WHERE clauses #1746

Open balegas opened 2 months ago

balegas commented 2 months ago

There is a large compute overhead for evaluating shape where clauses.

What are the performance numbers we're aiming for? The shape matching algorithm should be able to evaluate a single operation against millions of shapes. The algorithm should work well for cases where a single operation only touches a few shapes:

We have a few ideas for optimizations we could do:

We expect apps not to have a lot of distinct shape definitions, but to have at least one shape definition with millions of instances, e.g. user data shapes. It makes sense to optimize those kinds of shapes if necessary.
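As a rough illustration of the "one shape definition with millions of instances" case (illustrative TypeScript, hypothetical names, not the sync service's actual code): if those shapes only differ in a bound value, e.g. `user_id = $1`, we could index them by that value so a changed row is matched with a hash lookup instead of evaluating every shape's where clause:

```ts
// Hypothetical sketch, not Electric's implementation: index shapes of the form
// `<column> = <constant>` by column and value; everything else falls back to a
// linear scan over the remaining where clauses.
type Row = Record<string, unknown>;

interface Shape {
  id: string;
  matches: (row: Row) => boolean; // full where-clause evaluator
}

class EqualityShapeIndex {
  private byColumn = new Map<string, Map<unknown, Shape[]>>();
  private unindexed: Shape[] = [];

  addEqualityShape(column: string, value: unknown, shape: Shape): void {
    const byValue = this.byColumn.get(column) ?? new Map<unknown, Shape[]>();
    const bucket = byValue.get(value) ?? [];
    bucket.push(shape);
    byValue.set(value, bucket);
    this.byColumn.set(column, byValue);
  }

  addOtherShape(shape: Shape): void {
    this.unindexed.push(shape);
  }

  // Cost per row is O(indexed columns + unindexed shapes), independent of how
  // many indexed shapes exist.
  affectedShapes(row: Row): Shape[] {
    const hits: Shape[] = [];
    for (const [column, byValue] of this.byColumn) {
      hits.push(...(byValue.get(row[column]) ?? []));
    }
    for (const shape of this.unindexed) {
      if (shape.matches(row)) hits.push(shape);
    }
    return hits;
  }
}
```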

thruflo commented 2 months ago

Terminology nit: for me "match single write with millions of shapes" is confusing as I associate "match" with actually matching the row to the shape. I suggest we "evaluate" / "compare" / "filter" the row against millions of shapes but only match on the ones that ... match.

Then a couple of notes:

  1. in the bullets above, we don't touch on adjustments to the actual algorithm -- I would assume that reducing the number of times a where clause needs to be evaluated to do the matching would be a core aspect of this (see the sketch after this list)
  2. "we should remain performant with 50" -> perhaps we can be more specific about the requirement for how performance can acceptably degrade with more matched shapes; i.e. we don't actually know whether we will have workloads with larger match numbers -- in those cases, are we happy to explode if there are 51, or do we want performance to degrade along a certain curve?
KyleAMathews commented 2 months ago

With writes out of the critical path, the main question there is how big of a burst we can handle and how much sustained write throughput we can handle before we fall apart. We should also be testing various hard drive setups to see how much that impacts things.
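A rough harness for that kind of measurement, matching stage only (illustrative TypeScript; `matchRow` is a hypothetical stand-in for whatever the real matcher does per replicated write, and disk setups would need end-to-end tests on the real write path):

```ts
// Sketch of a burst vs. sustained throughput measurement for the matching stage.
type Row = Record<string, unknown>;

function benchmarkMatcher(matchRow: (row: Row) => unknown, rows: Row[]): void {
  // Sustained: a steady stream of writes, reported as rows matched per second.
  const t0 = performance.now();
  for (const row of rows) matchRow(row);
  const sustainedSecs = (performance.now() - t0) / 1000;
  console.log(`sustained: ${(rows.length / sustainedSecs).toFixed(0)} rows/s`);

  // Burst: how long a backlog of writes takes to drain in one go.
  const burst = rows.slice(0, 10_000);
  const t1 = performance.now();
  for (const row of burst) matchRow(row);
  console.log(`burst of ${burst.length} rows: ${(performance.now() - t1).toFixed(1)} ms`);
}
```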

balegas commented 2 months ago

> I would assume that reducing the number of times a where clause needs to be evaluated to do the matching would be a core aspect of this

Regardless of the optimisations we can do for specific types of shapes, we still need the system to be able to handle a certain number of where clauses.

> perhaps we can be more specific about the requirement for how performance can acceptably degrade with more matched shapes

That's a good point, but my intention is to say that we should be able to handle a lot more than what people would typically use. We can get to how performance degrades beyond that in the future.

thruflo commented 2 months ago

> Regardless of the optimisations we can do for specific types of shapes, we still need the system to be able to handle a certain number of where clauses.

Sorry, I don’t follow. My point is about the algorithm for efficient matching, not about the type of shapes.

For example, shape 1 has where a and b. Shape 2 has where b and c. Somehow only run the “where b” part of the check once …
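A minimal sketch of that sharing idea, assuming shapes can be expressed as conjunctions of named sub-predicates (illustrative TypeScript, hypothetical names, not the actual matcher): each sub-predicate is evaluated at most once per row, so the shared "b" check runs a single time for both shapes:

```ts
// Hypothetical sketch: shapes are ANDs of named sub-predicates, and sub-predicate
// results are memoised per row so shared checks are only evaluated once.
type Row = Record<string, unknown>;
type Predicate = (row: Row) => boolean;

class SharedPredicateMatcher {
  private predicates = new Map<string, Predicate>(); // shared sub-predicates
  private shapes = new Map<string, string[]>(); // shape id -> sub-predicate names

  definePredicate(name: string, predicate: Predicate): void {
    this.predicates.set(name, predicate);
  }

  defineShape(shapeId: string, predicateNames: string[]): void {
    this.shapes.set(shapeId, predicateNames);
  }

  matchingShapeIds(row: Row): string[] {
    const cache = new Map<string, boolean>(); // per-row memo of sub-predicate results
    const evaluate = (name: string): boolean => {
      if (!cache.has(name)) cache.set(name, this.predicates.get(name)!(row));
      return cache.get(name)!;
    };
    const matched: string[] = [];
    for (const [shapeId, names] of this.shapes) {
      if (names.every(evaluate)) matched.push(shapeId);
    }
    return matched;
  }
}

// Usage mirroring the example: shape 1 is "a and b", shape 2 is "b and c".
// const m = new SharedPredicateMatcher();
// m.definePredicate("a", row => row.status === "active");
// m.definePredicate("b", row => row.org_id === 42);
// m.definePredicate("c", row => row.kind === "task");
// m.defineShape("shape-1", ["a", "b"]);
// m.defineShape("shape-2", ["b", "c"]);
// m.matchingShapeIds({ status: "active", org_id: 42, kind: "doc" }); // ["shape-1"]
```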