electric-sql / electric

Sync little subsets of your Postgres data into local apps and services.
https://electric-sql.com

Create persistence layers that persist the Shape Log for a `ShapeStream` instance for faster loading of shapes (and offline loading) #1519

Open KyleAMathews opened 3 months ago

KyleAMathews commented 3 months ago

The TypeScript `ShapeStream` class is responsible for reading the Shape Log from the Electric server and feeding it to other code that wants to consume the stream, e.g. the `Shape` class, which maintains an in-memory materialized view of the Shape as a JS `Map`.
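
For context, feeding a `ShapeStream` into a `Shape` looks roughly like this (a minimal sketch assuming the `@electric-sql/client` API; option names are approximate):

```ts
import { ShapeStream, Shape } from '@electric-sql/client'

// ShapeStream reads the Shape Log over HTTP and emits batches of messages.
const stream = new ShapeStream({
  url: 'http://localhost:3000/v1/shape',
  params: { table: 'items' }, // assumption: the table is passed via params
})

// Shape materializes the log into an in-memory Map of rows.
const shape = new Shape(stream)
const rows = await shape.rows
```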

Shape Logs are cached in Electric (and in an HTTP caching proxy, e.g. nginx, a CDN, etc., if you're using one), so they're generally cheap to load. But still, if shapes are larger than a few MB, persisting shape logs to disk would speed up loading, especially for people on limited mobile networks.

The implementation for this would be as follows (a rough sketch in TypeScript follows the two lists below):

On Initializing:

  1. `ShapeStream` checks whether there's a persisted log. If so, it loads it, along with the stored offset and shapeId for the log.
  2. Then it makes a fetch to the server as normal to catch up with the shape log on the server.

On receiving new shape log messages:

  1. `ShapeStream` would emit these to subscribers and then append them to the persisted shape log.
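
A minimal sketch of that flow (the `PersistedLog` store and the resume options are assumptions for illustration, not the client's actual API):

```ts
import { ShapeStream, type Message } from '@electric-sql/client'

// Hypothetical persistence store; method names are illustrative only.
interface PersistedLog {
  load(): Promise<
    { messages: Message[]; offset: string; shapeId: string } | undefined
  >
  append(messages: Message[]): Promise<void>
}

async function startPersistedStream(url: string, log: PersistedLog) {
  // 1. Load the persisted log (if any) so we can resume from its offset.
  const persisted = await log.load()

  const stream = new ShapeStream({
    url,
    // Assumption: offset and shapeId can be passed as resume options, so
    // the catch-up fetch starts from where the persisted log ends.
    offset: persisted?.offset,
    shapeId: persisted?.shapeId,
  })

  // 2. Append each new batch of messages to the persisted log; subscribers
  //    receive them through the stream as usual.
  stream.subscribe(async (messages) => {
    await log.append(messages)
  })

  return { stream, persisted }
}
```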

Compaction: Shape Logs eventually fill up with a lot of redundant information (e.g. hundreds of updates to a single row), so periodically compacting the log is necessary to keep reads fast. Compaction is essentially reading through the log from the start and merging repeated operations on each row. When a row is deleted, we can remove all log messages for that row. When compaction is finished, the log contains one insert operation per row in the shape.
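
A sketch of that compaction pass, assuming a simplified entry shape with a row `key`, an operation, and a `value`:

```ts
// Simplified log entry; the real Shape Log format carries more fields.
type LogEntry = {
  key: string
  operation: 'insert' | 'update' | 'delete'
  value: Record<string, unknown>
}

function compact(log: LogEntry[]): LogEntry[] {
  const rows = new Map<string, Record<string, unknown>>()

  for (const entry of log) {
    if (entry.operation === 'delete') {
      // A delete lets us drop every prior message for the row.
      rows.delete(entry.key)
    } else {
      // Merge inserts and updates into the last known value for the row.
      rows.set(entry.key, { ...rows.get(entry.key), ...entry.value })
    }
  }

  // The compacted log is one insert per surviving row.
  return [...rows.entries()].map(([key, value]) => ({
    key,
    operation: 'insert' as const,
    value,
  }))
}
```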

Considerations

  1. Benchmark this: client storage can be slow, particularly on mobile, and sometimes reading from the network is faster than reading from a local cache. We'd need to carefully benchmark the performance of persisting vs. not persisting, on both desktop and mobile. Perhaps persisting is always faster, in which case it can be the default; or perhaps it's only reliably faster on desktop, in which case it could be enabled there. Regardless, benchmarking will be a key part of the design process.
  2. What storage option should we use, and in what form should we store the logs? IndexedDB and OPFS are the two options. IndexedDB is more widely available than OPFS, so it should be the choice unless OPFS is vastly faster.
  3. Subscribers to `ShapeStream` wait for an `up-to-date` control message to let them know that they're caught up with the server and can now inform their own subscribers (e.g. a UI component). When reading the persisted log offline, the `ShapeStream` probably needs to be told it's offline so that it can emit `up-to-date` when it reaches the end of the persisted log, as sketched below.
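
For point 3, the offline replay might look something like this (a sketch only: the `offline` flag and the synthesized control message are assumptions about how this could work, not existing behaviour):

```ts
import type { Message } from '@electric-sql/client'

// Replay a persisted log to subscribers and, when we know we're offline,
// synthesize the `up-to-date` control message ourselves so subscribers
// aren't left waiting for a server round trip that will never happen.
function replayPersistedLog(
  persisted: Message[],
  offline: boolean,
  emit: (messages: Message[]) => void
) {
  emit(persisted)
  if (offline) {
    // Assumption: control messages are signalled via `headers.control`.
    emit([{ headers: { control: 'up-to-date' } } as Message])
  }
}
```
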
samwillis commented 2 months ago

I would argue that this functionality should be a layer above `ShapeStream`. `ShapeStream` is a nice thin abstraction of the protocol, and perfect for feeding into other stores. If that store has persistence, `ShapeStream` doesn't need any itself.

Maybe a `PersistedShapeStream`? It could be composable with a store implementation you pass in (IndexedDB, OPFS, Node FS).

KyleAMathews commented 2 months ago

Ooo yes! I've been uneasy about throwing this into `ShapeStream`, and persistence is a natural layer to compose above it, e.g.:

```ts
const streamWithPersistence = new ShapeStreamOPFS()
const shape = new Shape(streamWithPersistence)
```

balegas commented 2 months ago

We've also talked about not actually storing the log, but just the last value for each key, since developers shouldn't build on historical events: you can never guarantee that you have the full history.

msfstef commented 2 months ago

I can pick this up. My suggestion is a composable `ShapeStreamPersister` that accepts a `ShapeStream` and an instance of a specified `Storage` interface (perhaps `set`, `get`, `delete`? something sensible), so that either we or anyone else can add any storage option (we can start with IndexedDB or local storage).
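
A sketch of that `Storage` interface with an illustrative local storage adapter (the exact method signatures are deliberately an open question):

```ts
// Pluggable storage interface with the set/get/delete shape suggested
// above. Named `Storage` per the suggestion, though note that shadows
// the DOM's built-in Storage type.
interface Storage {
  get(key: string): Promise<string | undefined>
  set(key: string, value: string): Promise<void>
  delete(key: string): Promise<void>
}

// Illustrative adapter backed by window.localStorage.
const localStorageAdapter: Storage = {
  async get(key) {
    return window.localStorage.getItem(key) ?? undefined
  },
  async set(key, value) {
    window.localStorage.setItem(key, value)
  },
  async delete(key) {
    window.localStorage.removeItem(key)
  },
}
```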

I also agree with Valter, which touches on the compaction Kyle mentioned above. I think what we should do is materialize the log as we ingest it, while also keeping track of the last offset seen; when it's time to restore the stream from the DB, the materialized data is converted into a series of inserts (like we do for snapshots in the backend), as sketched below.
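
A sketch of that restore path, converting materialized rows back into a series of insert messages (the message shape here is simplified):

```ts
// Materialized state: row key -> last known value.
type Materialized = Map<string, Record<string, unknown>>

// On restore, replay the materialized data as a series of insert
// messages, much like the backend serves an initial snapshot.
function toInsertMessages(state: Materialized) {
  return [...state.entries()].map(([key, value]) => ({
    key,
    value,
    headers: { operation: 'insert' as const },
  }))
}
```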

Issues regarding FK checks, check constraints, etc. apply to the compacted log coming from the backend anyway, so handling this in the local store should not introduce a separate problem.

KyleAMathews commented 2 months ago

Sounds like a great plan @msfstef!