andymatuschak / orbit

Experimental spaced repetition platform for exploring ideas in memory augmentation and programmable attention
https://withorbit.com

Project: data architecture improvements #192

Closed andymatuschak closed 3 years ago

andymatuschak commented 3 years ago

As I implemented the API for Orbit, I noticed that I made syncing very cost-inefficient by giving each task state its own hash graph. Since I'll need to run a full-database migration to change this (see #188), I figured I should simultaneously make any other changes I'd like to see in the data model—these migrations are a hassle.

A few key things I'd like to do:

  1. Prepare the data model for future variation: non-prompt tasks and different scheduling strategies. In particular I'd like to eliminate the distinction between ActionLog and PromptActionLog, and I'd like to make PromptState become TaskState. Tasks should be able to specify an "interaction type" (memory, exercise, activity, etc) in addition to describing the contents; the interaction type will determine the feedback semantics and the default scheduling behavior. I'll at least partially think through how, e.g., an incremental reading client or a timeful text might be represented (a rough interface sketch follows below).

  2. If possible, unify the three stored types (logs, prompts, and attachments) as "objects," so that storage / syncing modules can treat them more uniformly.

  3. I'll sketch out how edits to prompts and organization operations (tags, folders) might work, just to make sure I'm not backing myself into a corner.

I'll start by just writing out new interfaces and playing with those. Then I'll perform the full migration. Then I'll have to work on #188.
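To make point 1 a bit more concrete, here's the kind of shape this implies; a rough sketch only, with illustrative names rather than final interfaces:

```ts
// Hypothetical sketch only: these names are illustrative, not the final interfaces.
type TaskInteractionType = "memory" | "exercise" | "activity";

interface TaskContent {
  type: "qa" | "cloze" | "plain";
  body: string;
  attachmentIDs: string[];
}

interface TaskComponentState {
  dueTimestampMillis: number;
  intervalMillis: number;
}

interface Task {
  id: string;
  // Determines feedback semantics and the default scheduling behavior.
  interactionType: TaskInteractionType;
  content: TaskContent;
  // Many-to-one: e.g. one state per cloze deletion, all sharing one piece of content.
  componentStates: { [componentID: string]: TaskComponentState };
}
```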

andymatuschak commented 3 years ago

As I sketch out new data structures, I notice that my approach is highly entangled with the important question of how to efficiently sync these structures (i.e. see #188). So I’ll have to discuss both at once.

I also notice just how much complexity I’ve caused for myself by framing the data model as a hash graph, with stable content-addressed identifiers and immutable contents. Yes, such structures have many excellent theoretical properties, but boy do they make it more difficult to iterate, and that is absolutely the wrong trade to be making for this project right now.

Some notes on data structure desiderata:

Modulo the complication of the many-to-one relationship between task states and task content, my instinct would be to simply combine all these elements. The tractability of that approach really depends on how we store, sync, and query tasks.

I’ll now describe three approaches to syncing which seem plausible to me.

(A much simpler) event sourcing model

The log-driven state model in the current system is more typically called an “event sourcing” model. The idea is basically: you don't transmit the state of the system; you transmit actions, then compute an effective state locally. State is always just a cache, computed over the action stream.
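Stripped of all the extra machinery, the core of the pattern is just a fold over the action log; a minimal sketch with illustrative types:

```ts
// Minimal event-sourcing sketch (illustrative types): state is never stored
// authoritatively; it's a cache computed by folding over the action stream.
interface ActionLogEntry {
  entityID: string;
  timestampMillis: number;
  // ...action-specific payload
}

type Reducer<State> = (oldState: State | null, entry: ActionLogEntry) => State;

function computeState<State>(log: ActionLogEntry[], reduce: Reducer<State>): State | null {
  return log.reduce<State | null>((state, entry) => reduce(state, entry), null);
}
```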

This doesn’t have to be terribly complex, but our current model includes several decisions which dramatically increase the complexity:

All these things create nice properties! But they also drastically increase the complexity of the system, both to maintain and to iterate.

We could create a much simpler event sourcing model in which:

Attachments would still need to be fetched out-of-band on native clients, which is a nuisance for maintaining consistency. But this would be much, much easier to maintain than our current model.

This doesn’t imply much change to our data stores: we’d still use LevelDB / IDB to store and query. We can still read/write to Firestore for now on the server, but this much dumber model will be easier to write a self-hosted backend for.

Sync states, not events

An even “dumber” approach would be to sync the computed state snapshots, rather than the events which produced them, with last-writer-wins or very simple conflict resolution logic.

The simplest form of this would be to consider each task as a single document, with a modification time. To sync: keep track of the last time you synced with a server; send/request all documents modified after that time.
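In code, the whole protocol is roughly this (the DocumentStore interface and its method names are hypothetical):

```ts
// Sketch of modification-time document sync with last-writer-wins semantics.
interface TaskDocument {
  id: string;
  modifiedAtMillis: number;
  // ...task content and computed state
}

interface DocumentStore {
  listDocumentsModifiedAfter(timestampMillis: number): Promise<TaskDocument[]>;
  // Assumed to keep whichever copy has the newer modification time on conflict.
  putDocuments(documents: TaskDocument[]): Promise<void>;
}

async function syncDocuments(
  local: DocumentStore,
  remote: DocumentStore,
  lastSyncTimestampMillis: number,
): Promise<number> {
  const [localChanges, remoteChanges] = await Promise.all([
    local.listDocumentsModifiedAfter(lastSyncTimestampMillis),
    remote.listDocumentsModifiedAfter(lastSyncTimestampMillis),
  ]);
  await remote.putDocuments(localChanges);
  await local.putDocuments(remoteChanges);
  return Date.now(); // becomes the next lastSyncTimestampMillis
}
```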

These documents could even be serialized as simple files on disk. We could implement a trivial “backend” via rsync to GCS. People could self-host with Dropbox! Mobile and web clients would use LevelDB / IDB directly as their document store, rather than the file system, so I suppose we’d need a separate rsync-like backend for them to sync.

Orbit clients would need to separately maintain an efficient index (e.g. of tasks by due time) in order to present their interfaces efficiently. This turns out to be more possible than I’d thought. I was somewhat surprised to find while investigating this model that even running through Node.js, I can lstat 50k files on my several-year-old MacBook Pro in about 200ms. I’d expected using flat files to be totally intractable… but I guess maybe it’s not.
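(For reference, the measurement was along these lines; the directory path below is a placeholder.)

```ts
// Rough reconstruction of that kind of quick benchmark.
import { lstatSync, readdirSync } from "fs";
import * as path from "path";

const taskDirectory = "/path/to/a/directory/with/50k/files"; // placeholder
const filenames = readdirSync(taskDirectory);
const startMillis = Date.now();
for (const filename of filenames) {
  lstatSync(path.join(taskDirectory, filename));
}
console.log(`lstat'd ${filenames.length} files in ${Date.now() - startMillis}ms`);
```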

Computing server indexes (e.g. for notifications) becomes more complicated in this regime: we’d need to attach cloud functions to GCS file changes and update a queryable database accordingly. Not so bad.

A variant of this model, which would be somewhat more resilient to conflicts, would be to model each task as a folder comprising content.json, review/TIMESTAMP-1.json, review/TIMESTAMP-2.json, review/TIMESTAMP-3.json, etc. That is, the data modeling each specific review action would be broken out into its own file. In this model, reviews performed on separate devices on the same task (a fairly common case) would not cause a conflict. The downside is increased complexity and an order of magnitude more files to deal with syncing. The latter could be a real problem if we use something like rsync naively, since its protocol involves sending over a complete file list with timestamps/hashes.
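Concretely, a review in that variant would land on disk something like this (layout and helper names are illustrative):

```ts
// Sketch of the folder-per-task variant: each review action lands in its own
// timestamped file, so reviews on different devices don't touch the same file.
import { promises as fs } from "fs";
import * as path from "path";

interface ReviewAction {
  taskID: string;
  timestampMillis: number;
  outcome: "remembered" | "forgotten";
}

async function writeReview(storePath: string, review: ReviewAction): Promise<void> {
  // e.g. <storePath>/<taskID>/review/1600000000000.json
  const reviewDirectory = path.join(storePath, review.taskID, "review");
  await fs.mkdir(reviewDirectory, { recursive: true });
  await fs.writeFile(
    path.join(reviewDirectory, `${review.timestampMillis}.json`),
    JSON.stringify(review),
  );
}
```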

This would be quite an invasive change to both server and client. I’d probably begin with this model by implementing the all-in-one task-as-document model, using LevelDB/IDB as store, since the mobile/web clients need that anyway. Then we could add a file system-based store and better conflict resiliency in time.

One interesting advantage of this model is that clients don’t need to contain the complex logic necessary to reconstruct a task state from its actions. If you wanted to write something that interacted with Orbit data structures from, say, Python, it’d be much easier this way.

Lean on CouchDB

Both of these approaches involve implementing our own syncing protocol to varying degrees. I can’t help feeling like this is silly, and something we should really try to avoid. Isn’t there a general-purpose syncable data store we can use?

This is basically the mission of Apache CouchDB. CouchDB is written in Erlang, so it’s something we’d run on the server; PouchDB is a JavaScript implementation. The big draw of the system is that it implements its own replication, certainly much more robustly than anything we’d manage. The database model itself is much like LevelDB, so we already know the model is a decent fit for our needs. PouchDB would store the database in native LevelDB bindings in our native apps and IDB on the web.

We could use CouchDB to replicate a data model based on either of the two approaches above (event-based or document-based). If we wanted to expose a file system interface for the latter, we could always use FuseFS or something.

Another interesting advantage of CouchDB is that it’s a general replication protocol (like rsync), so anyone could run their own CouchDB server, point Orbit at it, and voila! Self-hosted sync, with no extra effort. This is great!
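The replication side of that would be nearly free on our end; something like the following (database names and URLs are placeholders):

```ts
// Sketch of PouchDB replication against a user-supplied CouchDB server.
import PouchDB from "pouchdb";

const localDB = new PouchDB("orbit"); // LevelDB-backed on native, IndexedDB on web
const remoteDB = new PouchDB("https://couch.example.com/orbit-username"); // placeholder

// Continuous two-way replication; PouchDB/CouchDB handle the protocol details.
localDB
  .sync(remoteDB, { live: true, retry: true })
  .on("change", (info) => console.log("replicated", info))
  .on("error", (error) => console.error("replication error", error));
```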

My primary concerns with adopting CouchDB:

  1. We’d have to operate the servers ourselves—monitoring, availability, scaling, sharding, the whole shebang. I don’t know in advance how much effort this would be, and I also don’t know how many cloud VMs we’d need to offer acceptable performance (so this might be quite expensive… I don’t think so, but I can’t tell). It seems like a shame to do this when hosted cloud databases are so cheap and simple. As it happens, there is a hosted solution for CouchDB (Cloudant), but it’s shockingly and prohibitively expensive for this project.
  2. CouchDB is quite complex and quite general. It handles syncing for us, but we’d be buying into a lot of surface area. And because it’s general, it would not be as efficient as some of the options above, which can be more tightly scoped to our needs. For instance, it (like our current model) maintains a distinct revision graph for each document. We don’t need that complexity, but we’ll pay for it on every sync.

It would also create some extra complexity on our backend. The semantics of its current authentication model would mean giving each user their own database. It’s not possible to make queries across databases, so we’d need to run a service which selectively replicates active user databases into an admin-read-only “merged” database to run cross-user queries (e.g. for notifications). We’d also need to build some glue to bridge our existing authentication system into its internal user management system. I don’t think either of these things would be too onerous, but they do add complexity and failure surface.


I’m going to let this simmer for the weekend. Each approach has significant advantages and disadvantages; there isn’t a clear winner in my mind at the moment.

andymatuschak commented 3 years ago

Reflecting on this over the weekend, one thing which makes me lean towards event-sourcing (or at least away from document-based methods) is that for research analysis purposes, I really do need a log of events which took place. It's not even enough to just separate out all the review actions, since I'll almost certainly also be interested in events like "user deleted this prompt" (was that because it wasn't interesting/meaningful? this is an important behavior to examine!).

So if I go with the document-based model, I end up needing to implement diffing to extract the events. The combination is perhaps still less complex than the event-sourcing model, but it strikes me as brittle/lossy, and it does seem to be giving up much of the simplicity the document-based model enjoyed.

kirkbyo commented 3 years ago

Very interesting. There isn't a clear winner in my mind either, but a few points came to mind as I was reading:

andymatuschak commented 3 years ago

Thanks, those are yet more good reasons to lean towards a relatively minimalistic event sourcing model! I'm digging in on the next level of detail now…

andymatuschak commented 3 years ago

Last night I pushed the work I’ve got so far on a new data model and approach to syncing. It’s in a new package called store, which is also (temporarily) holding a new version of core called core2.

Overall, I’ve tried to use this opportunity to simplify things a great deal, while leaving us more room to expand in the near future. Key differences in the data model:

The broad architecture for these data types is the same, but made somewhat more general. I’ve implemented a standardized event sourcing data store. What that means is that the only real things in this system are events, which describe changes to entities. The only type of entity right now is a task, but future entities might include collections and tags. Task data structures aren’t modified directly; they’re computed from the event sequence through a “reducer” which defines how an event modifies an entity. For the sake of performance, these snapshots are computed when writing new events to the data store, saved, and indexed. But they should be viewed as a cache; events are the source of truth.
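Roughly, the shapes involved look like this; a simplified sketch, not the real core2/store types, which differ in detail:

```ts
// Simplified sketch: events are the source of truth, entity snapshots are a cache
// computed by a reducer and can always be recomputed by replaying the event sequence.
type OrbitEvent =
  | { type: "taskIngest"; entityID: string; timestampMillis: number; spec: TaskSpec }
  | { type: "taskRepetition"; entityID: string; timestampMillis: number; outcome: "remembered" | "forgotten" }
  | { type: "taskDelete"; entityID: string; timestampMillis: number };

interface TaskSpec {
  // ...content, interaction type, etc.
}

interface Task {
  id: string;
  spec: TaskSpec;
  isDeleted: boolean;
  dueTimestampMillis: number;
}

// The reducer defines how each event modifies a task snapshot.
function reduceTask(oldTask: Task | null, event: OrbitEvent): Task {
  switch (event.type) {
    case "taskIngest":
      return {
        id: event.entityID,
        spec: event.spec,
        isDeleted: false,
        dueTimestampMillis: event.timestampMillis,
      };
    case "taskRepetition":
      return { ...oldTask!, dueTimestampMillis: nextDueTimestamp(oldTask!, event) };
    case "taskDelete":
      return { ...oldTask!, isDeleted: true };
  }
}

declare function nextDueTimestamp(task: Task, event: OrbitEvent): number; // scheduler, elided
```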

Events and entity snapshots are stored in a database that’s meant to be an accessible file format. My intent is to make the store library a public API. If you want to write a desktop script / app which interacts with Orbit (like our note syncing script), the best way to do it is to have that script simply read/write the local data store. Then, if it wants, it can explicitly sync the local store with the server. This way changes from local scripts can be reflected offline, and without a round trip to a server API. The server API becomes something to use in contexts where you wouldn’t have a full local store, like a highlighting mashup embedded into a web page.
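For instance, a note-ingestion script might look roughly like this; all of the store API names below are placeholders, not the final interface:

```ts
// Hypothetical sketch of a local script writing directly to the on-disk store.
// openOrbitStore, putEvents, sync, and close are placeholder names.
import { randomUUID } from "crypto";

interface LocalStore {
  putEvents(events: object[]): Promise<void>;
  sync(options: { serverURL: string }): Promise<void>;
  close(): Promise<void>;
}

declare function openOrbitStore(storePath: string): Promise<LocalStore>; // placeholder

async function ingestPrompt(promptMarkup: string): Promise<void> {
  const store = await openOrbitStore("/path/to/local/orbit/store"); // placeholder path
  await store.putEvents([
    {
      type: "taskIngest",
      entityID: randomUUID(),
      timestampMillis: Date.now(),
      spec: { contentMarkup: promptMarkup },
    },
  ]);
  // A script can stop here and work entirely offline, or explicitly reconcile:
  await store.sync({ serverURL: "https://withorbit.com" });
  await store.close();
}
```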

Syncing is pretty straightforward in this model:

  1. Send the server all the events which occurred after the last event we sent.
  2. Request all the events after the one we last received.
  3. When receiving new events, update the corresponding entities’ snapshots.

But “after” is a bit tricky here: you don’t want to make a syncing system whose correctness depends on accurate and monotonic client clocks. There are lots of ways to get well behaved distributed clocks, but for now I’ve chosen an exceedingly simple approach, which should work fine for us, though it means we can’t easily sync peer-to-peer—only client-server. My approach is: client-side, a simple sequence number functions as a local logical clock; server-side, we rely on Firebase’s ServerTimestamp, which uses Google TrueTime to produce monotonic timestamps. In practice this means simple data types for our clocks and a simple comparison for “after.”
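Concretely, each event carries both clocks, and “after” reduces to a plain comparison on each side (field names are illustrative):

```ts
// Illustrative sketch of the two clocks and the two "after" comparisons used in sync.
interface StoredEvent {
  // Client-side logical clock: increments with every locally-recorded event.
  localSequenceNumber: number;
  // Assigned by the server when it first receives the event; null until then.
  serverTimestampMillis: number | null;
}

// Client → server: send everything recorded after the last event we sent.
function eventsToSend(events: StoredEvent[], lastSentSequenceNumber: number): StoredEvent[] {
  return events.filter((event) => event.localSequenceNumber > lastSentSequenceNumber);
}

// Server → client: request everything stamped after the last event we received.
function isNewFromServer(event: StoredEvent, lastReceivedServerTimestampMillis: number): boolean {
  return (
    event.serverTimestampMillis !== null &&
    event.serverTimestampMillis > lastReceivedServerTimestampMillis
  );
}
```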

We do still need to rely on local timestamps for the spaced repetition algorithm itself, which depends on clock time deltas. My simple approach to the distributed clock problem means there could be events which violate causation (ie the logical timestamps and client timestamps are reversed), but I don’t think that really matters in practice, so long as the reducers are well defined. If necessary, we could solve this by using a hybrid logical clock, which can capture both approximate relative clock times and strict causative ordering.

Still to do:

Phew. That’s a lot! Help with any/all welcome.

Cc @kirkbyo

andymatuschak commented 3 years ago

Unfortunately I've discovered today (thanks, tests!) that Firestore's serverTimestamp doesn't actually use TrueTime as I thought, so we can't use it to get a free monotonic clock.

This is a bit tricky, since (at least in our current architecture) API requests are potentially distributed to many independent servers. We'll have to arrange monotonicity using transactions or using something like a hybrid logical clock salted with random bits (to accommodate multiple server nodes).
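For illustration, the salted-HLC idea is roughly the following; a sketch of the general technique, not necessarily how the eventual implementation works:

```ts
// Rough sketch of a salted hybrid-logical-clock-style ordered ID.
let lastTimestampMillis = 0;
let counter = 0;

function nextOrderedID(): string {
  const now = Date.now();
  if (now > lastTimestampMillis) {
    lastTimestampMillis = now;
    counter = 0;
  } else {
    // Logical component: keeps IDs monotonic even if the wall clock stalls or rewinds.
    counter++;
  }
  // Random salt makes collisions between concurrent server nodes unlikely.
  const salt = Math.floor(Math.random() * 0x10000);
  // Fixed-width hex components so the IDs sort lexicographically.
  return [
    lastTimestampMillis.toString(16).padStart(12, "0"),
    counter.toString(16).padStart(4, "0"),
    salt.toString(16).padStart(4, "0"),
  ].join("");
}
```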

andymatuschak commented 3 years ago

The salted HLC does seem to be fine for the purposes of our stable listing contracts. Linking here for posterity: https://github.com/andymatuschak/orbit/blob/master/packages/backend/src/backend/2/orderedID.ts

andymatuschak commented 3 years ago

Some updates: sync is now implemented (in a new package, sync). I've also implemented migration routines; existing API endpoints now simultaneously write data structures in the new format (this'll help us test and verify).

andymatuschak commented 3 years ago

The new scheduler, syncing module, and backend implementation are now all used end-to-end in the local configuration. Incremental syncing seems to be working nicely.

Still working through pushing the new types through all of app (currently "backporting" types outside the model layer).

andymatuschak commented 3 years ago

app and web-component are now fully migrated to core2 in the core2-app branch. That's it for the key packages! I've also implemented BigQuery logging for the new data types.

Next up: I'll deploy the write-twice cloud functions, migrate my own user data, and live on the core2 app for a few days to make sure I don't see anything unexpected.

andymatuschak commented 3 years ago

I've migrated my own user now and worked through a lot of small issues which arose through that work. Double-writes appear to be working as expected.

A full cold sync of my database is pretty slow: 24s for a few tens of thousands of events. Of course, no one else has a large database, so this isn't an issue in practice. But this'll need to be optimized if the platform gets real use. The easiest thing would be to build a "cold start" API which lets you directly download / upload the on-disk representation.

I'll watch for errors as I review in the next few days, then I'll run the migration across all users.

I can't deploy the new app until a few more small details get wrapped up:

andymatuschak commented 3 years ago

Migrated all users today, fixing a number of small bugs in the process. I haven't yet deployed the core2-based app; I'll monitor data from the write-twice service for the next 12 hours or so first.

andymatuschak commented 3 years ago

Notification logic has been migrated, and I've deployed the core2 app. Things look to be holding up OK so far.

I migrated note-sync to core2 today, which turned out to be an enormous hassle.

That's pretty much it: just some odds and ends now…

andymatuschak commented 3 years ago

Alright, I'm going to close this. Everything else can be done at our leisure. Hooray! I'm quite happy with how this came out.