andymatuschak / orbit

Experimental spaced repetition platform for exploring ideas in memory augmentation and programmable attention
https://withorbit.com

Project: data architecture improvements #192

Closed andymatuschak closed 3 years ago

andymatuschak commented 3 years ago

As I implemented the API for Orbit, I noticed that I made syncing very cost-inefficient by giving each task state its own hash graph. Since I'll need to run a full-database migration to change this (see #188), I figured I should simultaneously make any other changes I'd like to see in the data model—these migrations are a hassle.

A few key things I'd like to do:

  1. Prepare the data model for future variation: non-prompt tasks and different scheduling strategies. In particular I'd like to eliminate the distinction between ActionLog and PromptActionLog, and I'd like to make PromptState become TaskState. Tasks should be able to specify an "interaction type" (memory, exercise, activity, etc) in addition to describing the contents; the interaction type will determine the feedback semantics and the default scheduling behavior. I'll at least partially think through how, e.g., an incremental reading client or a timeful text might be represented (a rough interface sketch follows below).

  2. If possible, unify the three stored types (logs, prompts, and attachments) as "objects," so that storage / syncing modules can treat them more uniformly.

  3. I'll sketch out how edits to prompts and organization operations (tags, folders) might work, just to make sure I'm not backing myself into a corner.

I'll start by just writing out new interfaces and playing with those. Then I'll perform the full migration. Then I'll have to work on #188.
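To make point 1 a bit more concrete, here's the kind of shape this implies; a rough sketch only, with illustrative names rather than final interfaces:

```ts
// Hypothetical sketch only: these names are illustrative, not the final interfaces.
type TaskInteractionType = "memory" | "exercise" | "activity";

interface TaskContent {
  type: "qa" | "cloze" | "plain";
  body: string;
  attachmentIDs: string[];
}

interface TaskComponentState {
  dueTimestampMillis: number;
  intervalMillis: number;
}

interface Task {
  id: string;
  // Determines feedback semantics and the default scheduling behavior.
  interactionType: TaskInteractionType;
  content: TaskContent;
  // Many-to-one: e.g. one state per cloze deletion, all sharing one piece of content.
  componentStates: { [componentID: string]: TaskComponentState };
}
```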

andymatuschak commented 3 years ago

As I sketch out new data structures, I notice that my approach is highly entangled with the important question of how to efficiently sync these structures (i.e. see #188). So I’ll have to discuss both at once.

I also notice just how much complexity I’ve caused for myself by framing the data model as a hash graph, with stable content-addressed identifiers and immutable contents. Yes, such structures have many excellent theoretical properties, but boy do they make it more difficult to iterate, and that is absolutely the wrong trade to be making for this project right now.

Some notes on data structure desiderata:

Modulo the complication of the many-to-one relationship between task states and task content, my instinct would be to simply combine all these elements. The tractability of that approach really depends on how we store, sync, and query tasks.

I’ll now describe three approaches to syncing which seem plausible to me.

(A much simpler) event sourcing model

The log-driven state model in the current system is more typically called an “event sourcing” model. The idea is basically: you don't transmit the state of the system; you transmit actions, then compute an effective state locally. State is always just a cache, computed over the action stream.
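Stripped of all the extra machinery, the core of the pattern is just a fold over the action log; a minimal sketch with illustrative types:

```ts
// Minimal event-sourcing sketch (illustrative types): state is never stored
// authoritatively; it's a cache computed by folding over the action stream.
interface ActionLogEntry {
  entityID: string;
  timestampMillis: number;
  // ...action-specific payload
}

type Reducer<State> = (oldState: State | null, entry: ActionLogEntry) => State;

function computeState<State>(log: ActionLogEntry[], reduce: Reducer<State>): State | null {
  return log.reduce<State | null>((state, entry) => reduce(state, entry), null);
}
```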

This doesn’t have to be terribly complex, but our current model includes several decisions which dramatically increase the complexity:

All these things create nice properties! But they also drastically increase the complexity of the system, both to maintain and to iterate.

We could create a much simpler event sourcing model in which:

Attachments would still need to be fetched out-of-band on native clients, which is a nuisance for maintaining consistency. But this would be much, much easier to maintain than our current model.

This doesn’t imply much change to our data stores: we’d still use LevelDB / IDB to store and query. We can still read/write to Firestore for now on the server, but this much dumber model will be easier to write a self-hosted backend for.

Sync states, not events

An even “dumber” approach would be to sync the computed state snapshots, rather than the events which produced them, with last-writer-wins or very simple conflict resolution logic.

The simplest form of this would be to consider each task as a single document, with a modification time. To sync: keep track of the last time you synced with a server; send/request all documents modified after that time.
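In code, the whole protocol is roughly this (the DocumentStore interface and its method names are hypothetical):

```ts
// Sketch of modification-time document sync with last-writer-wins semantics.
interface TaskDocument {
  id: string;
  modifiedAtMillis: number;
  // ...task content and computed state
}

interface DocumentStore {
  listDocumentsModifiedAfter(timestampMillis: number): Promise<TaskDocument[]>;
  // Assumed to keep whichever copy has the newer modification time on conflict.
  putDocuments(documents: TaskDocument[]): Promise<void>;
}

async function syncDocuments(
  local: DocumentStore,
  remote: DocumentStore,
  lastSyncTimestampMillis: number,
): Promise<number> {
  const [localChanges, remoteChanges] = await Promise.all([
    local.listDocumentsModifiedAfter(lastSyncTimestampMillis),
    remote.listDocumentsModifiedAfter(lastSyncTimestampMillis),
  ]);
  await remote.putDocuments(localChanges);
  await local.putDocuments(remoteChanges);
  return Date.now(); // becomes the next lastSyncTimestampMillis
}
```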

These documents could even be serialized as simple files on disk. We could implement a trivial “backend” via rsync to GCS. People could self-host with Dropbox! Mobile and web clients would use LevelDB / IDB directly as their document store, rather than the file system, so I suppose we’d need a separate rsync-like backend for them to sync.

Orbit clients would need to separately maintain an efficient index (e.g. of tasks by due time) in order to present their interfaces efficiently. This turns out to be more possible than I’d thought. I was somewhat surprised to find while investigating this model that even running through Node.js, I can lstat 50k files on my several-year-old MacBook Pro in about 200ms. I’d expected using flat files to be totally intractable… but I guess maybe it’s not.
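(For reference, the measurement was along these lines; the directory path below is a placeholder.)

```ts
// Rough reconstruction of that kind of quick benchmark.
import { lstatSync, readdirSync } from "fs";
import * as path from "path";

const taskDirectory = "/path/to/a/directory/with/50k/files"; // placeholder
const filenames = readdirSync(taskDirectory);
const startMillis = Date.now();
for (const filename of filenames) {
  lstatSync(path.join(taskDirectory, filename));
}
console.log(`lstat'd ${filenames.length} files in ${Date.now() - startMillis}ms`);
```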

Computing server indexes (e.g. for notifications) becomes more complicated in this regime: we’d need to attach cloud functions to GCS file changes and update a queryable database accordingly. Not so bad.

A variant of this model, which would be somewhat more resilient to conflicts, would be to model each task as a folder comprising content.json, review/TIMESTAMP-1.json, review/TIMESTAMP-2.json, review/TIMESTAMP-3.json, etc. That is, the data modeling each specific review action would be broken out into its own file. In this model, reviews performed on separate devices on the same task (a fairly common case) would not cause a conflict. The downside is increased complexity and an order of magnitude more files to deal with syncing. The latter could be a real problem if we use something like rsync naively, since its protocol involves sending over a complete file list with timestamps/hashes.
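Concretely, a review in that variant would land on disk something like this (layout and helper names are illustrative):

```ts
// Sketch of the folder-per-task variant: each review action lands in its own
// timestamped file, so reviews on different devices don't touch the same file.
import { promises as fs } from "fs";
import * as path from "path";

interface ReviewAction {
  taskID: string;
  timestampMillis: number;
  outcome: "remembered" | "forgotten";
}

async function writeReview(storePath: string, review: ReviewAction): Promise<void> {
  // e.g. <storePath>/<taskID>/review/1600000000000.json
  const reviewDirectory = path.join(storePath, review.taskID, "review");
  await fs.mkdir(reviewDirectory, { recursive: true });
  await fs.writeFile(
    path.join(reviewDirectory, `${review.timestampMillis}.json`),
    JSON.stringify(review),
  );
}
```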

This would be quite an invasive change to both server and client. I’d probably begin with this model by implementing the all-in-one task-as-document model, using LevelDB/IDB as store, since the mobile/web clients need that anyway. Then we could add a file system-based store and better conflict resiliency in time.

One interesting advantage of this model is that clients don’t need to contain the complex logic necessary to reconstruct a task state from its actions. If you wanted to write something that interacted with Orbit data structures from, say, Python, it’d be much easier this way.

Lean on CouchDB

Both of these approaches involve implementing our own syncing protocol to varying degrees. I can’t help feeling like this is silly, and something we should really try to avoid. Isn’t there a general-purpose syncable data store we can use?

This is basically the mission of Apache CouchDB. CouchDB is written in Erlang, so it’s something we’d run on the server; PouchDB is a JavaScript implementation. The big draw of the system is that it implements its own replication, certainly much more robustly than anything we’d manage. The database model itself is much like LevelDB, so we already know the model is a decent fit for our needs. PouchDB would store the database in native LevelDB bindings in our native apps and IDB on the web.

We could use CouchDB to replicate a data model based on either of the two approaches above (event-based or document-based). If we wanted to expose a file system interface for the latter, we could always use FuseFS or something.

Another interesting advantage of CouchDB is that it’s a general replication protocol (like rsync), so anyone could run their own CouchDB server, point Orbit at it, and voila! Self-hosted sync, with no extra effort. This is great!
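The replication side of that would be nearly free on our end; something like the following (database names and URLs are placeholders):

```ts
// Sketch of PouchDB replication against a user-supplied CouchDB server.
import PouchDB from "pouchdb";

const localDB = new PouchDB("orbit"); // LevelDB-backed on native, IndexedDB on web
const remoteDB = new PouchDB("https://couch.example.com/orbit-username"); // placeholder

// Continuous two-way replication; PouchDB/CouchDB handle the protocol details.
localDB
  .sync(remoteDB, { live: true, retry: true })
  .on("change", (info) => console.log("replicated", info))
  .on("error", (error) => console.error("replication error", error));
```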

My primary concerns with adopting CouchDB:

  1. We’d have to operate the servers ourselves—monitoring, availability, scaling, sharding, the whole shebang. I don’t know in advance how much effort this would be, and I also don’t know how many cloud VMs we’d need to offer acceptable performance (so this might be quite expensive… I don’t think so, but I can’t tell). It seems like a shame to do this when hosted cloud databases are so cheap and simple. As it happens, there is a hosted solution for CouchDB (Cloudant), but it’s shockingly and prohibitively expensive for this project.
  2. CouchDB is quite complex and quite general. It handles syncing for us, but we’d be buying into a lot of surface area. And because it’s general, it would not be as efficient as some of the options above, which can be more tightly scoped to our needs. For instance, it (like our current model) maintains a distinct revision graph for each document. We don’t need that complexity, but we’ll pay for it on every sync.

It would also create some extra complexity on our backend. The semantics of its current authentication model would mean giving each user their own database. It’s not possible to make queries across databases, so we’d need to run a service which selectively replicates active user databases into an admin-read-only “merged” database to run cross-user queries (e.g. for notifications). We’d also need to build some glue to bridge our existing authentication system into its internal user management system. I don’t think either of these things would be too onerous, but they do add complexity and failure surface.


I’m going to let this simmer for the weekend. Each approach has significant advantages and disadvantages; there isn’t a clear winner in my mind at the moment.

andymatuschak commented 3 years ago

Reflecting on this over the weekend, one thing which makes me lean towards event-sourcing (or at least away from document-based methods) is that for research analysis purposes, I really do need a log of events which took place. It's not even enough to just separate out all the review actions, since I'll almost certainly also be interested in events like "user deleted this prompt" (was that because it wasn't interesting/meaningful? this is an important behavior to examine!).

So if I go with the document-based model, I end up needing to implement diffing to extract the events. The combination is perhaps still less complex than the event-sourcing model, but it strikes me as brittle/lossy, and it does seem to be giving up much of the simplicity the document-based model enjoyed.

kirkbyo commented 3 years ago

Very interesting. There isn't a clear winner in my mind either, but a few points came to mind as I was reading:

andymatuschak commented 3 years ago

Thanks, those are yet more good reasons to lean towards a relatively minimalistic event sourcing model! I'm digging in on the next level of detail now…

andymatuschak commented 3 years ago

Last night I pushed the work I’ve got so far on a new data model and approach to syncing. It’s in a new package called store, which is also (temporarily) holding a new version of core called core2.

Overall, I’ve tried to use this opportunity to simplify things a great deal, while leaving us more room to expand in the near future. Key differences in the data model:

The broad architecture for these data types is the same, but made somewhat more general. I’ve implemented a standardized event sourcing data store. What that means is that the only real things in this system are events, which describe changes to entities. The only type of entity right now is a task, but future entities might include collections and tags. Task data structures aren’t modified directly; they’re computed from the event sequence through a “reducer” which defines how an event modifies an entity. For the sake of performance, these snapshots are computed when writing new events to the data store, saved, and indexed. But they should be viewed as a cache; events are the source of truth.
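Roughly, the shapes involved look like this; a simplified sketch, not the real core2/store types, which differ in detail:

```ts
// Simplified sketch: events are the source of truth, entity snapshots are a cache
// computed by a reducer and can always be recomputed by replaying the event sequence.
type OrbitEvent =
  | { type: "taskIngest"; entityID: string; timestampMillis: number; spec: TaskSpec }
  | { type: "taskRepetition"; entityID: string; timestampMillis: number; outcome: "remembered" | "forgotten" }
  | { type: "taskDelete"; entityID: string; timestampMillis: number };

interface TaskSpec {
  // ...content, interaction type, etc.
}

interface Task {
  id: string;
  spec: TaskSpec;
  isDeleted: boolean;
  dueTimestampMillis: number;
}

// The reducer defines how each event modifies a task snapshot.
function reduceTask(oldTask: Task | null, event: OrbitEvent): Task {
  switch (event.type) {
    case "taskIngest":
      return {
        id: event.entityID,
        spec: event.spec,
        isDeleted: false,
        dueTimestampMillis: event.timestampMillis,
      };
    case "taskRepetition":
      return { ...oldTask!, dueTimestampMillis: nextDueTimestamp(oldTask!, event) };
    case "taskDelete":
      return { ...oldTask!, isDeleted: true };
  }
}

declare function nextDueTimestamp(task: Task, event: OrbitEvent): number; // scheduler, elided
```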

Events and entity snapshots are stored in a database that’s meant to be an accessible file format. My intent is to make the store library a public API. If you want to write a desktop script / app which interacts with Orbit (like our note syncing script), the best way to do it is to have that script simply read/write the local data store. Then, if it wants, it can explicitly sync the local store with the server. This way changes from local scripts can be reflected offline, and without a round trip to a server API. The server API becomes something to use in contexts where you wouldn’t have a full local store, like a highlighting mashup embedded into a web page.
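For instance, a note-ingestion script might look roughly like this; all of the store API names below are placeholders, not the final interface:

```ts
// Hypothetical sketch of a local script writing directly to the on-disk store.
// openOrbitStore, putEvents, sync, and close are placeholder names.
import { randomUUID } from "crypto";

interface LocalStore {
  putEvents(events: object[]): Promise<void>;
  sync(options: { serverURL: string }): Promise<void>;
  close(): Promise<void>;
}

declare function openOrbitStore(storePath: string): Promise<LocalStore>; // placeholder

async function ingestPrompt(promptMarkup: string): Promise<void> {
  const store = await openOrbitStore("/path/to/local/orbit/store"); // placeholder path
  await store.putEvents([
    {
      type: "taskIngest",
      entityID: randomUUID(),
      timestampMillis: Date.now(),
      spec: { contentMarkup: promptMarkup },
    },
  ]);
  // A script can stop here and work entirely offline, or explicitly reconcile:
  await store.sync({ serverURL: "https://withorbit.com" });
  await store.close();
}
```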

Syncing is pretty straightforward in this model:

  1. Send the server all the events which occurred after the last event we sent.
  2. Request all the events after the one we last received.
  3. When receiving new events, update the corresponding entities’ snapshots.

But “after” is a bit tricky here: you don’t want to make a syncing system whose correctness depends on accurate and monotonic client clocks. There are lots of ways to get well behaved distributed clocks, but for now I’ve chosen an exceedingly simple approach, which should work fine for us, though it means we can’t easily sync peer-to-peer—only client-server. My approach is: client-side, a simple sequence number functions as a local logical clock; server-side, we rely on Firebase’s ServerTimestamp, which uses Google TrueTime to produce monotonic timestamps. In practice this means simple data types for our clocks and a simple comparison for “after.”
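Concretely, each event carries both clocks, and “after” reduces to a plain comparison on each side (field names are illustrative):

```ts
// Illustrative sketch of the two clocks and the two "after" comparisons used in sync.
interface StoredEvent {
  // Client-side logical clock: increments with every locally-recorded event.
  localSequenceNumber: number;
  // Assigned by the server when it first receives the event; null until then.
  serverTimestampMillis: number | null;
}

// Client → server: send everything recorded after the last event we sent.
function eventsToSend(events: StoredEvent[], lastSentSequenceNumber: number): StoredEvent[] {
  return events.filter((event) => event.localSequenceNumber > lastSentSequenceNumber);
}

// Server → client: request everything stamped after the last event we received.
function isNewFromServer(event: StoredEvent, lastReceivedServerTimestampMillis: number): boolean {
  return (
    event.serverTimestampMillis !== null &&
    event.serverTimestampMillis > lastReceivedServerTimestampMillis
  );
}
```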

We do still need to rely on local timestamps for the spaced repetition algorithm itself, which depends on clock time deltas. My simple approach to the distributed clock problem means there could be events which violate causation (ie the logical timestamps and client timestamps are reversed), but I don’t think that really matters in practice, so long as the reducers are well defined. If necessary, we could solve this by using a hybrid logical clock, which can capture both approximate relative clock times and strict causative ordering.

Still to do:

Phew. That’s a lot! Help with any/all welcome.

Cc @kirkbyo

andymatuschak commented 3 years ago

Unfortunately I've discovered today (thanks, tests!) that Firestore's serverTimestamp doesn't actually use TrueTime as I thought, so we can't use it to get a free monotonic clock.

This is a bit tricky, since (at least in our current architecture) API requests are potentially distributed to many independent servers. We'll have to arrange monotonicity using transactions or using something like a hybrid logical clock salted with random bits (to accommodate multiple server nodes).
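For illustration, the salted-HLC idea is roughly the following; a sketch of the general technique, not necessarily how the eventual implementation works:

```ts
// Rough sketch of a salted hybrid-logical-clock-style ordered ID.
let lastTimestampMillis = 0;
let counter = 0;

function nextOrderedID(): string {
  const now = Date.now();
  if (now > lastTimestampMillis) {
    lastTimestampMillis = now;
    counter = 0;
  } else {
    // Logical component: keeps IDs monotonic even if the wall clock stalls or rewinds.
    counter++;
  }
  // Random salt makes collisions between concurrent server nodes unlikely.
  const salt = Math.floor(Math.random() * 0x10000);
  // Fixed-width hex components so the IDs sort lexicographically.
  return [
    lastTimestampMillis.toString(16).padStart(12, "0"),
    counter.toString(16).padStart(4, "0"),
    salt.toString(16).padStart(4, "0"),
  ].join("");
}
```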

andymatuschak commented 3 years ago

The salted HLC does seem to be fine for the purposes of our stable listing contracts. Linking here for posterity: https://github.com/andymatuschak/orbit/blob/master/packages/backend/src/backend/2/orderedID.ts

andymatuschak commented 3 years ago

Some updates: sync is now implemented (in a new package, sync). I've also implemented migration routines; existing API endpoints now simultaneously write data structures in the new format (this'll help us test and verify).

andymatuschak commented 3 years ago

The new scheduler, syncing module, and backend implementation are now all used end-to-end in the local configuration. Incremental syncing seems to be working nicely.

Still working through pushing the new types through all of app (currently "backporting" types outside the model layer).

andymatuschak commented 3 years ago

app and web-component are now fully migrated to core2 in the core2-app branch. That's it for the key packages! I've also implemented BigQuery logging for the new data types.

Next up: I'll deploy the write-twice cloud functions, migrate my own user data, and live on the core2 app for a few days to make sure I don't see anything unexpected.

andymatuschak commented 3 years ago

I've migrated my own user now and worked through a lot of small issues which arose through that work. Double-writes appear to be working as expected.

A full cold sync of my database is pretty slow: 24s for a few tens of thousands of events. Of course, no one else has a large database, so this isn't an issue in practice. But this'll need to be optimized if the platform gets real use. The easiest thing would be to build a "cold start" API which lets you directly download / upload the on-disk representation.

I'll watch for errors as I review in the next few days, then I'll run the migration across all users.

I can't deploy the new app until a few more small details get wrapped up:

andymatuschak commented 3 years ago

Migrated all users today, fixing a number of small bugs in the process. I haven't yet deployed the core2-based app; I'll monitor data from the write-twice service for the next 12 hours or so first.

andymatuschak commented 3 years ago

Notification logic has been migrated, and I've deployed the core2 app. Things look to be holding up OK so far.

I migrated note-sync to core2 today, which turned out to be an enormous hassle.

That's pretty much it: just some odds and ends now…

andymatuschak commented 3 years ago

Alright, I'm going to close this. Everything else can be done at our leisure. Hooray! I'm quite happy with how this came out.