Sync notes from markdown files into Orbit

I've been looking into the problem of how to sync SRS prompts from note-taking systems with Orbit.

The goal is to have a CLI tool that:

Takes as input an arbitrary set of notes, most likely a folder full of markdown-formatted text files
Parses those notes for cloze-deletion and Q&A-style prompts
Syncs those prompts to Orbit

The note-sync package already accomplishes most of this, but it has a few shortcomings:

It depends on a larger set of libraries that Andy wrote to solve a more general problem (computer-supported-thinking, spaced-everything, and incremental-thinking), which makes the code more complex than necessary.
Markdown files are expected to have a certain format consistent with being exported from Bear.
Prompts are being cached locally, which feels unnecessary, both from the perspective of performance and code complexity. (see note below about necessity of caching)

After spending a few hours reviewing the code and testing out the note-sync package, I have a few thoughts/opinions:

Just re-write it. It seems like it would be simpler to do a complete re-write. The one exception is the parsing code in incremental-thinking that parses qaPrompts and clozePrompts. That code is well-tested and has been in use by Andy for a while.
Don't support Anki. I don't think this package should support syncing prompts from markdown notes -> Anki. Given this package's scope is Orbit, I think it should focus on sync with Orbit only. Maybe Anki and Orbit should be able to sync with one another, so that Anki could be used as a review interface? If so, that feels like a separate concern. It looks like there is a package called anki-import that at least handles one-way importing.
Don't cache locally. I don't think this package should do any local caching. Based on the discussion of Idempotency and Identity, we should be able to simply compare hashes of the prompts to know whether a prompt is new or not. We could debate whether the Orbit API should allow a duplicate prompt to be created, but at worst, we only need to grab the hashes for all existing prompts from the Orbit API and then do some hash comparisons. I admit that I don't fully understand the caching code yet, so maybe I'm missing something!
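
To make the hash-comparison idea concrete, here is a minimal sketch. It assumes a prompt's identity is a SHA-256 digest of its type and content, and that the set of existing IDs has already been fetched from the API somehow. The names `ParsedPrompt`, `promptID`, and `findNewPrompts` are hypothetical, and Orbit's actual hashing scheme may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical prompt shape; Orbit's real prompt types are richer.
interface ParsedPrompt {
  type: "qa" | "cloze";
  content: string;
}

// Content-derived identifier: identical prompts always hash to the same ID.
// SHA-256 is an assumption here, not necessarily what Orbit uses.
function promptID(prompt: ParsedPrompt): string {
  return createHash("sha256")
    .update(JSON.stringify([prompt.type, prompt.content]))
    .digest("hex");
}

// Return only the prompts whose IDs are not already known,
// also dropping duplicates within the parsed batch itself.
function findNewPrompts(
  parsed: ParsedPrompt[],
  existingIDs: Set<string>,
): ParsedPrompt[] {
  const seen = new Set<string>();
  return parsed.filter((prompt) => {
    const id = promptID(prompt);
    if (existingIDs.has(id) || seen.has(id)) return false;
    seen.add(id);
    return true;
  });
}
```

With this shape, no local cache is needed: each sync re-derives the IDs from the notes and compares them against whatever the server reports.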

Thoughts on Provenance

How should this library handle provenance? Broadly, I have questions about how Orbit thinks about provenance, but scoping my questions to this library, it appears that the current implementation is caching provenance information locally, but not syncing that provenance information to the Orbit API.

The current implementation depends on a Bear Note ID at the bottom of the markdown file to determine provenance, which is obviously undesirable as notes could be exported from a variety of different note-taking systems.

If we'd like to track provenance, we could use the note's filename and modified date to populate the provenance data Orbit requires. Here's the PromptProvenanceType filled out:

{
  externalID: hash(note.filename),
  title: note.filename,
  modificationTimestampMillis: note.lastModified
}

One gotcha with provenance: because of the way we're handling Idempotency, moving a prompt from one file to another, so long as the prompt didn't change at all, would change the provenance information but not the identity of the prompt itself. That's probably desirable for the prompt, but the provenance information has changed. We'll need to account for that.

Again, there's a general question of whether we need to track provenance at all for this importer.

Thank you for looking at note-sync and for getting this conversation started, Jess!

I agree with you on all the shortcomings. On the opinions:

> Just re-write it.

Yes, let's. Probably best to extract the remark parser from incremental-thinking into an explicit package in this repo.

> Don't support Anki.

Yep, I agree that Anki sync should be handled orthogonally.

> Don't cache locally.

I think you're suggesting that to sync, we simply write logs to the server creating all the prompts found in the Markdown notes. If the server handles idempotency correctly, this would generate the correct behavior! I do worry about performance: it means reading, parsing, and transmitting the prompt content from every note on each sync. I have roughly 10^3 notes with roughly 10^3 prompts. A few tens of megabytes to parse; maybe a couple MB of data transmitted to the API. Maybe that's OK! Certainly it's a simple way to start.

Thinking purely about a local scenario, imagine you're on your Mac and editing your notes. Ideally, if you then switch into Orbit, you should be able to immediately review the prompts you've just added. I'm not sure what would trigger a sync in this scenario: a file watcher, a frequent timer, or an explicit user action. But I worry that this could be difficult to achieve if we're round-tripping all notes to the server.

Here's an alternative framing which might preserve the simplicity of your suggestion for a local context. I've been working on rearchitecting the data layer of Orbit as a simple syncable file format. So if you download the Orbit app, you'd end up with some Orbit.db in a folder on disk, which the app would read and write, and which could be intermittently synced to the server. It's not a cache, per se: more a replica. Right now, the app has a (non-shareable) data store, and this script has its own separate cache. But if you think of the app's data store as a real local file format, the script can just write to it directly, and let some other process handle over-the-network syncing.

In this context, "syncing" from Markdown notes might mean:

Parse prompts from all Markdown notes.
Use @withorbit/store (new package, spiked but not yet written) to add all prompts to the local data store. Most of these events would be no-ops.
The syncing service later (or perhaps immediately) syncs our local replica to the server. Only meaningful events would be transmitted.
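
The three steps above can be sketched roughly as follows. Since @withorbit/store is not yet written, everything here (`LocalStore`, `put`, `drainUnsynced`, `syncNotes`) is a hypothetical in-memory stand-in that only illustrates the idea of idempotent writes to a local replica.

```typescript
// Hypothetical event shape; not the real @withorbit/store API.
interface PromptEvent {
  promptID: string; // content-derived, so re-adding the same prompt reuses it
  content: string;
}

// In-memory stand-in for the local replica (e.g. an Orbit.db on disk).
class LocalStore {
  private events = new Map<string, PromptEvent>();
  private unsynced: PromptEvent[] = [];

  // Idempotent write: returns false (a no-op) if the prompt is already stored.
  put(event: PromptEvent): boolean {
    if (this.events.has(event.promptID)) return false;
    this.events.set(event.promptID, event);
    this.unsynced.push(event);
    return true;
  }

  // Hand the pending events to a separate sync process and clear the queue.
  drainUnsynced(): PromptEvent[] {
    const pending = this.unsynced;
    this.unsynced = [];
    return pending;
  }
}

// Write every parsed prompt to the store; returns how many were actually new.
function syncNotes(store: LocalStore, parsed: PromptEvent[]): number {
  return parsed.filter((event) => store.put(event)).length;
}
```

Running `syncNotes` twice over the same notes queues nothing the second time, so only meaningful events would ever reach the network.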

In this scheme, we still pay the price of reading and parsing all local note files each time we sync, but at least we don't have to transmit them all to the server.

If you're running in a context where you don't want to maintain a local replica, we can implement a persistence strategy for @withorbit/store which would operate over the network, i.e. via the API.

> it appears that the current implementation is caching provenance information locally, but not syncing that provenance information to the Orbit API.

The provenance info is indeed being sent to the API, in the taskMetadata field of the ingest logs. In particular, the Bear note ID is being preserved, along with the note title.

> The current implementation depends on a Bear Note ID at the bottom of the markdown file to determine provenance, which is obviously undesirable as notes could be exported from a variety of different note-taking systems.

Yep. When a stable identifier is not available, I can imagine using the note file subpath instead, as you suggest.
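
A minimal sketch of that fallback, assuming the three provenance fields quoted earlier in this thread. The `NoteFile` shape and `provenanceForNote` are hypothetical, as are the choices to hash the subpath with SHA-256 and to strip the `.md` extension from the title.

```typescript
import { createHash } from "node:crypto";

// Field names follow the provenance object quoted earlier in this thread.
interface NoteProvenance {
  externalID: string;
  title: string;
  modificationTimestampMillis: number;
}

// Hypothetical description of a note on disk.
interface NoteFile {
  subpath: string; // path relative to the notes folder
  lastModifiedMillis: number;
  bearNoteID?: string; // present only when the note carries a Bear note ID
}

function provenanceForNote(note: NoteFile): NoteProvenance {
  const fileName = note.subpath.split("/").pop() ?? note.subpath;
  return {
    // Prefer the stable identifier; otherwise fall back to a hash of the subpath.
    externalID:
      note.bearNoteID ??
      createHash("sha256").update(note.subpath).digest("hex"),
    title: fileName.replace(/\.md$/i, ""),
    modificationTimestampMillis: note.lastModifiedMillis,
  };
}
```

One consequence of the fallback: renaming or moving a file changes its `externalID`, so the note looks like a new source unless moves are detected separately.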

The "point" of provenance:

Display human-readable context in the UI (i.e. "where's this from?")
Ideally, allow one-tap access to that original context (doable for Bear, since it has a URL scheme... difficult to achieve generally for plaintext notes!)
Support future clustering presentations in the UI (i.e. group the prompts which came from the same note in list views)

> One gotcha with provenance: because of the way we're handling Idempotency, moving a prompt from one file to another, so long as the prompt didn't change at all, would change the provenance information but not the identity of the prompt itself. That's probably desirable for the prompt, but the provenance information has changed. We'll need to account for that.

Right! The syncing script is meant to update the provenance metadata for notes in this instance, when it notices that prompts move between files. The detection of the moves and the corresponding actions for Anki are implemented, but I didn't yet implement the relevant Orbit actions. They do exist (log type updateMetadata).

(see [andymatuschak/orbit#54 note-sync: implement support for “move” operations])
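
For illustration, move detection could look something like the sketch below. It assumes the script can build a map from prompt ID to the containing note's external ID for both the previous and the current sync. Only the log type name `updateMetadata` comes from this thread; the action's fields and `detectMoves` are hypothetical.

```typescript
// Hypothetical action shape; the real updateMetadata log has its own fields.
interface UpdateMetadataAction {
  type: "updateMetadata";
  promptID: string;
  newNoteID: string;
}

// Both maps go from prompt ID to the external ID of the note containing it.
function detectMoves(
  previous: Map<string, string>,
  current: Map<string, string>,
): UpdateMetadataAction[] {
  const actions: UpdateMetadataAction[] = [];
  current.forEach((newNoteID, promptID) => {
    const oldNoteID = previous.get(promptID);
    // A move: the same prompt existed before, but under a different note.
    if (oldNoteID !== undefined && oldNoteID !== newNoteID) {
      actions.push({ type: "updateMetadata", promptID, newNoteID });
    }
  });
  return actions;
}
```

Prompts that appear only in `current` are new rather than moved, so they produce no action here.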

> Again, there's a general question of whether we need to track provenance at all for this importer.

I think we can be a bit fuzzy about it, but displaying the context in the UI is pretty important. For instance, yesterday I was writing a note called "Tachistoscope". One paragraph in that note:

After the stimulus is briefly presented, it’s often followed by {a mask} (e.g. {a random pattern of related forms, like jumbled letters for a word stimulus}), intended to {interrupt further processing of the original stimulus}.

This paragraph only makes sense as a cloze deletion if the note's title ("Tachistoscope") is displayed above it.
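
As a rough illustration of why the title matters, here is a sketch that renders one cloze card from a paragraph like the one above. It uses a simple regular expression as a stand-in for the real remark-based parser in incremental-thinking, and `countClozes` and `renderCloze` are hypothetical names.

```typescript
// Count the {…} spans in a paragraph (no nesting supported in this sketch).
function countClozes(paragraph: string): number {
  return (paragraph.match(/\{[^{}]+\}/g) ?? []).length;
}

// Render one review card: hide the deletion at `hiddenIndex`, reveal the rest,
// and put the note title above the paragraph for context.
function renderCloze(
  noteTitle: string,
  paragraph: string,
  hiddenIndex: number,
): string {
  let index = 0;
  const body = paragraph.replace(
    /\{([^{}]+)\}/g,
    (_match: string, inner: string) =>
      index++ === hiddenIndex ? "____" : inner,
  );
  return `${noteTitle}\n${body}`;
}
```

Without the first line of the rendered card, a reviewer sees "the stimulus" with no indication of which device or procedure is being described.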

andymatuschak / orbit

Sync notes from markdown files into Orbit #220
