Import/Export: markdown, json, edn

tangjeff0 commented 3 years ago

Discussed in https://github.com/athensresearch/athens/discussions/874

^{Originally posted by **jsmorabito** March 25, 2021} - [ ] markdown - [ ] json - [ ] edn which other file types would be good?

canny[bot] commented 3 years ago

This issue has been linked to a Canny post: Import/Export Markdown :tada:

canny[bot] commented 3 years ago

This issue has been linked to a Canny post: Import/Export to markdown :tada:

Limezy commented 3 years ago

which other file types would be good?

Well... obviously the .transit ! I want to be able to merge two different athens db

sawhney17 commented 3 years ago

One thing I thought about, every app has their own version of Json and EDN, will it be possible to create one that does a catchall evne if they were created in different apps .

julionav commented 3 years ago

Follow this thread for more Jeff talks Jeff episodes

tangjeff0 commented 3 years ago

LOL context is for a Canny demo @ozimos : https://www.loom.com/share/ce0bcd39d102453c8bff076348b4262c

elazar commented 3 years ago

Being able to export to a static HTML site, especially one conducive to applying custom styles, would also be really nice.

seltzered commented 3 years ago

Just trying to articulate this, is this purely around import/export to a format (such as markdown) OR also for possibly facilitating viewing of say, markdown files sitting locally/on a locally-synced storage (e.g. dropbox) akin to obsidian (while still supporting the tracker format)?

Vaguely recall the latter idea mentioned in the discord some time ago.

sam-goode commented 3 years ago

For me, having the ability to export all my pages as markdown files would make me feel more comfortable using Athens. That way I would know if development stops in future or I want to switch to another tool, I can do so relatively easily.

This could also act as a way of backing up files in case of corruption, though arguably one could just backup the tranist file I guess.

Import of markdown files would make it easier to start using Athens when coming from other tools. I spent quite a bit of time copying and pasting stuff when I started out.

Syncing of notes to local files sounds incredibly useful to be able to sync and view them on a mobile device, though I wonder what the complexity and performance implications might be. I guess this would involve making Athens act more like logseq, though the main reason I use Athens over logseq is the performance (logseq seems really slow on my system). If performance weren't an issue, I'd rather have Athens read and write to markdown files stored in a local or remote directory vs the tranist file

bshepherdson commented 3 years ago

I've written a (hacky, best-effort, known-bugs) first crack at exporting to (human friendly, Logseq-compatible) Markdown here.

My dream for Athens is a similar workflow to Logseq, where writes go to both the files and the index, and if the files change separately (eg. Git pull, hand editing, whatever) you can reindex from the files into the database.

So my questions for the Athens team are:

Would such a one-way export be welcome as a supported feature of Athens?
Would such a two-way scheme (with the files as the source of truth and database as a recreatable artifact) be welcome as a supported feature of Athens?
If full two-way writes are not something the Athens team would care to support, what about keeping the current database-first scheme but with (ideally fast, incremental) export/import functions?

filipesilva commented 3 years ago

@shepheb short-form answer to your questions first:

would definitely like better export, but we find it very hard to commit to specific product formats (e.g. logseqs)
definitely can't commit to md as source of truth, we're not sure of the totality of data to keep, and we're also not sure how this should work with RTC
interested in fast, incremental export/import but also think this is actually a harder problem than it looks like

I want to expand upon these though.

First, on formats. It's really hard for us to commit to first party support of any third party format. It's a bit of a losing game as far as format versioning goes, any high fidelity format creation requires intimate knowledge of how that format works.

This sounds a lot like something that could be supported by contributors, but the part about versioning means that changes to other products would make our product broken. This is already a problem with the Roam import.

On its surface this looks like a lowest common denominator problem, where there's an export format that's general enough that a bunch of these tools can use to transfer data between them while still keeping the basic characteristics of knowledge graphs. I don't think this format exists right now and I don't think the incentives are there for the current major players to figure it out and stick to it (emphasis on the latter).

Lacking a lowest common denominator, the other obvious answer is to have a n-to-n conversion between these apps. This still sounds a lot like having that common format anyway, but the key difference is it doesn't need buy-in from the apps. It still needs enough knowledge about the app-specific formats to produce its output though.

The matter of format also doesn't cover synchronisation semantics. These apps are pretty ok at letting your import your stuff in once. But for sync from the format, each app needs to have its own way of incorporating new data as well. The markdown source-of-truth apps already do a great job here. But given that for Athens we can't commit to custom markdown as the substrate, then this is still an issue.

For single user, no concurrent updates, no conflicts, synchronisation could just be modelled as export and then import as a new db on each app change. But this breaks down at scale of data, and also is limited to at best preserving the subset of supported functionality for each import/export format.

I think that fast+incremental import/export is actually the key blocker here for a third-party format sync. Getting each app to provide a sync format is harder than getting them to provide a single-shot import/export format.

The closest I can think of for incremental import/export, where all parties have their incentives aligned, is each party having an API. I think all these apps want to have an API for general purpose integration, and third party sync just ends up being an integration like any other.

At its core those APIs provide a way to change the data (incremental import) and read the data (export). They'd also provide a guarantee that you don't have to know their format intimately to do this, instead you'd have to know their public API.

They wouldn't really provide an incremental export though. That still requires knowledge of the other apps data, and after that it boils down to a diffing problem, albeit a sophisticated one. Something needs to get the other apps data, possibly from their export or API reads.

Also interesting that the incremental import/export problem exists within the same app, on different dbs. Efficient syncing of two different dbs on the same app still is an open problem IMHO. I think the markdown apps might have a better answer for this for now but would like to understand the long-term implications better before feeling confident that's the better approach.

So our current thinking is that incremental import/export is a superset of single-shot import/export, and that the real solution to incremental import/export is APIs and not backup formats.

So where does that leave you, and other people that would actually like this sync to exist? I think the most practical roadmap is something like:

someone makes a n-to-n backup converter for existing apps as a third party tool
- if the app does not have an accessible source-of-truth, users still have the bad user experience of doing full import/export on each cycle
- tool is still limited to its interpretation of the backup formats
- other generic formats can also be supported (e.g. roam-to-csv)
as each app adds an API, this (or another) tool uses the API to effect the export and incremental import for app

I think the places where my reasoning here might fall short is me underestimating the value of the markdown format for non-sync related use cases, and overestimating the specificity of each backup format. But I feel pretty confident that there isn't a generic incremental sync that doesn't avail of some sort of API.

bshepherdson commented 3 years ago

First, thank you for the detailed and thoughtful reply!

I agree that an API is likely the key to universality here. I imagine it can work kind of like the Language Server Protocol compared to the old days of each editor needing a mode for each language. That is, once there are tools that use some common API for export and import, each app can get a lot of functionality for free by hooking up to that API.

The trick of course is to design the API to capture everything necessary without getting too parochial for any one app's approach. I imagine the scope here as for the outliner-style PKMs: Roam, Athens, Logseq, etc. (Obsidian is fundamentally a file editor and doesn't really fit here, for example.)

I sketched up a design that I think captures the needs without imposing too much burden (on the data model, or in code). I think it fits into the roadmap fairly well - it can be used as the foundation for external tools to export and import from each app's format, then integrated into the apps over time.

lahwran commented 3 years ago

That protocol design looks like a great start to my eye. questions for @shepheb most of which are me looking for things to ask and can all be dismissed as "figure out later" if that seems appropriate:

are you imagining any kind of standardization on what IPC/network protocol this message format would happen over? eg, Unix socket, tcp socket, websocket, or just "generalized filesystem-like address"/”ie leave it up to the individual implementations”?
any bug in an app's tracking of which-name-means-which-id, eg a missed move message, could cause the rest of the changes to become invalid. This is especially relevant for data sources. What are your thoughts on including an additional correctness check of some kind? that, or simply include as part of the finalized protocol RFC a warning statement highlighting this issue: that missed messages cannot be fixed automatically and it is up to the implementor to guarantee safety & correctness as they see fit.
related to the previous point, but also relevant for anyone who would like to support round-trip: how would you feel about standardizing a format for globally unique, non-relative id-ish addresses? eg, anywhere :address is required, also allow an optional :id_hint to be used if desired by the receiving application, but which may be ignored. that allows an external application (eg, a tag based filesystem ) to track its own ids for things. or is your answer more or less that external-data-source style implementations of this protocol should always merge/overwrite their owned subtrees when first connecting?
What are your thoughts on how to tag a block to specify that it's owned by a particular application? if this wire protocol is only intended for one-way communication that simplifies this away, but may be at the cost of effectiveness at tasks which require an external data source. if two way communication is an intended use case, then pkm apps could, for example, temporarily mount another pkm app as a page, with the other app's pages all appearing as sub-blocks of the local page (see also VFS apis); but that would work best with clear ownership, where the provider application only responds to update messages, and the application which is mounting the external data source would need to know that if it lost connection to the other application, changes within the mounted subtree are not possible until it reconnects to the canonical data source for those blocks.
continuing on that thread, what are your thoughts on specifying "please remind me what you currently think this subtree contains" requests?
Athens team, how does this protocol compare to the current RTC protocol? would it be difficult or easy to map the current rtc protocol to this proposal? what other concerns does the RTC protocol handle that this does not?

bshepherdson commented 3 years ago

Transport layer: no, I think that need not be standardized. I suspect most apps would use HTTP over TCP sockets, but I think there's no particular reason to specify that.
I hadn't been worrying too much about missing or reordered messages; I was viewing this at a database level of integrity - you've got bigger problems if these updates are being missed.
I thought hard about global IDs. "Everything gets a unique, permanent ID number" is a pretty solid rule, and if all the relevant tools were database-backed (like Roam and Athens) then I would just assign each block a UUID. The problem is the more textual tools (Logseq, org-mode) where there's not a great place to store it.
- Logseq can attach IDs to arbitrary blocks, but it's usually only done for the targets of block references, and requiring it to keep track (in order to understand incoming addresses) would be either clunky in the files or a major round-trip gap.
I have no particular thoughts on subtree requests. My main use-case here is Athens -> Markdown files (in Logseq style, probably) so I can check the files into git or Google Drive or something. Ideally I want to treat them as the source of truth, rather than Athens' database. Sharing the files or database across devices is part of this too.
I didn't really specify this fully, but I think the "read the current state of this block" would include (optionally?) recursing to all its children, which sounds like what you want?

To be clear, this is about storage and interop for a single user's instance, and doesn't directly interact with RTC.

filipesilva commented 3 years ago

Heya @shepheb, this is a really good write up of a lot of important stuff on this topic that wasn't really tightly packaged before! I took some notes below from reviewing it, apologies for it not being more structured.

KMA

First, let me name this at least in the scope of my comment, so it's easier to talk about it. I'll call it KMA, standing for Knowledge Management API.

I think there's an underlying expectation that individual *KM providers would interface with the KMA format somehow. I'm not sure this is realistic, insofar as it requires work on their part and that's an opportunity for coordination failure. If providers have an API from where you can extract the increment update information though, that's should be enough to convert to whatever KMA needs.

But even without their active participation you can still convert the backup formats into KMA format, and by diffing two KMA dbs you can derive the update set. I feel this is a good place to go: incremental updates as the first class citizen and being able to go down to backup conversion.

On addressing blocks

Not too keen on a page->index->index->index->... approach, as it effectively ties a block address to it's parents address and makes moving blocks around a nightmare. You did cover move on your spec but I felt it was underspecced, and it's one of several more complicated operations that are not primitives and need special handling.

It also means you cannot take a block in isolation outside the page (e.g. embeds, plain block refs) without exposing extra information about the enveloping structure on the caller side, which means the caller must also be updated when structure changes.

The indexes look like derived/cached data to me, where the real source of truth is the parent/child block relationships.

On block content

I think we can be pretty confident that a block has a given string content, and that the AST you propose is the result of parsing that string structure. This is pretty nice insofar as it does away from the need of having KMA parse stuff.

But I think it hides the true nature of what's going on: it's not so much that each app parses strings but moreso that each app has a way to derive graph information from those strings. For KMA it does not matter that the string is markdown so much as it matters that the block references addresses, presumably encoded on the string content. The encoding of these addresses is opaque and depends on the app parser. The extracted addresses themselves should be explicitly provided to KMA.

What I'm getting at is that the string content of a block is but one of its properties. I think the apps would here provide granular block updates instead of an ever-encompassing string/ast update. For instance, an app can update the content, the content and the referenced addresses, or other properties the block might have.

I don't feel confident that I know the totality of a blocks properties right now. But I feel very confident that string content and refs, instead of just string content, is the starting set and there might be more.

Block updates

Don't like the idea of dangling refs much, but I think it's the minimum common denominator. One way or another we're going to have broken refs, especially when considering cross-graph refs.

Wire format

Feel this bit is an unnecessary weak point, insofar as specifying the format is not necessary given the information needed for block updates is specified.

Couple of thoughts from my side

Temporality

I don't know if the KMA should encode this, but I feel strongly that any app with distributed storage must allow for optimistic updates while offline, and being able to at least view past states. As far as KMA is concerned I think this mostly means you should be able to at least reset its state.

But there might be more sophistication here whereby KMA allows you to go back to a certain point in time and fork from there or whatever. There's also some consideration here about to compare and resolve conflicts that's related to the concept of forking.

RTC common events

The Athens desktop app is event-driven, and as part of RTC we've been syncing some of these events with the backend. Your proposal that KMA could be the storage mechanism for *KM apps makes me think that each app backend would mostly handle communication, ref parsing, and integrity checks while KMA would be the defacto storage. I think this is an interesting idea.

How I've been thinking on a proto-KMA

I've been greatly inspired in The Web After Tomorrow when thinking about how these systems should work.

In my head, and to some extent encoded in the direction of Athens' RTC, I model this as a stream of atomic update events. I landed on this mental model because I care a lot about the temporality and synchronization characteristics of a KMA. Event-sourcing provides a good model for both of these things.

I also think the locality of changes is important when determining the atomic events, because a client should be able to effect changes without knowing the totality of the information. I understand this sounds unrealistic, but I believe it's essential for large data sets. This also circles back to synchronization, because no client is up to date - everyone is at least out of data by their latency.

Synchronization also leads to me value the ability to optimistically determine the effect of an update from its information. Optimistic updates are not just a matter of compensating for latency, but also a matter of offline support if you consider that offline is infinite latency. If you don't have access to the source-of-truth KMA you shouldn't be locked out of using your db. And you shouldn't need a full-fledged app with all available data to effect changes to your db either - simple mobile apps and API calls come to mind.

I think the foremost operation that is not local is renaming a page. This stems from the fact that a page can be addressed by its name/content. So updating the page name should change the string content of things that refer to it. Maybe this should be addressed by actually having a source-of-truth that can effect this change everywhere, or maybe we can just treat this as another can of possible dangling/mismatched refs. But either way I think stop-the-world changes should not be the norm. At worst they are a special case, and at best they don't exist at all.

One topic that I've mostly avoided is format. I mostly see format as a matter of content negotiation instead of canonicity. I think KMA should specify information and negotiate format. This negotiation should happen for both input events and for output views.

I'm not very confident about static views right now. I know there must be a way to get data out of KMA that isn't just "here's 10M events, make me a db". KMA must allow extracting local and scoped data without full history, and it must allow clients to keep only that subset of data up to date. Maybe this can be framed as a diff between static information states, or maybe it's always a full or partial replay, or maybe it's just static view polling. I don't quite know. But I think all the information needed to figure it out is in the updates.

So if I was putting down the main bits of KMA right now it'd look something like this:

KMA is an append-only log of atomic update events
you can query KMA for a subset of these events
atomic events change the information of a single block
known information for a block right now is:
- uid
- string content
- refs
- children (as refs and some way to order them)
block information is extensible
KMA clients are responsible for interpreting events and creating atomic update events

Which, looking at this list, is a pretty anemic model that kinda just says "there's a sequence of updates, you can't do macro stuff, and you need a client to make sense of it". But I don't think it's incorrect.

athensresearch / athens