a-type commented 1 month ago

After experimenting with segmented public/private namespaces with two different clients, some clear pain points come up:

Having to do lots of things twice. Like having two clients and two branches for CreateX type buttons.
Losing context on where an entity came from (partially helped by namespace field)
Losing context on which namespace an ID has corresponding entity in (what if it's both?)
Query hooks cannot be used in shared components meant to be used in any namespace as they're tied to one client context.

It's too much complexity. Maintaining it would be a pain.

Refocusing on the problem at hand:

I have a shopping list app. I want some lists to be private just to me, and others to be shared among everyone connected to my library.

But for most purposes, when in-app, these things all behave the same. A public item and a private item have the same behavior, except maybe your private list is marked with a label / icon, etc. But it would be nice to have everything come from the same queries.

What that means practically is... I want to prevent data related to certain documents from syncing to any replicas except ones owned by me.

One could imagine an addition to Verdant which can help the server mandate this...

Doc-level approach

When initing a new document, you can optionally include authorization metadata. Let's call this authn and it has a format u:<user_id>:* to leave some room for future expansion.
The server stores this metadata to disk associated with the document in a DocumentMetadata table keyed on document root OID.
Whenever any operations/baselines pass through the server, before rebroadcasting the server looks up any DocumentMetadata associated with the root OID and compares to the receivers. Only matching replicas get the data.

Notes

Looking up metadata for everything that passes through the server will be costly. It would probably need to be cached in memory, and even then it adds a lookup to every message.
"Only matching replicas get the data" easier said than done for broadcasts. Today the library manager just fires the broadcast and forgets. Now there would need to be filtering closer to the edge. Would have to annotate each operation with the metadata individually to be matched against receiver right before send.
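A minimal sketch of the doc-level flow described above, to make the lookup-then-filter step concrete. Every name here (DocumentMetadataStore, canReceive, recipientsFor, the rootOid field) is hypothetical and not part of Verdant. The in-memory Map stands in for the DocumentMetadata table plus its cache, and it assumes user IDs contain no ':' characters.

```typescript
// Hypothetical types; none of these names come from Verdant itself.
type Authn = string; // format from the text above: "u:<user_id>:*"

interface Operation {
  rootOid: string; // assumed to be already resolved for each op
  data: unknown;
}

interface Replica {
  id: string;
  userId: string;
}

// In-memory stand-in for the DocumentMetadata table, keyed on root OID.
class DocumentMetadataStore {
  private byRootOid = new Map<string, Authn>();

  set(rootOid: string, authn: Authn): void {
    this.byRootOid.set(rootOid, authn);
  }

  get(rootOid: string): Authn | undefined {
    return this.byRootOid.get(rootOid);
  }
}

// A document without metadata is treated as visible to everyone.
function canReceive(authn: Authn | undefined, userId: string): boolean {
  if (authn === undefined) return true;
  const [kind, subject] = authn.split(':');
  return kind === 'u' && subject === userId;
}

// Look up the document's metadata, then keep only matching replicas.
function recipientsFor(
  op: Operation,
  replicas: Replica[],
  store: DocumentMetadataStore,
): Replica[] {
  const authn = store.get(op.rootOid);
  return replicas.filter((replica) => canReceive(authn, replica.userId));
}
```

The cost noted above shows up in recipientsFor: every op triggers a store lookup before it can be sent anywhere.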

Op-level approach

When initing a new document, you can optionally include authorization metadata, as before.
When the client works with an entity, it reads this init metadata and reattaches it to each subsequent operation. So EVERY outgoing op for the document has the attached authn metadata.
The server stores everything naively like today and rebroadcasts
Before broadcasting to a particular replica, the ops/baselines are filtered against the attached metadata for authn.

Notes

While adding extra data to every op is not efficient, it does really simplify the logic. And complexity / compute is probably more meaningful here than a small string addition to ops.
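The op-level variant can be sketched roughly as follows. The Operation shape and the function names are assumptions for illustration, not Verdant's real types. The point is that the server needs no lookup table: each op carries what is needed to filter it.

```typescript
// Hypothetical op shape; the authn field holds the "u:<user_id>:*" string.
interface Operation {
  oid: string;
  authn?: string;
  data: unknown;
}

// Client side: copy the document's init metadata onto each outgoing op.
function withAuthn(op: Operation, docAuthn: string | undefined): Operation {
  return docAuthn === undefined ? op : { ...op, authn: docAuthn };
}

// Server side: right before sending to one replica, drop ops it may not see.
function filterForReceiver(ops: Operation[], receiverUserId: string): Operation[] {
  const allowed = `u:${receiverUserId}:*`;
  return ops.filter((op) => op.authn === undefined || op.authn === allowed);
}
```

The filter is a plain string comparison per op, which is the simplification the note above is trading the extra bytes for.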

Problems with both approaches

This would officially create divergent collections between replicas, which kind of invalidates #397, but that was probably not going to work anyway because of rebasing.
Still haven't solved the "duplicate OID" problem. What happens if you've got a private doc and someone makes one with the same ID that's public? With any metadata approach, the histories of both docs will merge.

Addendum... duplicate OIDs

Since I've just remembered this problem, ok, can we adapt?

The OIDs need to be different. Instead of metadata, we could encode the authn directly in the OID, as a new segment.

collection/<root id>::<rand>(::authn|base64)

base 64 encode so that subformatting can be used in the authn string without causing confusion.

Then authorized docs will have identities distinct from other docs, but will still share an identity if created simultaneously from different replicas, like if you create identical things on phone and laptop while offline and then sync up.

Changing permissions (private->public, etc) still requires full document clone to a new identity. But this is probably acceptable.

Because authn is encoded in OID, this necessarily goes to the op-based approach. OIDs will have to be parsed and decoded for filtering before transmitting to other replicas. Not quite as efficient as a simple filter. Efficiency can be clawed back a bit by pre-computing and attaching to the op, so that each individual replica socket connection doesn't have to parse. But this is only O(n) for n connected clients, which likely isn't a large n at any one time.
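A rough sketch of the OID segment idea, assuming the layout given above (collection/<root id>::<rand>, plus an optional ::<base64 authn> segment). The helper names are hypothetical and this is not Verdant's real OID parser. It also assumes the root id and rand never contain "::" and that the authn string is ASCII, since btoa only accepts Latin-1 input.

```typescript
// Hypothetical helpers following the layout sketched in the text above.
const SEGMENT_SEPARATOR = '::';

function encodeOid(
  collection: string,
  rootId: string,
  rand: string,
  authn?: string,
): string {
  const base = `${collection}/${rootId}${SEGMENT_SEPARATOR}${rand}`;
  // Base64 never contains ':', so the extra segment cannot be confused
  // with the separators, even though authn itself uses ':' internally.
  return authn === undefined ? base : `${base}${SEGMENT_SEPARATOR}${btoa(authn)}`;
}

// Returns the decoded authn string, or undefined for an unrestricted OID.
function decodeAuthn(oid: string): string | undefined {
  const parts = oid.split(SEGMENT_SEPARATOR);
  const encoded = parts[2];
  return parts.length === 3 && encoded !== undefined ? atob(encoded) : undefined;
}
```

Decoding once when an op arrives and attaching the result to it would avoid repeating the parse for each connected replica, which is the pre-computation mentioned above.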

a-type commented 1 month ago

The local user problem

When using Verdant local-only, the replica has no determined User ID to utilize for authorization.

If running codepaths which create private documents, what value should be supplied for authorization subject? How does Verdant navigate the transition from local to synced for authorized documents?

Solution 1: special identifier

Since every Verdant library is initialized from 1 replica's data (initial data is never combined from multiple replicas, 1 'wins'), the identity of the original source replica's user can be determined.

When no sync identity is available, Verdant could substitute a special identifier like $originator. When syncing data, any operations or baselines with this authz subject are only synced to replicas belonging to the user who was the source of the library.

Solution 2: rewrite history upon bootstrapping new library on server

Similar to 1, except rather than storing the originator's identity, we rewrite OIDs including the special identifier to use the originator's id instead.

I don't like this as much as it violates history immutability and creates a difference from client to server (original replica will retain old OIDs unless reset).

Follow-up problem: determining user ID when offline even after sync

Even when syncing, presence data may not be available over the network. Not every replica can use the special identifier, as only some of them actually belong to the source user. In other words, once synced, replicas MUST use their own user ID as subject.

We can (and maybe do) store the user ID persistently in local idb, so this should be fine.

Can this be leveraged to provide seamless behavior upgrading from local-only to sync? By default, user ID could be $originator in storage, and only overwritten to a real ID upon library sync. I think that would work, and also require no special logic to insert $originator into authz while local-only.
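The default-then-overwrite idea could look something like this. It is a sketch only: LocalIdentity and subjectMatches are made-up names, and a plain field stands in for the value persisted in idb.

```typescript
const ORIGINATOR = '$originator';

// Client side: the stored user ID starts out as the special identifier,
// so local-only code needs no extra branch when creating private docs.
class LocalIdentity {
  private userId: string = ORIGINATOR; // stand-in for a value persisted in idb

  // Called once the replica syncs and learns its real user ID.
  onSynced(realUserId: string): void {
    this.userId = realUserId;
  }

  subject(): string {
    return this.userId;
  }
}

// Server side (Solution 1): ops still carrying the special identifier are
// matched against the user who originally provided the library.
function subjectMatches(
  subject: string,
  receiverUserId: string,
  libraryOriginatorUserId: string,
): boolean {
  const effective = subject === ORIGINATOR ? libraryOriginatorUserId : subject;
  return effective === receiverUserId;
}
```

Ops written before the first sync keep the $originator subject, while ops written afterwards carry the real ID; under these assumptions both resolve to the same user on the server.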

a-type commented 1 month ago

Problem: querying by ID

If additional data is encoded in the root OID, this breaks expectations about format and querying from userland. Currently a document is found by collection + id, which is formatted into a root OID for lookup. Now the user would also have to specify access as part of the parameters to retrieve a single object.

This breaks / complicates two main usages:

Documents with a well-known ID, like default
Document relationships, where an ID is stored on one document in reference to another

The second one is perhaps the trickiest.

Suppose I have a list item which refers to its parent list. The parent list's ID is list-a and it's a private list. The item can't just do listId: 'list-a', because that won't be specific enough. There could be a list-a that's public, too. The ID alone is no longer a unique identifier.

Approach 1: Embrace 'foreign refs' as a concept

Rather than using string fields to store IDs, create a new field type which acts as a ref to another document and encodes its access in that reference.

Problem: now instead of just using an ID string to set this field, presumably you need a copy of the actual document you want to attach, like

client.items.put({
  listRef: parentList,
  content: ''
});

This might not always be convenient.

Approach 2: Change ID-based queries to match any permissions, and warn users about conflicting IDs at different permission levels

Instead of keying index document lookup on raw OID, strip the permissions before storing the indexes. This lets you continue looking up any doc by just collection + id, even though the resulting doc will come back with a permissioned OID.

This would mean that having a doc ID that conflicts with a different permission level would result in undefined behavior / corrupted docs. The documentation would simply warn about this when using custom ID values. The built-in random ID shouldn't conflict anyway.
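To illustrate Approach 2, here is a small sketch of an index keyed on collection + id with the permissions stripped. ParsedOid and DocumentIndex are hypothetical; OIDs are modeled as already-parsed objects to avoid guessing at the string format. Instead of leaving a conflict as undefined behavior, this version refuses the second document so the caller can warn.

```typescript
// Hypothetical parsed form of a permissioned OID.
interface ParsedOid {
  collection: string;
  id: string;
  authz?: string;
}

// The index key deliberately ignores the authz part.
function indexKey(oid: ParsedOid): string {
  return `${oid.collection}/${oid.id}`;
}

class DocumentIndex {
  private byKey = new Map<string, ParsedOid>();

  // Returns false when the key is already held by a document with
  // different permissions, which is the conflict the docs would warn about.
  add(oid: ParsedOid): boolean {
    const key = indexKey(oid);
    const existing = this.byKey.get(key);
    if (existing !== undefined && existing.authz !== oid.authz) return false;
    this.byKey.set(key, oid);
    return true;
  }

  // Lookup needs only collection + id; the result carries its permissions.
  find(collection: string, id: string): ParsedOid | undefined {
    return this.byKey.get(`${collection}/${id}`);
  }
}
```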

So the rules here would be... if you want to have multiple permission groups which have the same well-known ID, you must prefix/suffix that in some way. This would be kind of hard/impossible in a migration context, but that's pretty edge casey. Maybe there's a way to make that work? Maybe the first replica of each user can generate a 'global random value' to seed these things and sync that to other replicas for that user when syncing for the first time...

TODO: Does changing the index key present any new challenges?

a-type commented 1 month ago

That last point gives me a different idea for the authz subject... each replica could generate a global random value for itself and use this as the subject ID instead of a server-controlled user ID. Then I wouldn't need a special 'originator' value, just use the global random value.

However, this limits the ability of any future authz extension where the user grants permission to specific peers by their server IDs. You'd have to know their global random values.

a-type commented 1 month ago

Full circle...

Ok, so originally I wanted to encode the authz into the OID because of the 'duplicate OID problem' which arises if there's docs of different access levels with the same ID...

and then a few thoughts later I concluded that even with that, you can't allow using the same ID. Which obviates the need to encode in the OID (which created complications for querying / references).

So, ok, back to... add authz metadata to the operations themselves as another key.

No more need to decode OIDs before transmitting. Just read the authz key, if present.

No specialized index key / querying constraints.

So far this seems like the simplest path yet.

a-type commented 1 month ago

Status update.

Implemented with the following:

Authz is attached to every operation / baseline if present for a document
Document notes its initialized authz (based on baseline / latest init) and replicates that to all outgoing ops
Local-only replicas use an 'originator' constant as authz subject
Upon init of new library to server sync, server rewrites 'originator' to userId of the replica providing library
Before sending any messages, server compares authz of all ops/baselines to receiver userId and filters
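The server-side half of the steps listed above might be sketched like this. The names and the op shape are assumptions and not the shipped code; in particular, authz is modeled as a bare subject string, which may not match the real encoding.

```typescript
// Hypothetical constant and op shape for illustration.
const ORIGINATOR_SUBJECT = 'originator';

interface SyncedOperation {
  oid: string;
  authz?: string; // modeled here as a bare subject (user ID or 'originator')
}

// On library init: replace the local-only subject with the userId of the
// replica providing the library.
function rewriteOriginator(
  ops: SyncedOperation[],
  providerUserId: string,
): SyncedOperation[] {
  return ops.map((op) =>
    op.authz === ORIGINATOR_SUBJECT ? { ...op, authz: providerUserId } : op,
  );
}

// Before sending: keep ops with no authz, or whose subject is the receiver.
function filterBeforeSend(
  ops: SyncedOperation[],
  receiverUserId: string,
): SyncedOperation[] {
  return ops.filter((op) => op.authz === undefined || op.authz === receiverUserId);
}
```

In this sketch, an op that still carries 'originator' when it reaches filterBeforeSend matches no receiver, which is the failure mode the edge case below describes.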

Things to do:

[x] Make tests pass
[x] ⚠️ Edge case: new replica inits library, but queues up new authz'd operations while init is inflight. Right now server would not rewrite originator subject. These ops would not be synced to another replica for that user.

a-type commented 1 month ago

More things to work out:

[x] Applying authz to migration operations
[x] Applying authz on delete operations

a-type commented 2 weeks ago

Authz has shipped for controlling private document access. The model theoretically supports specific user authorization and role/group based authorization, to be explored later.

a-type / verdant

Revisit authorization approaches #399
