braid-org / braid-spec

Working area for Braid extensions to HTTP
https://braid.org

Where to specify a data structure's merge-type schema? #11

Closed toomim closed 3 years ago

toomim commented 4 years ago

Duane writes:

Something that I don't currently understand about Braid is where one would be able to choose a way of merging data on a more fine-grained level that corresponds to intentions.

For example, suppose I have two situations in my application that each involve merging strings:

  1. A collaborative textarea where customer service reps take notes.
  2. An "address" field where customer service reps enter the customer's address.

These two fields' merge intentions are quite different. In the first case, it makes sense to keep all edits, inserting divergent text so that the result includes all text. In the second case, it would be confusing if the address field resulted in a sort of "hybrid" address with a street name from one rep's edit and a house number from another rep's edit. It would be better to have a "last write wins" merge here.

My understanding of sync9, or any other merge algorithm, is that it chooses one or the other merge intention for us--there is no way to specify that one field should behave in the first way, and another field should behave in the second way. Is that correct?

If so, where would this intention-mapping behavior belong?

Thank you, this is a pretty glaring omission in the current spec! In the last spec, we suggested that a programmer could specify a schema of different merge types, but we haven't gotten to defining the precise syntax for this yet.

I think this is important and will be working on it. It's possible that we'll want to encode the merge-type of a JSON value inside the Linked-JSON spec, so that you could do something like {"merge_type": "lww", "val": "P. Sherman, 42 Wallaby Way, Sydney"}. We'd then have to add "merge_type" as a special keyword that needs to be escaped, like "link".

Also, a better example for last-write-wins might be a UUID or a hash address, rather than a physical address. It's much harder to imagine two people wanting to edit a hash address and merge their edits than a physical mailing address.
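As a rough illustration of the idea, here is a minimal Python sketch of merging driven by a per-value merge type. The `merge_type`/`val` wrapper follows the example above; the `merge` function, the `"keep_all"` type name, and the `(timestamp, value)` representation of an edit are assumptions made for this sketch and are not defined by the spec.

```python
# Hypothetical sketch: dispatch on a per-value merge type.
# Each concurrent edit is a (timestamp, value) pair; all names are illustrative.

def merge_lww(edits):
    """Last write wins: keep the value with the greatest timestamp."""
    return max(edits, key=lambda e: e[0])[1]

def merge_keep_all(edits):
    """Keep every divergent edit in timestamp order (a crude stand-in for a text CRDT)."""
    return "\n".join(value for _, value in sorted(edits, key=lambda e: e[0]))

MERGERS = {"lww": merge_lww, "keep_all": merge_keep_all}

def merge(field, edits):
    """Merge concurrent edits to a field shaped like {"merge_type": ..., "val": ...}."""
    merged = MERGERS[field["merge_type"]](edits)
    return {"merge_type": field["merge_type"], "val": merged}

address = {"merge_type": "lww", "val": ""}
notes = {"merge_type": "keep_all", "val": ""}

merged_address = merge(address, [(1, "42 Wallaby Way"), (2, "7 Harbour St")])
merged_notes = merge(notes, [(1, "Called at 9am."), (2, "Left voicemail.")])
```

The address ends up as exactly one rep's value, while the notes keep both reps' text.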

mitar commented 4 years ago

Yes, I brought up this in the past as well on the mailing list.

I think this is a big chunk on its own, and we should be careful not to redo existing work. For complex data structures, it might be more useful to piggy-back on existing schema languages, like JSON Schema or JSON-LD, rather than create our own.

mitar commented 4 years ago

My suggestion would be that we specify that if there is an HTTP header which provides a synchronization type, then that type applies to the whole structure. If you want something else, then this should be provided through other means (like inline metadata, Linked JSON with an extension, whatever), but not through the HTTP header. Of course, those other means can use synchronization type names when constructing the more complicated description.

toomim commented 4 years ago

Yes, I'm currently thinking that we could add metadata to Linked JSON that says "this subtree of JSON has Synchronization Type X."

mitar commented 4 years ago

I do not really like Linked JSON, so I do not really care. I think you will eventually end up with JSON-LD, just not JSON-LD. :-)

mitar commented 4 years ago

Do you want to do this for this iteration or later? If later, add later label to this issue.

toomim commented 4 years ago

I think I want it in this one. I think this has a use-case that is different than what you're thinking with JSON-LD.

toomim commented 4 years ago

I also know that a lot of people hate JSON-LD because it has RDF in it, and people view that as poison and avoid everything RDF. And I kinda agree that we absolutely don't need RDF. We don't even need a schema. We just need links, and the ability to specify the merge-types of resources and ranges within sub-resources.

toomim commented 4 years ago

I talked on the phone with @brynbellomy a while ago and came up with an idea:

Bryn and I are thinking of doing this in the NelSON spec. You can think of NelSON as a HyperMedia version of JSON — something Ted Nelson would approve of. It lets you specify hypermedia datatypes, including links and URLs, Content-Types, and Merge-Types, within any slice of JSON.

So for instance, if you wanted to specify that a bank account balance should merge as a counter, you might use JSON like this:

{
   "account": "checking",
   "balance": {
      "Content-Type": "application/nelson",
      "Merge-Type": "counter",
      "body": 3
   }
}

The field Content-Type makes an object special, and needs to be escaped as _Content-Type if you actually want a field to be named "Content-Type" within a JSON object.
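A small Python sketch of how that escaping rule could work in practice. The single-underscore escape is from the paragraph above; extending it to keys that already start with underscores (so that encoding stays reversible) is an assumption of this sketch, as are all function names.

```python
# Hypothetical sketch of the escaping rule. An object with a "Content-Type"
# key is a typed wrapper; a plain field of that name is stored as "_Content-Type".
# Assumption: keys like "_Content-Type" gain one more underscore so that
# escaping can always be undone.

SPECIAL = "Content-Type"

def is_typed(obj):
    """True if obj is a wrapper object rather than plain data."""
    return isinstance(obj, dict) and SPECIAL in obj

def _escape(key):
    return "_" + key if key.lstrip("_") == SPECIAL else key

def _unescape(key):
    return key[1:] if key.startswith("_") and key.lstrip("_") == SPECIAL else key

def escape_keys(obj):
    """Encode plain application data so that no ordinary field looks special."""
    if isinstance(obj, dict):
        return {_escape(k): escape_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [escape_keys(v) for v in obj]
    return obj

def unescape_keys(obj):
    """Reverse escape_keys."""
    if isinstance(obj, dict):
        return {_unescape(k): unescape_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [unescape_keys(v) for v in obj]
    return obj

plain = {"Content-Type": "x", "_Content-Type": "y", "n": [{"Content-Type": 1}]}
encoded = escape_keys(plain)
```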

You could also use this scheme to define custom internal datatypes for your application. For instance, it's nice to have a special link datatype, and a date datatype in an app. That way, when you receive some JSON with a date encoded in a string, it can automatically be turned into a Javascript Date object for you. Similarly, if you receive JSON with a Link in a string, it can automatically convert between relative and absolute URLs for you. You could define these datatypes in your JSON like this:

{
   "username": "duane-the-rock",
   "profile_pic": {"Content-Type": "link", "body": "/img/duane.jpg"},
   "last_updated": {"Content-Type": "date", "body": 1574892803398}
}

You could also imagine creating arbitrary custom datatypes this way, like "Photoshop Layer", all backed by JSON.

The idea is that any Content-Type that starts with application/*, audio/*, font/*, image/*, text/*, etc. refers to a standard IANA media type, but all other strings — e.g. "date" or "link" — are open for an application developer to define as a custom datatype internal to their own application.
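To show how a client might consume such wrappers, here is a hedged Python sketch that decodes the "date" and "link" types from the example above. The `decode` function, the base URL, and the treatment of the date body as milliseconds since the Unix epoch are assumptions; the prefix list copies only the types named above, which end with "etc.", so it is not exhaustive.

```python
from datetime import datetime, timezone
from urllib.parse import urljoin

# Prefixes named in the text; not a complete list of IANA top-level types.
IANA_PREFIXES = ("application/", "audio/", "font/", "image/", "text/")

def is_iana_type(content_type):
    """True if the type looks like a standard media type rather than an app-defined one."""
    return content_type.startswith(IANA_PREFIXES)

def decode(value, base_url):
    """Recursively replace app-defined "date" and "link" wrappers with native values."""
    if isinstance(value, dict) and "Content-Type" in value:
        ctype, body = value["Content-Type"], value.get("body")
        if ctype == "date":
            # Assumption: body is milliseconds since the Unix epoch.
            return datetime.fromtimestamp(body / 1000, tz=timezone.utc)
        if ctype == "link":
            return urljoin(base_url, body)  # relative URL -> absolute URL
        return value  # standard or unknown type: leave untouched
    if isinstance(value, dict):
        return {k: decode(v, base_url) for k, v in value.items()}
    if isinstance(value, list):
        return [decode(v, base_url) for v in value]
    return value

profile = {
    "username": "duane-the-rock",
    "profile_pic": {"Content-Type": "link", "body": "/img/duane.jpg"},
    "last_updated": {"Content-Type": "date", "body": 1574892803398},
}
decoded = decode(profile, "https://example.com/user/duane-the-rock")
```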

Finally, you could implement transclusion with a link, doing something like this:

{
   "message": "Hi mom!",
   "author_image": {
      "Content-Type": "link",
      "body": "/user/duane-the-rock",
      "Transclude": true,
      "Content-Slice": ".image",
      "Version": "6uzinzko4hk"      // You can even specify a particular version to transclude
   }
}
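A possible reading of that transclusion example, sketched in Python. The in-memory `STORE`, the contents of the linked resource, and the interpretation of "Content-Slice" as a dotted path are all assumptions made for illustration; the sketch also assumes every transcluding link carries a "Version".

```python
# Hypothetical store mapping (url, version) -> JSON value.
STORE = {
    ("/user/duane-the-rock", "6uzinzko4hk"): {"name": "Duane", "image": "/img/duane.jpg"},
}

def slice_value(value, content_slice):
    """Follow a dotted path such as ".image" (assumed slice syntax)."""
    for key in filter(None, content_slice.split(".")):
        value = value[key]
    return value

def transclude(doc, store):
    """Replace each transcluding link with the slice of the resource it points to."""
    if isinstance(doc, dict):
        if doc.get("Content-Type") == "link" and doc.get("Transclude"):
            target = store[(doc["body"], doc["Version"])]
            return slice_value(target, doc.get("Content-Slice", ""))
        return {k: transclude(v, store) for k, v in doc.items()}
    if isinstance(doc, list):
        return [transclude(v, store) for v in doc]
    return doc

message = {
    "message": "Hi mom!",
    "author_image": {
        "Content-Type": "link",
        "body": "/user/duane-the-rock",
        "Transclude": True,
        "Content-Slice": ".image",
        "Version": "6uzinzko4hk",
    },
}
expanded = transclude(message, STORE)
```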

What do y'all think? (And thanks @canadaduane for articulating the problem statement.)

pkulchenko commented 3 years ago

@toomim, @brynbellomy, has there been any further work on this? It looks like NelSON spec has been replaced by the linkedJSON spec, but I don't see anything about (embedded) Merge-Type proposal in that spec.

toomim commented 3 years ago

Thanks for the question! I know it's been confusing. Let me summarize my current understanding, and then let's look at whether there's anything missing for you.

The current spec allows you to specify a Merge-Type on any resource:

Request:
   GET /foo

Response:
   HTTP/1.1 200 OK
   Merge-Type: automerge

   {"foo" : 3}

This is good, but you also might want to specify a different Merge-Type on an internal piece of the resource, like maybe say that the number 3 is a counter.

You can do that with Linked JSON, by making that internal piece its own resource, and then linking to it from the outer resource:

Request:
   GET /foo

Response:
   HTTP/1.1 200 OK
   Merge-Type: automerge

   {"foo" : {"link": "/bar"}}

Request:
   GET /bar

Response:
   HTTP/1.1 200 OK
   Merge-Type: counter

   3

This solves the problem by defining two resources, and giving each a Merge-Type.
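From a client's point of view, the two-resource layout could be consumed roughly like this. The Python below replaces real HTTP with an in-memory `RESOURCES` table; `get`, `resolve`, and the rule "a dict with a string `link` field is a link" are assumptions of the sketch, and it does not guard against link cycles.

```python
# Hypothetical in-memory stand-in for the two HTTP resources above.
RESOURCES = {
    "/foo": {"Merge-Type": "automerge", "body": {"foo": {"link": "/bar"}}},
    "/bar": {"Merge-Type": "counter", "body": 3},
}

def get(url):
    """Stand-in for an HTTP GET; returns (headers, body)."""
    resource = RESOURCES[url]
    return {"Merge-Type": resource["Merge-Type"]}, resource["body"]

def expand(value, merge_types):
    """Follow links inside a JSON value, recording each resource's Merge-Type."""
    if isinstance(value, dict):
        if isinstance(value.get("link"), str):
            return resolve(value["link"], merge_types)
        return {k: expand(v, merge_types) for k, v in value.items()}
    if isinstance(value, list):
        return [expand(v, merge_types) for v in value]
    return value

def resolve(url, merge_types):
    """Fetch a resource and inline everything it links to."""
    headers, body = get(url)
    merge_types[url] = headers["Merge-Type"]
    return expand(body, merge_types)

merge_types = {}
document = resolve("/foo", merge_types)
```

After resolving, the client holds the combined value plus a map from each URL to the merge type that governs it.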

In general, Linked JSON provides the complete power of hypermedia resources (arbitrary content-types, merge-types, URLs, subscriptions) within JSON data structures. However, it does require the programmer to define more hypermedia resources in order to get those features.

Perhaps the most obvious problem with the above example is that a single Request/Response has been split into two! However, we can actually get both resources in a single request using Range Queries, which combine with Linked JSON to give programmers the power of GraphQL, but directly within HTTP. (I think this is really cool on its own, btw!)

But you might also be annoyed by the need to define a new resource at all, with its own URI. Why should you have to come up with a whole entire URI for the number 3? To solve this, you might want to have something like an anonymous link:

Request:
   GET /foo

Response:
   HTTP/1.1 200 OK
   Merge-Type: automerge

   {"foo": {
      "link": null,
      "Merge-Type": "counter",
      "body": 3 }}

This lets us specify an anonymous hypermedia resource, with all its power, with a special JSON encoding that exists only within JSON.
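One way to interpret such a document is to let each anonymous link override the merge type inherited from the enclosing resource. The Python sketch below does that; the `collect` function, the dotted path notation, and the inheritance rule are assumptions, since the meaning of a `null` link is only a proposal here.

```python
def is_anonymous(value):
    """True for an object whose "link" field is present and null."""
    return isinstance(value, dict) and "link" in value and value["link"] is None

def collect(value, inherited, path="", out=None):
    """Map each leaf path to the merge type that governs it."""
    out = {} if out is None else out
    if is_anonymous(value):
        # The anonymous resource's own Merge-Type overrides the inherited one.
        collect(value["body"], value.get("Merge-Type", inherited), path, out)
    elif isinstance(value, dict):
        for key, child in value.items():
            collect(child, inherited, f"{path}.{key}", out)
    else:
        out[path or "."] = inherited  # scalars and lists are treated as leaves
    return out

body = {
    "foo": {"link": None, "Merge-Type": "counter", "body": 3},
    "title": "Q3 totals",
}
# "automerge" plays the role of the Merge-Type header on the outer response.
types_by_path = collect(body, "automerge")
```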

This last feature is what we tried to define in the NelSON spec. However, I didn't see much interest in it (e.g. nobody responded to my comment above), and I never felt completely confident in the syntax. I do feel confident in the basic features of Linked JSON, so last November I rescinded the proposal for NelSON and replaced it with the original, simpler, clearer Linked JSON specification. I was concerned that the design decisions involved in fully embedding hypermedia resources into JSON would be hard to find consensus on, and that we don't have any implementations that even need these features yet. But we do have implementations that need Linked JSON, and we need to find consensus on a way to represent links in JSON right now.

I also want to point out that Linked JSON is explicitly forward-compatible with the features we might want for NelSON, by allowing metadata to be specified on each link:

1.3.  Metadata

   Links MAY specify metadata on other fields of the JSON object.

   For instance, a "version" field could be used to specify a specific
   version to link to:

     {
       "message": "Hey guys! I just published a new draft!",
       "attachment": {
         "link": "/books/the-way-things-work",
         "version": "4.0.5"
       }
     }

   The metadata in [RFC8288] could also be expressed.

   Or a GraphQL [GRAPHQL] query:

     { "link": "/foo", "range": "(bar:9)[3,4]" }

This allows one to specify NelSON-type metadata on top of links if one so desires, but it doesn't attempt to declare a standard way of doing so, and doesn't define what happens when a link has a value of null.

Thus, Linked JSON is forward-compatible with any potential definition of NelSON that we might come up with, but it's not trying to define that just yet. If you're interested in doing the work to standardize such a feature, then please say so, and let's re-open this issue and get to work!

pkulchenko commented 3 years ago

@toomim, thank you for the detailed response!

Both of the options you suggested (a response with a Merge-Type and a response with a link to a different resource with its own Merge-Type) should work.

> However, we can actually get both resources in a single request using Range Queries, which combine with Linked JSON to give programmers the power of GraphQL, but directly within HTTP. (I think this is really cool on its own, btw!)

I get an empty page for the Range Queries link (https://wiki.invisible.college/braid/protocol#range-requests-emulate-graphql); is the URL correct? I was interested to check how your proposal may eliminate the need to send multiple requests...

The anonymous link proposal is an interesting one as well and may work better for my use case. I'm envisioning a case where instead of updating individual fields a user may update something like a DB record, which is likely to combine individual fields of different types (with different Merge-Type algorithms to resolve conflicts). These can definitely be updated with multiple requests or a single request with multiple patch elements, but they can also be updated with one patch that includes a JSON element that encodes different Merge-Type values (using the anonymous link syntax).

Having written it all out though, I'm now thinking that the multi-patch mechanism may be a better one, as it already provides a way to encode multiple fragments with their own HTTP headers. So, in this case I'd encode all the elements that require one merge algorithm as one patch, and then would have an individual patch for each element that requires a separate merge type, right?

toomim commented 3 years ago

> I get an empty page for the Range Queries link; is the URL correct?

Ah sorry. It works for me, but here's a copy for you: https://braid.org/protocol/range-queries

toomim commented 3 years ago

Could you give a concrete example (e.g. with some example code or data structures) to help me understand your use case and what options you are considering to meet it? I don't fully understand the distinction you are articulating between updating a "DB record" vs. "individual fields", nor what you mean by "the multi-patch mechanism", and how that relates to merge-types.

pkulchenko commented 3 years ago

I was thinking about a case where several fields may need to be updated together, either due to a constraint (or a cross-dependency) or as the result of a batch process, so they would need to be sent as one message (although not necessarily as one patch). After reading the discussions on the Braid mailing list, I don't think it's going to be an issue, so I'm going to withdraw my question about handling embedded merge-types in one JSON message.

What I'm curious about now is whether the semantics of the following cases are the same (for the receiving node):

  1. One request with one patch with one JSON payload with two fields updated
  2. One request with two patches with two JSON payloads updating individual fields
  3. Two requests with one patch in each updating individual fields

I'm not sure if all three are the same or if 1 (and possibly 2) would be handled differently from 3 given that there may be a gap between messages.

toomim commented 3 years ago

Ah, it sounds like your underlying question is "What is the unit of atomic update?" Is that correct?

The unit of atomic update is the version. It is not the request, or a patch, or a JSON payload. The number of requests, patches, and JSON payloads does not matter. Thus, I cannot tell you whether the semantics of (1), (2) and (3) above are the same or not, because you have not specified the version headers for each example.
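To make this concrete, here is a small Python sketch of a receiver that groups incoming changes by version and ignores how they were packaged. The patch shape (`{"version": ..., "fields": {...}}`) and the function name are assumptions for illustration, not the wire format.

```python
from collections import defaultdict

def group_by_version(requests):
    """Collect field updates per version, regardless of request or patch boundaries.

    `requests` is a list of requests; each request is a list of patches.
    """
    by_version = defaultdict(dict)
    for request in requests:
        for patch in request:
            by_version[patch["version"]].update(patch["fields"])
    return dict(by_version)

# The three packagings from the question, all tagged with the same version.
one_patch = [[{"version": "43", "fields": {"a": 1, "b": 2}}]]
two_patches = [[{"version": "43", "fields": {"a": 1}},
                {"version": "43", "fields": {"b": 2}}]]
two_requests = [[{"version": "43", "fields": {"a": 1}}],
                [{"version": "43", "fields": {"b": 2}}]]

# The same two requests, but each with its own version: two separate updates.
separate_versions = [[{"version": "43", "fields": {"a": 1}}],
                     [{"version": "44", "fields": {"b": 2}}]]
```

With a shared version, all three packagings collapse into one atomic update; with distinct versions, the same bytes describe two updates.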

To illustrate, let's return to your original case of having several things that must update together. Perhaps we have two resources, /foo and /bar:

Request:
   GET /foo

Response:
   HTTP/1.1 200 OK
   Version: "42"

   3

Request:
   GET /bar

Response:
   HTTP/1.1 200 OK
   Version: "42"

   6

The value of /foo is 3, and /bar is 6. But both of them have the same version "42". An application might give them the same version to say that they updated at the same time.

Now let's imagine that you change /foo to 4:

Request:
   PUT /foo

   4

Response:
   HTTP/1.1 200 OK
   Version: "43"

What might we see if we GET /bar now? Let's find out:

Request:
   GET /bar

Response:
   HTTP/1.1 200 OK
   Version: "43"

   8

Oh, look at that! /bar has now updated to 8, and also has version "43"! This application is expressing the semantics that /foo and /bar are linked together, and updated at the same time atomically. All we needed to do was update the version header for both resources to the same value. That means that those values exist at the same point in time.
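A toy Python server that behaves like the exchange above might look as follows. The rule that /bar is twice /foo is inferred from the example values (3 and 6, then 4 and 8) and is an assumption, as are the class and method names; the sketch is single-threaded and does no locking.

```python
class LinkedStore:
    """Toy server where /foo and /bar share one version timeline."""

    def __init__(self):
        self.version = 42
        self.values = {"/foo": 3, "/bar": 6}

    def get(self, url):
        # Every resource reports the single shared version.
        return {"Version": str(self.version)}, self.values[url]

    def put_foo(self, value):
        # Compute the whole new state first, then publish it under one new version.
        self.values = {"/foo": value, "/bar": 2 * value}  # assumed rule: bar = 2 * foo
        self.version += 1
        return {"Version": str(self.version)}

store = LinkedStore()
before = store.get("/bar")
put_response = store.put_foo(4)
after = store.get("/bar")
```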

Now, just because you can represent multiple resources this way, with linked versions, doesn't mean that you have to; and just because you find two resources in the wild that happen to have the same version IDs every once in a while, it doesn't mean that the programmers necessarily intended them to express an atomic update. It's up to each application designer to choose whether they need atomic updates to be representable across resources, and to do so if they wish.

In the future, I anticipate that we might want to standardize some method of distinguishing which resources share timelines, so that we can programmatically determine which ones are supposed to have atomically-linked updates. I think this will mostly be important when validating requests. For instance, if you want to do a bank account transaction across bank accounts on two different servers, you might want to debit one account from one server and credit the account on another server, and you might only want to validate either transaction if the corresponding transaction exists on the other server, timestamped with the same version ID. That validation function will want to wait until both transactions exist before it accepts either transaction as valid.
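Such a validation function could be sketched in Python like this. Each server is modeled as a plain dict from version ID to a signed amount, and the check that the two amounts cancel out is an added assumption; none of these names come from the spec.

```python
def validate(version, debit_server, credit_server):
    """Accept a transfer only once both halves exist under the same version ID."""
    debit = debit_server.get(version)
    credit = credit_server.get(version)
    if debit is None or credit is None:
        return "pending"  # keep waiting for the matching transaction
    # Assumption: a valid transfer debits exactly what it credits.
    return "valid" if debit + credit == 0 else "invalid"

server_a = {"v7": -50}   # debit recorded on the first server
server_b = {}            # credit has not arrived yet

status_before = validate("v7", server_a, server_b)
server_b["v7"] = 50      # the matching credit arrives with the same version ID
status_after = validate("v7", server_a, server_b)
```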

Does this address your question?

pkulchenko commented 3 years ago

Yes, this answers my question; thank you for the detailed examples! The bank account transaction example is an interesting one and is close to the one I had in mind.

Overall, I think this approach makes total sense: the receiving peer can't rely on patches arriving in a single request, especially since they may be repackaged for further distribution, so using the version number to track the relationship should work.