`Client` header to prevent echoes from server

toomim commented 3 years ago

If a client does:

GET /foo subscribe
PUT /foo "hello"

Then a naive server might send the client's PUT (2) back to it, because it subscribed in (1). We call these duplicate versions "echoes" from the server.

In addition to using unnecessary bandwidth, echoes cause problems for any client without a good CRDT/OT:

If the PUT sent a patch, it will get applied a second time when it echoes back. So if the user types "d", she might see "dd" in her text editor.
If the PUT sent a full body, it will clobber any subsequent edits when it echoes back. So the user might type "dogs", but then the "d" and the "do" will echo back, and she'll see her text edits unwind. If this happens when she types "s", the client might send "dos", which would lose the "g" forever.

Thus, it's nicer if we can prevent echoes from happening.

A simple solution is for the client to specify a Client: "c82ysjd3" header on every request, and then the server can refrain from sending any version back to the same client that created it.

I've implemented this in my client/server, but we haven't discussed it at all in the spec. I think we should add this Client header to the braid-http spec.

toomim commented 3 years ago

I'm not 100% sure if Client: is the right header to use, though. Here are a couple doubts I have:

Perhaps we could re-use cookies in some fancy way, rather than creating a new header.
Or perhaps we want to track connections rather than clients.

Consider that if a client has two tabs open, she'll probably want edits to propagate from one tab over to the other through the server. So she probably wants to limit echoes just per tab, and since she has one connection per tab, she could limit them per connection.

However, in the long run, it's even better if all tabs in a browser can share a single connection. (We have been implementing this in the lab, using cross-tab postMessage() and/or serviceWorkers.) Then each browser will need only one connection to each server, no matter how many tabs it has open. In this case, we do want block echoes per-client.

I'm starting to think that we should just define a new concept here to mean exactly what we want: A Peer could be defined as a computer somewhere with its own memory that can send and receive updates. Then each tab can be its own "peer" if it creates its own connection, but an entire browser could also be a peer if aggregates all updates across its tabs on its own.

Giving each peer an ID is also useful when we move to P2P networking (in the much longer-term future). There are a number of use-cases where you want each peer to have an ID. So it might be appropriate to start with a Peer: header right now, because it solves this immediate problem of echoes, and also a number of problems we seem to be running into in the future.

katuma commented 3 years ago

Servers can be a cluster that are state-oblivious to each other, except for the actual document replication. Meaning client is subscribed to server1 and posts PUTs on server2. The servers will sync up on the document via other means, yet the client will see the echo regardless. Same thing in P2P.

One has to bake the client identity in the messages themselves, and there's already way to do that via client-specified Version:. Potential downside of that is that clients need to invent their custom encoding in there, potential upside is that they can publish anonymously if they use fancy commitment cryptography to do that.

mitar commented 3 years ago

My suggestion would be that:

Subscriptions do not send data, but only URLs of new versions of data available. Client then fetches those using regular GET to obtain versions it might not yet have. If it already has it (because it originated the change in the first place, it does not have to fetch anything).
Response of PUT is URL of the new version of data available (as Location header). Again, the client can decide to fetch it or not.

I think this aligns well with HTTP2 then: server can per-emptively push the URL to the client, too. And client can then abort the push if it is unnecessary.

I think it also aligns well with caching. A browser might have those versions available in cache already, so syncing between tabs would not really consume bandwidth multiple times, only once. But semantics exposed to the client (code inside a tab) is simple and does not have to deal with deduplication across tabs, the browser does.

I would suggest we do not have a header like Client because this makes everything stateful. Otherwise just subscriptions are stateful, no?

katuma commented 3 years ago

@mitar The offer/fetch doubles round trips. Header for size threshold?

mitar commented 3 years ago

The offer/fetch doubles round trips.

Not with HTTP2, but that is abstracted away from the client.

katuma commented 3 years ago

@mitar It does regardless of protocol.

S -> C (I have)
C -> S (Give me)

vs

S -> C (here you go whether you wan't it or not)

mitar commented 3 years ago

Please read:

katuma commented 3 years ago

@mitar This has nothing to do with http. The client needs to know first what to request in the first place, hence introducing the latency.

mitar commented 3 years ago

The client needs to know first what to request in the first place, hence introducing the latency.

With HTTP2 the server can proactively push the second response together with the first response. See vulcan which uses that effectively.

toomim commented 3 years ago

@katuma wrote:

Servers can be a cluster that are state-oblivious to each other, except for the actual document replication. Meaning client is subscribed to server1 and posts PUTs on server2. The servers will sync up on the document via other means, yet the client will see the echo regardless. Same thing in P2P.

One has to bake the client identity in the messages themselves, and there's already way to do that via client-specified Version:. Potential downside of that is that clients need to invent their custom encoding in there, potential upside is that they can publish anonymously if they use fancy commitment cryptography to do that.

Oh, you're referring to the case where a client or peer implements a version history and thus can deduce that an incoming version is one it has already seen. But the motivating use-case of this spec is for clients without a full version history— that is what I meant in the OP when I said this is for clients "without a good CRDT/OT." Perhaps it would have been clearer to say "without versioned messages and a robust version history."

This header's motivation is for implementing simple apps, like if you have an existing web app and want to add Braid support just for Subscriptions (see slide 3 here: https://braid.news/files/http2p2p.pdf), but haven't implemented full support for Mutations yet.

Perhaps it would be most clear to say that this peer: or client: header is useful in these situations:

A simpler client/server resource
That hasn't implemented full versioning
Or even if it has, wants to reduce bandwidth by not sending duplicate versions back to the client

When I said the peer header could be useful for some P2P situations, I was only speculating. We have a prototype of P2P acknowledgements (slide ~14 above) that gives each peer an ID, but there's no specification for it yet.

Now to address @mitar's suggestion:

My suggestion would be that:

Subscriptions do not send data, but only URLs of new versions of data available. Client then fetches those using regular GET to obtain versions it might not yet have. If it already has it (because it originated the change in the first place, it does not have to fetch anything).

Response of PUT is URL of the new version of data available (as Location header). Again, the client can decide to fetch it or not.

I think this aligns well with HTTP2 then: server can per-emptively push the URL to the client, too. And client can then abort the push if it is unnecessary.

This suggestion doesn't prevent the echoes, it just wraps them up in HTTP/2 server-push semantics, and that doesn't address the goals stated in the OP: (1) to reduce bandwidth, and (2) simplify the implementation of clients. Server-push uses more bandwidth (since the server optimistically sends stuff), and is complex to implement.

I would suggest we do not have a header like Client because this makes everything stateful. Otherwise just subscriptions are stateful, no?

Subscriptions are already stateful in Braid. The server has to remember the state of each client subscription in order to send them the right stuff. This is an intrinsic tradeoff in synchronization.

In general, Braid changes these dimensions of HTTP:

Stateless: Braid introduces enough state for synchronization. Servers remember subscriptions. Peers remember some history.
Client/Server: Braid introduces peer-to-peer networking
Request/Response: Braid introduces generalized p2p messaging (see e.g. my concluding talk at https://braid.news/meeting-2)

If you want HTTP to remain stateless, then you can't have subscriptions at all. What we can do in Braid is constrain that state to as little as possible.

mitar commented 3 years ago

This suggestion doesn't prevent the echoes, it just wraps them up in HTTP/2 server-push semantics, and that doesn't address the goals stated in the OP: (1) to reduce bandwidth, and (2) simplify the implementation of clients. Server-push uses more bandwidth (since the server optimistically sends stuff), and is complex to implement.

What do you mean? Browsers already have it implemented. For apps on top of browsers it is completely abstracted out. They just simply do GET /url/from/subscription and magically immediately get the response.

to reduce bandwidth

From my experience with Meteor and ecosystem there, I must warn you that you will never ever be able to have only one setting here. People will always have slightly different use cases here where they will want to optimize slightly different things. For example, Meteor syncs state based on top-level field equality. So if anything changes deeper inside, the whole top-level field is modified. For this, it has to keep state based on top-level fields on the server. This reduces some bandwidth use, but not all. Now, I have seen many users wanting a) full diffing, deep fields, to minimize bandwidth usage b) no diffing and always sending whole objects, to minimize state usage on the server. For some cases, where you have large and deeply nested objects a) makes more sense. For cases (like chat apps) where you have small objects but many of them, b) makes more sense (you just have to remember which objects you sent, but not their content at all).

So I think we should not pre-decide this. We should have a way that this can be customizable.

Server-push uses more bandwidth

Yes, slightly more because the response header goes over the wire, before the client can cancel it (once it figures out from the header that is already has the response). But that is still much less then the whole response in general. Especially with HTTP2 header compression.

Subscriptions are already stateful in Braid. The server has to remember the state of each client subscription in order to send them the right stuff. This is an intrinsic tradeoff in synchronization.

That's what I wrote. Let's keep only subscriptions stateful. Otherwise whole our hope of using HTTP and caching goes out of the window. And then we could just simply implement the whole thing over web sockets.

So non-subscription Braid requests should be stateless in my opinion.

toomim commented 3 years ago

Server-push uses more bandwidth (since the server optimistically sends stuff), and is complex to implement.

What do you mean? Browsers already have it implemented. For apps on top of browsers it is completely abstracted out. They just simply do GET /url/from/subscription and magically immediately get the response.

It's important to be easy to write new implementations
Chrome has been wanting to remove the implementation because it's too complex
It doesn't solve the bandwidth problem. Servers will still send clients information that they created themselves.

So I think we should not pre-decide this. We should have a way that this can be customizable.

We are not pre-deciding anything. We are proposing an optional spec that people can opt into if they want. Other ways of doing things can be written up as well.

Server-push uses more bandwidth

Yes, slightly more because the response header goes over the wire, before the client can cancel it (once it figures out from the header that is already has the response). But that is still much less then the whole response in general. Especially with HTTP2 header compression.

The response body gets sent too.

Subscriptions are already stateful in Braid. The server has to remember the state of each client subscription in order to send them the right stuff. This is an intrinsic tradeoff in synchronization.

That's what I wrote. Let's keep only subscriptions stateful. Otherwise whole our hope of using HTTP and caching goes out of the window. And then we could just simply implement the whole thing over web sockets.

So non-subscription Braid requests should be stateless in my opinion.

This is already limited to subscriptions. This whole thread is only about subscriptions. Nothing in this proposal affects non-subscription requests.

mitar commented 3 years ago

It's important to be easy to write new implementations

Yea, but they might be slightly sub-optimal. But what is easiest than to:

get a URL from subscribe
ignore early hints header or not use HTTP2 at all
check in local cache if you already have URL
if not, GET it

Chrome has been wanting to remove the implementation because it's too complex

Sure, but 103 Early Hint header can be used as well, not much difference here between that and push here.

The response body gets sent too.

But you can terminate that before it really clears the buffer on the server.

This is already limited to subscriptions. This whole thread is only about subscriptions. Nothing in this proposal affects non-subscription requests.

Oh, maybe this is the source of misunderstanding. I thought we are optimizing not duplicating between subscriptions and all other calls (GET, PUT, etc.). So that if you made GET and have a version of state and then subscribe, that you do no get that state you already have again. Maybe I generalized the problem too much (to "I have state X already, I do not want to get it again; no matter how I got it, by doing GET or because I authored in the first place"). You are saying the issue is only about PUTs and not wanting to get back your own state only.

Even ProseMirror had to have that, this is why you have client ID in there so that it can differentiate between changes coming back which originated by the client itself (it does not have assumptions about the transport, so client's changes might be echoed back from the server).

i would suggest that the semantic of peer ID header is then really that that is the author of the change being PUTed. While in subscribe call we would have another header which allows you to filter which changes you want to not receive, so for which list of peer IDs. I think separating those and making it explicit that subscribe header is simply just for filtering and not to identify the peer is important for P2P setting.

BTW, what are plan to deduplicate changes you are maybe getting over multiple subscribe connections in P2P setting? Maybe this here is just a special type of that? Because you could see it like that you have a loopback connection from you to yourself, and then from you to the server, and this is why you are getting duplicated changes.

toomim commented 3 years ago

Oh, maybe this is the source of misunderstanding. I thought we are optimizing not duplicating between subscriptions and all other calls. ... Maybe I generalized the problem too much (to "I have state X already, I do not want to get it again; no matter how I got it, by doing GET or because I authored in the first place").

Ah, that does sound like the misunderstanding. Thanks for finding it. This is not about the general case, but it would be nice if it weren't incompatible with any solutions we might come up with for the general case.

i would suggest that the semantic of peer ID header is then really that that is the author of the change being PUTed. While in subscribe call we would have another header which allows you to filter which changes you want to not receive, so for which list of peer IDs.

That fits the description of this proposal, which adds a Peer: header to both GET+Subscribe, and PUT requests. The server can use this to skip sending back versions that the Peer sent it.

Since we don't have any P2P specs yet, I think we can put off the discussion of how to make this work perfectly in a P2P mesh network.

BTW, what are plan to deduplicate changes you are maybe getting over multiple subscribe connections in P2P setting?

In general, I suspect it's impossibly to fully deduplicate. Imagine we have this network:

    a      | data is flowing down from a to d
   / \     |
  b   c    |
   \ /     |
    d      V

Data from (a) to (d) gets routed through two paths. I imagine those two paths will both end up reaching (d), and it'll get duplicate data.

I'm not an expert in routing algorithms, and maybe there's a known way to solve this without requiring a central coordinator to plan the network, but it certainly seems like a hard problem, and one that we don't need to take on yet.

So I'm not concerned about deduplicating all network requests in general. One we have complex P2P networks, we'll be using great algorithms with version histories that don't suffer from the two problems I pointed out at the beginning of this issue. The issue we're focusing on here is that dumb clients will see duplicate characters or clobbered text fields. We don't need to solve everything else yet, but we would like any network protocol we're standardizing to be forward compatible with solutions we might come up with for those.

mitar commented 3 years ago

That fits the description of this proposal, which adds a Peer: header to both GET+Subscribe, and PUT requests. The server can use this to skip sending back versions that the Peer sent it.

My proposal is slightly different, that the subscribe does not have Peer header, but something like Skip-Peers header.

So why are we then even trying to optimize this edge case of general deduplication at this time? If our intuition is that it might be hard or impossible to deduplicate on P2P networks at all, so some bandwidth will go to waste, why not simply be OK with some bandwidth going to waste also in this edge case? Is maybe this here now premature optimization?

The issue we're focusing on here is that dumb clients will see duplicate characters

They just have to remember which versions they have already seen? Do they really have to know also peers?

toomim commented 3 years ago

My proposal is slightly different, that the subscribe does not have Peer header, but something like Skip-Peers header.

I think Peer and Skip-Peers are different names for the same thing.

So why are we then even trying to optimize this edge case of general deduplication at this time? If our intuition is that it might be hard or impossible to deduplicate on P2P networks at all, so some bandwidth will go to waste, why not simply be OK with some bandwidth going to waste also in this edge case?

It's not about bandwidth. It's about duplicate characters and clobbered text fields. I've said this twice already. C-f on this page for "clobber". Re-read the OP please.

These are real issues that came up in our implementations. We need a solution. This is not theoretical.

mitar commented 3 years ago

These are real issues that came up in our implementations. We need a solution. This is not theoretical.

Doesn't every patch have an unique version associated with it? I am not sure why would implementation apply same patch for same version twice?

toomim commented 3 years ago

Doesn't every patch have an unique version associated with it? I am not sure why would implementation apply same patch for same version twice?

I already answered this question above. Please C-f for "Oh, you're referring to the case where a client or peer implements a version history".

mitar commented 3 years ago

But there is big difference between "full version history" and "set of seen versions" in how hard is to implement each.

toomim commented 3 years ago

Sure. The specific phrasing I used was "without versioned messages and a robust version history," which I think is accurate.

mitar commented 3 years ago

I think Peer and Skip-Peers are different names for the same thing.

The later is a list. Which can also be an empty list if you do maintain robust version history. :-)

josephg commented 3 years ago

Discussion at meeting: Punt for 04. There are some other issues with OT and this proposal.

mitar commented 2 years ago

I reread this again and I realized that this issue is really just about optimizing that client's own action does not come back to client itself over a subscription.

It is interesting because I see that as a feature and this is maybe because both Meteor and 3 factor app (popularized by Hasura) are in fact using this client-own-update to get information that update is persisted on the client. So you push change to the server and when you get it back you know that the change has been accepted, and you stop waiting. It allows you in fact to go further and when you push the change to the server you forget about it (and do not update UI) until you get it back from the server, only then you update UI. So your code to update UI is always the same: with changes from the server. Client just generates changes to send to the server.

I know this is a very different use case than Braid's where we primarily talk about peers and not central servers which would approve/deny updates. But it is maybe to think that there are use cases where getting these updates back to the client are beneficial.

braid-org / braid-spec

`Client` header to prevent echoes from server #72