deepstreamIO / deepstream.io

Client acknowledgement of success/failure #458

Closed rileylark closed 7 years ago

rileylark commented 7 years ago

In https://github.com/deepstreamIO/deepstream.io/issues/195#issuecomment-263007723 @ronag mentions that

Clients are never notified of write success.

For our use case it's important that clients can depend on their writes succeeding. It's fine for us if "succeeding" only means "will definitely be permanently stored at some point" but it's not OK for us to assume writes have succeeded when they may not.

Is deepstream still a match for our case? How hard would it be for us to add behavior like notifying clients when their writes have failed to make it into durable storage?

WolframHempel commented 7 years ago

It's a requirement for all our users and of course something that deepstream takes care of. Generally, Roland is right when saying "Clients are never notified of write success." - but that doesn't mean there are no write guarantees. Deepstream employs an eventual consistency model with negative acknowledgement. Every write / record state is assigned a version number. If a version number is skipped, the client automatically reconciles. If two versions clash, a merge function is triggered. Errors while writing to either the cache or the database are relayed back to the client that executed the write. If the deepstream server goes down at the exact time a write is underway and the client later reconnects, the version conflict will be detected.
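For reference, a minimal sketch of the merge side of this, assuming the JavaScript client API of that era (option name and callback signature are indicative of the then-current docs; verify against your client version):

const deepstream = require('deepstream.io-client-js')

// a custom merge strategy is invoked whenever two record versions clash
const client = deepstream('localhost:6020', {
  mergeStrategy: (record, remoteValue, remoteVersion, callback) => {
    // naive resolution: accept the remote state
    // (built-in strategies such as REMOTE_WINS also shipped with the client)
    callback(null, remoteValue)
  }
})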

rileylark commented 7 years ago

Ok, this sounds great! I don't understand this particular case yet, though:

Deepstream starts. Failed writes are not persisted to the database, however they can still be read from the cache if a client requests them. If Redis dies or restarts, the data is lost.

At what point is the client notified that the data is lost?

WolframHempel commented 7 years ago

Good question: Redis saves its data to disk as well and can thus serve as a secondary backup to the database. Other caches like Memcached or Hazelcast, however, are in-memory only and lose their data as soon as they shut down.

If the write to the cache succeeds but the write to the database fails, the client will receive either a RECORD_CREATE_ERROR for new records or a RECORD_UPDATE_ERROR for existing records.
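A sketch of how a client could observe those relayed errors, assuming the global error event of the JavaScript client at the time (the handler signature is indicative):

client.on('error', (error, event, topic) => {
  // fired when a write reached the server but failed against cache or storage
  if (event === 'RECORD_CREATE_ERROR' || event === 'RECORD_UPDATE_ERROR') {
    console.error('write failed:', topic, error)
  }
})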

If deepstream fails in the exact millisecond after a write to the cache, but during a write to the database that also fails, the data will exist in the cache only and an inconsistency will occur.

rileylark commented 7 years ago

Ok, so it's that inconsistency I'm trying to narrow in on - the case where DS successfully writes to the cache and then crashes, so now the cache & database disagree.

Am I overreacting? Is this an insignificant window for failure?

Maybe, for our specific case, we could alter the persistence logic from:

  1. Write cache
  2. Write DB

to

  1. Post the write request to pubsub with guaranteed delivery (if this fails, send RECORD_UPDATE_ERROR)
  2. Pubsub causes the actual writes via two subscriptions:
     2a. Message guaranteed to be delivered to the cache (if this fails, send RECORD_UPDATE_ERROR)
     2b. Message guaranteed to be delivered to the database (if this fails, send RECORD_UPDATE_ERROR)

For us, adding 50ms of latency would be a fine exchange for a guarantee that we would be notified of inconsistencies.
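A hypothetical sketch of that flow (all names here are illustrative; no real pubsub client is assumed):

// step 1: the only operation in the request path is an enqueue with guaranteed delivery
async function handleWrite (pubsub, write, sendErrorToClient) {
  try {
    await pubsub.publish('record-writes', write)
  } catch (err) {
    sendErrorToClient('RECORD_UPDATE_ERROR', write.recordName)
  }
}

// step 2: two independent subscriptions perform the actual writes; each one
// retries internally and reports RECORD_UPDATE_ERROR only once retries are exhausted
function startWriters (pubsub, cache, database) {
  pubsub.subscribe('record-writes', w => cache.set(w.recordName, w.data))
  pubsub.subscribe('record-writes', w => database.set(w.recordName, w.data))
}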

WolframHempel commented 7 years ago

Maybe I'm misunderstanding the exact scenario. In your description above, the deepstream server shuts down at the exact moment between a write to the cache and a write to the database - without deepstream, how would these various errors / guaranteed writes be sent?

WolframHempel commented 7 years ago

It's probably also worth mentioning that the cache exists solely to improve speed. The cache and database connector interfaces are interchangeable - if you wouldn't mind a slightly higher latency (the additional time it takes to write to / read from Mongo or Postgres as opposed to Redis or Memcached), you could simply run deepstream with a database connector in the cache slot and leave the actual database connector disabled.
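As a sketch of that setup, assuming the programmatic server API of that era (the connector module here is an illustrative choice, not a recommendation):

const DeepstreamServer = require('deepstream.io')
// illustrative: any storage connector could sit in the cache slot
const MongoDBConnector = require('deepstream.io-storage-mongodb')

const server = new DeepstreamServer()
// use the database connector as the cache and leave the storage slot disabled
server.set('cache', new MongoDBConnector({ /* connection options */ }))
server.start()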

layanto commented 7 years ago

Does a deepstream cluster provide redundancy, or only horizontal scaling for performance? If one server shuts down before it writes to the database, will another server in the cluster complete the pending work of the shut-down server?

rileylark commented 7 years ago

Does a deepstream cluster provide redundancy, or only horizontal scaling for performance? If one server shuts down before it writes to the database, will another server in the cluster complete the pending work of the shut-down server?

I believe the answer is that there is no sense of "pending work" and the data will just be lost / inconsistent.

What I want is a guarantee that clients will never be sent data that's permanently inconsistent with the database. I'm fine with eventual consistency, but I can't have the case where my cache is cleared and new reads suddenly return different data than what clients still holding data from the old, pre-clear cache are seeing. That state of inconsistency sounds so scary to me because we'd never be able to debug it and would probably not even notice it.

Firebase sends acknowledgements of writes back to clients, which is nice for two reasons:

  1. Failures like these are detectable on the client, so that UI can be updated appropriately
  2. Client UI can show e.g. "saving..." -> "saved" indicators

In your description above, the deepstream server shuts down at the exact moment between a write to the cache and a write to the database - without deepstream, how would these various errors / guaranteed writes be sent?

By writing all requests to pubsub before the cache or the database, you could reduce the requests to a single atomic operation: post to pubsub. From there if either writing to cache or database failed, pubsub could retry. After x retries some process can send RECORD_UPDATE_ERROR (and maybe clear the cache?)

(But still, in the absence of client notification of success, the client would have no way of knowing the write failed if DS fails before writing to pubsub)

WolframHempel commented 7 years ago

@rileylark I'm highly committed to making deepstream work for you and I understand your concern. I know that the problem described, whilst theoretically possible, doesn't exist in practice, as it would require both the deepstream server and the database to crash in a precise sequence: after receiving a message and after storing it to the cache, yet before the database (which is also failing) responds. As you said, using "pub-sub" or a buffered message queue would only move this theoretical scenario forward to "what happens if deepstream fails the exact moment before pushing a message on the queue"

As described above, deepstream takes two steps to prevent inconsistencies: versioned writes with negative acknowledgement, and automatic reconciliation once a version conflict is detected.

Having said that, I am starting to accept that I am being too stubborn. I know that it works and that inconsistencies are both extremely rare and eventually reconciled, creating a self-healing system, which is significantly more than most storage solutions provide by themselves. However, I do acknowledge how uneasy this makes a large part of our user base and what great lengths some have gone to in order to implement something that resembles per-write acknowledgements. I'd therefore suggest discussing with the team adding them as a feature, disabled by default. Going forward, there would be two new aspects. On the server, a new config option

confirmWrites: true

on the client a callback

myRecord.set( 'somePath', 'someValue', error => {/*null for successful writes */});

Curious about your thoughts

ronag commented 7 years ago

Why have a setting at all? If a callback is provided, then confirm; otherwise don't...

WolframHempel commented 7 years ago

Write acknowledgement is not purely a client operation. The server has to send out an ack message, effectively doubling traffic for write-heavy apps and potentially blocking update/patch messages in a different sequence. This could slow the system down to a fraction of its current performance and is unnecessary for a vast number of use cases, especially with one-directional dataflows (price grids, trading, realtime dashboards etc.), hence the explicit option.

If what you're hinting at is to make it dynamic by adding a confirm flag to the update/patch message and doing it on a per-transaction basis, that's an interesting idea though. It will however introduce a lot of complexity, especially if multiple clients write to the same record, some requiring confirmation and some not.

ronag commented 7 years ago

I'm hinting at a flag.

rileylark commented 7 years ago

Wow, this response is amazing! I hope I do not come off as pushy on this - I really appreciate deepstream even without client acknowledgment! But, yes, client acknowledgement would be a requirement for me to use DS in production.

Please let me know if you'd like to talk more about my reliability requirements or use cases! FYI, we have an app deployed to schools that currently uses Firebase to communicate between teachers and students. In this case the current DS behavior would not be "self-healing" - if a student writes an answer to a test today, and then tomorrow the teacher sees no response (or a different response), that's not something the student can quickly correct. I hear you saying that this will probably never happen, and I agree with you, but when a teacher comes to me saying "hey my students' responses aren't showing up," I don't want to even have this possibility on my mental landscape while I'm trying to understand the problem.

Specific feedback:

re. eventual consistency of current behavior

I think the current behavior can lose data and become inconsistent. I am super-ok with things being out of sync for a while, but the possibility of the client, cache, and database permanently disagreeing is what is unacceptable. I cannot depend on there being one last write that sets the record straight everywhere.

config option for turning on acknowledgments

A config option would be great. I'd set it to true all the time and I would not want it to change if I don't pass a callback. Even if I don't set an error callback, I certainly want my clients notified if some data didn't get updated. Firebase handles this by acting as if the data has changed back to its previous value, which is fine for me - the student won't think their response has been submitted because it would flick back to being blank.

e.g.:

myRecord.set( 'somePath', 'someValue') // no error callback provided
  1. Local listeners receive new value 'someValue'
  2. Client doesn't receive acknowledgement for over 1000 milliseconds (or whatever)
  3. Some error event is fired somewhere
  4. Local listeners receive "new value" of whatever the previous value was, so in case I didn't handle the error in step 3, at least my local data is in sync with server.

If this were how it worked, then I could live a care-free life setting data on the server, and never polling to see if I have up-to-date data.
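A hypothetical client-side sketch of that behaviour (this is not a deepstream API; it assumes the per-write callback proposed in this thread):

function setWithRollback (record, path, value, timeoutMs = 1000) {
  const previous = record.get(path)
  let settled = false

  // assumes the proposed write callback: error is null on confirmed writes
  record.set(path, value, error => {
    settled = true
    if (error) record.set(path, previous) // step 4: revert local state
  })

  // steps 2/3: no acknowledgement in time counts as a failure
  setTimeout(() => {
    if (!settled) record.set(path, previous)
  }, timeoutMs)
}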

acknowledgment callback

I do also love the acknowledgement callback with error parameter. In our app we have some quickly-updating values that we don't really care if some of the set operations fail in the middle, but we have other values like user preferences where it's really important to be sure they are permanently persisted before optimistically updating the UI. An optional callback lets us choose whether or not to be optimistic very easily. Note: if the presence of the callback changed the reliability of the set then we would probably pass a no-op callback to every single set call.

the very next write would detect the broken version chain and reconcile the error

I see what you're saying and this is often how we use firebase, but we need to be guaranteed that the last write is permanently reconciled (eventually) as well.

the responsiveness of your community

This has been amazing! Please take all this with a grain of salt - I would like to be involved even if you do not prioritize my concerns. Also, please let me know if this is something you would be interested in my help with. Our company is currently dropping $10k/mo on Firebase and so it makes some financial sense for us to lend a hand with this, as well as the super-fun of working on such an exciting project. If you'd like to talk more about how we can help, I'd be happy to jump on a call.

Thanks again!

ronag commented 7 years ago

I think this could be solved in a simple (?) and performant manner using fencing, e.g.

<<FENCE_START 0
<<UPDATE 1
<<UPDATE 2
<<UPDATE 3
>>ERROR 1
<<FENCE_END 0
>>ERROR 2
>>FENCE_ACK 0

i.e. the client batches and surrounds UPDATE messages with FENCE_START & FENCE_END. After FENCE_END the server sends any failures followed by a FENCE_ACK. If no errors are received before the FENCE_ACK, the client can assume that everything is ok and persisted to storage. Of course this would only happen if any of the updates have a corresponding error listener.

This would of course mean that the result callback will not be called as fast as possible, but with some small latency, which I believe is ok in the cases that actually require confirmation.
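In pseudo-code, the client side of that idea might look like this (entirely hypothetical wire format, mirroring the trace above):

let nextFenceId = 0

function sendFencedBatch (connection, updates) {
  const id = nextFenceId++
  connection.send(`FENCE_START ${id}`)
  updates.forEach(update => connection.send(`UPDATE ${update}`))
  connection.send(`FENCE_END ${id}`)
  // any ERROR arriving before FENCE_ACK <id> marks part of the batch as failed;
  // a FENCE_ACK with no preceding errors means everything reached storage
}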

rileylark commented 7 years ago

Of course this would only happen if any of the updates have a corresponding error listener.

FWIW, our team would not get any value out of the per-write preference for acknowledgement - we'd rather be confident that every write was successful.

rileylark commented 7 years ago

Re. batching acknowledgements: a delay of under 50ms would be fine for our use case and I think you are right that any UI waiting for an acknowledgement probably wouldn't mind a small delay. OTOH the bandwidth saved seems small and if batching acknowledgements requires additional complexity on the server it might not be worth it. Bandwidth is cheaper than RAM for us.

ronag commented 7 years ago

we'd rather be confident that every write was successful.

Then make sure to provide an error listener with every write...

bandwidth saved seems small

Actually, it's not entirely negligible in many cases... I think an all or nothing solution is not a step forward...

WolframHempel commented 7 years ago

I think there's a bit of a misunderstanding: we already notify clients if a write failed; the feature we'll add is an explicit confirmation that it succeeded. Therefore (I'm aware that I originally suggested them), explicit "always on" confirmations aren't needed, as this is already covered.

For technical reasons, fencing / commit blocks from the client don't make too much sense. Multiple clients write to the same record, and the server already maintains short-lived update queues to deal with that. The write confirmation will be sent as soon as the current update queue completes successfully (most of the time this queue will only contain a single update; only at high write concurrency will there be more).
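Conceptually (a sketch of the described behaviour, not the actual server implementation):

// each record has a short-lived queue of pending updates
class UpdateQueue {
  constructor () {
    this.callbacks = []
  }

  // a queued write may carry a confirmation callback
  add (update, onConfirm) {
    if (onConfirm) this.callbacks.push(onConfirm)
    // ... apply the update to cache and storage ...
  }

  // once the whole queue has been written successfully, confirm every writer at once
  complete (error) {
    this.callbacks.forEach(cb => cb(error || null))
    this.callbacks = []
  }
}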

rileylark commented 7 years ago

Imagine if you read a section of the postgres docs and it said in small type, "by the way, if you don't check the return value of the insert command, we don't guarantee the insert succeeded." Or imagine if you were saving a file on your hard drive, but unless you checked to see that it was there, there's a chance the file would disappear after you restart - it was saved in the cache but not in the DB.

I think there's a bit of a misunderstanding: We already notify clients if a write failed, the feature we'll add is an explicit confirmation that it succeeded.

Yes, I don't understand this. Writes can fail without the clients being notified! That's the whole reason I posted this issue.

Not only can writes fail after getting to DS, as we have discussed, but there are many other scenarios where a message can leave a client and never even get to DS. My product is deployed inside school buildings, which have the craziest sets of browser extensions and network filter rules you can imagine. In the case of a network failure after a write, the client will think the write has succeeded even though it failed, and DS will never notify it otherwise.

Then make sure to provide an error listener with every write...

👍 will do - again, I love DS in its current form and we have the resources to make it work like we need it to! Thank you for all your work!

WolframHempel commented 7 years ago

We do provide mechanisms to trace write errors, and with the upcoming release we'll provide callbacks for guaranteed writes, fully enabling your use case / fulfilling the requirements of a project management app. Please don't underestimate though that deepstream is a very broad church that caters for a lot of use cases where write consistency is less of an issue, or no issue at all, such as multiplayer gaming, financial trading or realtime dashboards.

rileylark commented 7 years ago

Great! Let me know if there's anything we can do to help, either with this issue or with another area! The enthusiasm for deepstream is spreading across our office :D

yasserf commented 7 years ago

@rileylark We are looking to release this functionality as part of 2.1 on Friday next week; we'll give you a heads up as soon as there's an RC for feedback.

There are a few small quick wins we want to get out first in the next few days, mainly addressing feedback we got from the large number of clients the community is writing.

In regards to help, we have a waffle board for the current roadmap and bugs. If you're using a specific front-end tech like React, Vue or something else, it would be interesting to see how and whether we can create plugins to make things integrate better.

The best help though would be if you could write a blogpost about how you use deepstream, your use case is quite interesting and it would be great to spread the word!

WolframHempel commented 7 years ago

@rileylark This has been released as part of 2.1. Please find the updated documentation here

snc commented 7 years ago

Let's assume there is a record {"firstname": "Homer"}... when running the following code, nothing appears in the console:

record.set('firstname', 'Homer', err => {
  if (err) {
    console.log('Record set with error:', err)
  } else {
    console.log('Record set without error')
  }
})

So the callback is not executed; is this expected behaviour?

AlexBHarley commented 7 years ago

@snc this won't be written to cache/storage as you're not making any changes to the data. This occurs with or without write acknowledgement. The example on the website could take this into account though, and we'll get that changed.
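For illustration, actually changing the value goes through the write path and does invoke the callback (per the 2.1 write-acknowledgement API discussed above):

record.set('firstname', 'Marge', err => {
  // err is null once the write has reached cache and storage
  console.log(err ? 'Record set with error: ' + err : 'Record set without error')
})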

snc commented 7 years ago

OK thanks for clarification.