audit of redis driver usage

nathan-muir commented 6 years ago

Hey @theodorDiaconu

I've been investigating the story for failing over redis, and wanted to share some findings -

I was also searching for issues that could cause the cache in ObservableCollection to remain out of sync/dirty.

Considering-

tfetch = When cursor.fetch is called
tsub = When the subscription is confirmed by redis

Then, any situation where tfetch < tsub could result in dropped changes & a dirty cache.

`ready`

Currently, we watch the connect[1] event, and re-poll any observers on reconnect.

However, during a reconnect, subscriptions are not re-established until the ready event is fired.

If we re-fetch documents during connect (tfetch) but the subscriptions aren't active until ready (tsub), therefore tfetch < tsub results in a dirty cache.

So when re-establishing a connection, PublicationFactory.reloadAll() should wait for the ready event instead of connect.

[1] The connect event is fired immediately when the connection is (re)established [2] The ready event is fired once the server accepts commands, and, on reconnect after subscriptions are restarted, and outstanding/queued commands are all sent.

`subscribe` can be delayed

When establishing an observeChanges, we subscribe to the channel in redis, and then fetch the documents.

This would normally result in tsub < tfetch, which is good.

Except, there are some circumstances when calling redis.subscribe(...) that the command could be delayed. For example, if the stream buffer is full [3] it wont be sent immediately.

These circumstances could result in tfetch < tsub, and a dirty cache.

It would make sense for new RedisSubscriber(...) or redisSubscriber.init() to only return once the subscriptions are actually active on the connection. So we could ensure no changes are missed between the initial fetch and configuration of the subscribers.

Caveat: This would mean, that new subscriptions would hang while redis is disconnected (eg, during a failure), unless explicitly allowed/bypassed (eg, it would poll once redis was back).

[3] https://github.com/NodeRedis/node_redis/blob/master/index.js#L883

No way to reload data if a node fails

Not quite an issue with redis driver, but more the architecture -

Currently, if a meteor-instance disconnects from redis, it will refresh/repoll it's subscriptions when it reconnects.

During a redis failure, if all meteor-instances disconnect & then reconnect around the same time, the view of data should be roughly consistent.

However, if a single meteor-instance fails completely with unpublished mutations, those changes wont be seen/recovered by other severs.

As an ObservableCollection can be very long-lived, this means stale data could remain on other severs for quite some time.

It would be good to have something like:

ObservableCollection automatically re-polls after some interval (eg, 10mins?)
Add an Events.POLL message type, which causes the observer to re-poll mongo

theodorDiaconu commented 6 years ago

@nathan-muir nice audit

We're dealing with a dual write system. It's a hard problem to solve. This is why this: https://github.com/tulip/oplogtoredis got born. I'm going to try to understand everything you wrote here (some things are a bit unclear)

The ready event is fired once the server accepts commands, and, on reconnect after subscriptions are restarted, and outstanding/queued commands are all sent. [2]

Why is this a problem, I'm guessing there is a queue of redis commands that are on stand-by to be sent even if the conn failed (maybe?) -- have to investigate myself

I still don't understand why we need to make the subs hang.

nathan-muir commented 6 years ago

Hey @theodorDiaconu , Yeah, it's a tough problem to solve. (Might be easier to migrate to CouchDB)

I've updated the original post with further clarification.

heberallred commented 2 years ago

These are definitely the core issues of the redis-oplog package. I have created a work-around for one of these issues which reloads the data to update the observer cache on each new subscription with a minRefreshInterval which could be increased if users want less DB load. It reloads server cache only after the initial cache is sent to the client since most of the time the cache is correct and we don't want to slow down page loads for users. Any discrepancies will be sent as adds/updates/removes after the initial adds are sent to the client. I know this is not ideal for some use cases, but in our case it works great. We'd rather have a little extra DB load than to have server cache get out of sync (with no way to get back in sync without a server restart). It's still blazing fast for reactive updates, too. I'm sure someone who understands the package could write this code more elegantly, but I'm putting our work-around here in case it helps someone. https://github.com/zzbots/redis-oplog/commit/d4a8c40379319bf05c6b5593b5718b9216958373

Now we're trying to deal with the other issues of updates to records getting missed client side in the browser (UI issue only) due to subscription not finishing before the updates are sent from Redis.

cult-of-coders / redis-oplog