cult-of-coders / redis-oplog

Redis Oplog implementation to fully replace MongoDB Oplog in Meteor

A fork of redis-oplog for infinite scalability #362

Open ramezrafla opened 3 years ago

ramezrafla commented 3 years ago

@evolross @SimonSimCity @maxnowack (pls feel free to tag more people)

Also, redis-oplog is slowly going into disinvestment.

We created a fork (not public yet, considering my options) which does the following (more technical notes below):

  1. Uses a single timed cache, which is also the same place you run 'findOne' / 'find' from -- so full data consistency
  2. Uses redis to transmit changes to the other instances' caches -- consistency again
  3. During updates, we mutate the cache and send only the changed fields to the DB -- instead of the current find, update, then find again, which makes 2 more DB hits than needed
  4. Same for inserts -- we build the doc and send it to the other instances
  5. We use secondary reads in our app -- there are potential race conditions in extreme cases, which we are working on addressing by using redis as a temporary cache of changes

RESULTS:

Here are the technical details:

  1. A single data cache at the collection level stores the full doc; the multiplexer sends each client only the fields allowed by its projection (i.e. the fields: {} option). collection.findOne fetches from that cache -- this results in cache hit rates of 85-98% for our app
  2. That cache is timed; the timer resets whenever the data is accessed
  3. Within Mutator we mutate what is in the cache ourselves (if it's not there, we pull it from the DB and mutate) -- in other words, we don't do an update followed by a find, so it's usually a single DB hit (the update). We also do a diff so that we only dispatch fields that have changed. Same thing with insert: we build the doc and dispatch it in full to redis (a sketch follows this list)
  4. We send to redis all the fields that have changed; the redis subscriber uses that data to extend the doc stored in its cache (or pulls from the DB and then extends it with the update event). Inserts are trusted and stored in the cache directly
  5. We now use secondary DB reads, which gives much higher scalability. This is why we have #3 and #4 above: we trust redis and the cache over DB reads to avoid race conditions. We do get race conditions every once in a while (e.g. new subs and reads); we know where they can occur and catch them there. Otherwise, we always trust the cache over data read from the DB
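A minimal sketch of the caching and diff-dispatch idea in items 1-4, assuming a hypothetical `DocCache` class and `publishToRedis` helper (the names, channel format and event shape are illustrative, not the fork's actual API):

```js
// Sketch of a collection-level timed cache with diff-based dispatch.
// DocCache, publishToRedis and the message format are assumptions for
// illustration, not the fork's real implementation.
class DocCache {
  constructor(ttlMs = 60 * 1000) {
    this.ttlMs = ttlMs;
    this.docs = new Map(); // _id -> { doc, lastAccess }
    setInterval(() => this.evictStale(), ttlMs).unref();
  }

  get(_id) {
    const entry = this.docs.get(_id);
    if (!entry) return null;
    entry.lastAccess = Date.now(); // every access resets the timer
    return entry.doc;
  }

  set(_id, doc) {
    this.docs.set(_id, { doc, lastAccess: Date.now() });
  }

  evictStale() {
    const now = Date.now();
    for (const [_id, entry] of this.docs) {
      if (now - entry.lastAccess > this.ttlMs) this.docs.delete(_id);
    }
  }
}

// On update: mutate the cached doc locally, write only the changed fields to
// MongoDB, and publish the diff to Redis so other instances can extend their
// own caches without re-reading the DB.
async function applyUpdate(cache, rawCollection, _id, changedFields) {
  const cached = cache.get(_id) || (await rawCollection.findOne({ _id }));
  cache.set(_id, { ...cached, ...changedFields });
  await rawCollection.updateOne({ _id }, { $set: changedFields });
  publishToRedis('myCollection', { e: 'u', _id, fields: changedFields }); // hypothetical helper
}
```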

QUESTION: Is this of interest? Should we have a community version of redis-oplog that we all maintain together?

ramezrafla commented 3 years ago

Adding @afrokick, @edemaine, @adamgins, @hanselke, @znewsham, @vlasky, @karen2k, @Floriferous, @nathan-muir, @rj-david

Sorry if you feel spammed; pls unsubscribe if not interested. Trying to involve the most recently active users.

afrokick commented 3 years ago

@ramezrafla Did you think about merging your work into this repo?

edemaine commented 3 years ago

This sounds very cool! I do worry about cache consistency / race conditions though. In my app, several users/servers could be modifying the same object at almost the same time, so I wouldn't fully trust the cached copy -- while doing a local update there could also be an update coming (but not yet delivered) via Redis from another server. Is there a way to detect this and fix things?

ramezrafla commented 3 years ago

@edemaine

  1. We can add intelligence to redis via a Lua script (or a separate process that does nothing but manage data). redis acts as a sort of edge cache in this case and becomes the 'golden' copy. It will require some further work (a sketch of the idea follows below).
  2. We can also use the GO listener @SimonSimCity created and treat the DB as the golden source.
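For item 1, a minimal sketch of what "intelligence in redis" could look like, using a Lua script so Redis itself decides whether an incoming write is newer than what it already holds. The key layout, the version field and the use of ioredis are assumptions for illustration only:

```js
// Hedged sketch: a Lua script makes Redis the "golden" copy by only accepting
// a cached doc if its version is newer than the stored one. Key names and the
// versioning scheme are illustrative assumptions.
const Redis = require('ioredis');
const redis = new Redis();

// KEYS[1] = cache key, ARGV[1] = incoming version, ARGV[2] = serialized doc.
const SET_IF_NEWER = `
  local current = tonumber(redis.call('HGET', KEYS[1], 'version') or '0')
  local incoming = tonumber(ARGV[1])
  if incoming > current then
    redis.call('HSET', KEYS[1], 'version', incoming, 'doc', ARGV[2])
    return 1
  end
  return 0
`;

async function cacheDocIfNewer(collection, _id, version, doc) {
  // Returns 1 if Redis accepted the write, 0 if it already held a newer copy.
  return redis.eval(SET_IF_NEWER, 1, `${collection}:${_id}`, version, JSON.stringify(doc));
}
```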

But you nailed it: to avoid race conditions, the current redis-oplog is cumbersome and heavy on DB hits.

@afrokick There are some serious departures from the current approach. Some developers are happy with the way it is and I don't want to disturb them. Also, I don't like the Swiss-army-knife approach: the code became very complex trying to please a lot of people. There were bugs, old stuff, etc.

Jack of all trades master of none :)

znewsham commented 3 years ago

This does sound cool. I'd be worried about the memory cost of this though: "A single data cache at the collection-level stores the full doc" -- is this all documents, or only the documents currently subscribed to? Similarly, is it actually the full doc, or the full doc with respect to all currently active observers?

ramezrafla commented 3 years ago

@znewsham You are right to be concerned about memory, I was too

  1. We use far less memory than the current redis-oplog -- currently you have (as mentioned above) 2x the data for each observer, and you duplicate it again for each new observer (that is not already there)
  2. We store the full doc, but only when it is used. If it isn't, we never touch it. It is then cleared after a certain timeout (configurable via settings)

Instead of the full doc we could look into fetching only the needed fields -- but that complicates things and somewhat negates the approach of not making DB calls. It all depends ...

znewsham commented 3 years ago

Regarding the timeout, presumably it is a timeout that runs after the last observer has stopped observing? I.e., it will never expire while it is still needed?

I think the storage of full documents would need to be an option for this to be useful, I'm thinking of a few different scenarios here:

  1. tables/lists that render perhaps 100 documents (only requiring a few fields on each) -- this would require storing lots of potentially large documents. In some of the projects I work on the documents can be hundreds of KB each, and if multiple users are subscribed to different subsets this can get large quickly.

  2. server-side observers (e.g., to maintain an in-memory cache) will subscribe to ALL documents in a collection, again requiring very few fields, but widening this to the entire document could cause a memory explosion.

I looked into this issue a while ago and took a slightly different approach using a global map of documents for a collection. The individual observers all point to this global map rather than maintaining their own document store, and add/remove fields to it as necessary, so regardless of how many observers you have, a document exists exactly once while observed. This solves part of the memory problem but not all of it -- the other part is in the individual subscriptions (per session). I resolved this by, in some cases, removing the mergebox. In a minority of cases this results in too many changes being sent to the client (typically where reactivity is being managed by the server, e.g. counts), but in most cases it results in no extra network use and vastly lower CPU/memory usage.
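A minimal sketch of that shared-document-map idea, under the assumption of a simple reference count per document (names are illustrative, not the actual implementation):

```js
// Hedged sketch: observers share one stored copy of each doc and
// reference-count it, instead of each observer keeping its own store.
class SharedDocStore {
  constructor() {
    this.docs = new Map(); // _id -> { doc, refs }
  }

  // Called when an observer starts caring about a document.
  retain(_id, doc) {
    const entry = this.docs.get(_id);
    if (entry) {
      entry.refs += 1;
      Object.assign(entry.doc, doc); // merge in any fields this observer needs
      return entry.doc;
    }
    const fresh = { doc: { ...doc }, refs: 1 };
    this.docs.set(_id, fresh);
    return fresh.doc;
  }

  // Called when an observer stops; the doc is dropped once nobody points at it.
  release(_id) {
    const entry = this.docs.get(_id);
    if (!entry) return;
    entry.refs -= 1;
    if (entry.refs === 0) this.docs.delete(_id);
  }
}
```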

I'm also interested in the 80% -> 7% CPU drop on your primary, is this mostly caused by the cache (not needing to hit DB at all) or by spreading the workload across many secondaries?

ramezrafla commented 3 years ago

So the way we did it is based on access. Even if you are subscribed to a document, if it never changes, why would we keep it in cache? So I like the timeout approach more than server-side observers. We fetch the doc when it is needed (say by a find or a pub) and clear it when it hasn't been accessed in a while.

In terms of full doc vs selected fields: I see the issue. We don't have large docs by design. We could look into it, maybe add an extra setting. I am trying to avoid the Swiss-army-knife approach; I want to fit a specific need for 95% of the people rather than have it bloated, hard to maintain, and buggy. Something to think about.

The drop in CPU on the primary was due to caching, i.e. fewer hits. The 2 secondaries each saw an increase of only ~8% due to the secondary reads.

So effectively we saw a global drop due to caching of 80 - 7 - (8 × 2) ≈ 60 percentage points, and a drop on the primary, due to caching + secondary reads, of 80 - 7 = 73 percentage points.

I know we can do better with our app and reduce DB hits further (e.g. more ID-based calls) -- if a find or update call is NOT ID-based, we HAVE to go to the DB.

znewsham commented 3 years ago

Got it. Regarding the cache: we use it for super-fast lookups (fields that are heavily referenced by server-side methods), and there's no way of knowing which documents are required in advance, so we load everything. We could probably improve this with locality -- many older documents aren't heavily required by the server, so we could probably pay the price of a DB round trip in those cases.

theodorDiaconu commented 3 years ago

@ramezrafla and everyone, I repeat and stress this: RedisOplog is not going into disinvestment.

We are taking care of big issues and we still want to ensure a good experience. The option you mention could be implemented as an iteration on the "protection against race conditions" kind of subscription; we would need to merge it in. Your solution completely overrides the race-condition protection. Most of my clients, even private ones, handle over 10k users with the current version of RedisOplog and channel subscriptions; CPU is stable, and with race-condition proofing.

ramezrafla commented 3 years ago

Thanks @theodorDiaconu

Sorry if you were offended by my comment; that was not the intent. The pace of updates has admittedly slowed down, and there are many issues + PRs open. If this is sufficient for your customers' needs, then great.

And also a thank you for our private conversations.

As discussed, I am gauging community interest in this approach to reach the next level of scalability. We are not talking about 10k users as you mention, but millions of users. The current approach will not scale well (at least not for us), though it may well be sufficient for your customers. We lost business because of it, as the servers failed in production and we had to scramble to find a solution. This new redis-oplog is the result of long hours of work and testing, and works great in production (for us).

If no one is interested, we will proceed on our own. It's a business imperative for us.

copleykj commented 3 years ago

@ramezrafla if you are interested in releasing this and having it maintained by the community, the Meteor Community Packages org would make a great home.

ramezrafla commented 3 years ago

@copleykj Thanks for your response.

Releasing a package for the community takes time and effort to prepare, document and support. If there is no interest, and @theodorDiaconu + others are somewhat maintaining this package, then there is no point.

That's the point of this thread: are others facing the same issues, and are they interested? Seems like no.

edemaine commented 3 years ago

It's good to explore new ideas! I think the only question here is whether a fork or merging or something else is better. Publishing your code would be a good start, so we can explore the differences/similarities...

A natural question is whether some of your ideas can be incorporated into this package, e.g., when protectAgainstRaceConditions is false. I definitely like the concept of not storing one copy per listener.

znewsham commented 3 years ago

I think it's worth releasing it - always good to get multiple sets of eyes onto it, perhaps the way you've solved a specific set of problems is better than the way others have, perhaps there is a way to make your changes less drastic (e.g., optional) to make it a better candidate for merging. Hard to tell without seeing any code :)

ramezrafla commented 3 years ago

@edemaine @copleykj I'll publish the code ASAP then. The changes are very deep in Mutator, extendMongoCollection and observeChanges. There are also serious changes in the redis listener to update the stored cache.

It's a philosophical difference, I can't see how we can merge. But I'll leave it up to you to give me your feedback.

ramezrafla commented 3 years ago

New repo is online: https://github.com/ramezrafla/redis-oplog

Please don't use in production until you are sure it is working for you in a staging environment with multiple servers.

SimonSimCity commented 3 years ago

@ramezrafla since you mentioned it, have you tried to use the GO application in your use-case to see what kind of improvements it would give you?

I didn't write the application (credits to @benweissmann; @mammothbane has also contributed quite a lot lately) but I'm a heavy user of redis-oplog alongside oplogtoredis. I don't know how, in your case, it would reduce the reads from the application. Our use-case is a very write-intensive application: not every write is interesting to the user as it happens, and it's quite a limited user group that has access to the website. Once a change happens, those users want to see it immediately.

For this reason the combination fit us quite well, because it allowed us to skip evaluating changes on a collection nobody has a publication open on. Your use-case might well require a different solution, but I'm keen to hear about it -- and where you see the biggest bottlenecks. Is your application write- or read-heavy, or rather balanced? You've most likely already tried the fine-tuning tips ... It would be nice to get a higher-level perspective on the things you were facing.

ramezrafla commented 3 years ago

@SimonSimCity thanks for your message.

First, a side question: based on your description of your application, why don't you just use regular mongo oplog tailing? Meteor was specifically designed for your use-case; you wouldn't need to bother with at least 3 other external packages and services (redis, oplogtoredis and redis-oplog).

Based on what you described, our application is the exact opposite of your use-case: we are read-heavy with a large real-time user base, which means all the duplicated data that redis-oplog stores kills our memory (I mentioned it to @theodorDiaconu: the 2x duplication of data in ObservableCollection is a bug with no real purpose, and there are also places where data fetches are not needed, e.g. in ObservableCollection.removed you just need the _id, not the full doc).

Personally, when things went sour and I had to go into the code to figure things out and saw this trivial bug, I lost confidence in the original package. We lost business because of this! People can say all they want, but redis-oplog is NOT meant or optimized for large-scale applications and needs some serious attention to clean it up (unused code and trivial optimizations, for example -- take a look at our ObservableCollection and Mutator).

Our challenge gets even more complicated as we scale up. The heavy reads killed the memory and the DB; we hit 100% CPU on the primary node -- this is especially true with the continuous reads done to avoid race conditions. Caching and mutating locally solved the issue for us (since you have the data, you mutate it and send it to your users without waiting for DB results).

Now ... we do intend on using oplogtoredis at our next jump in user-base size (which seems very soon). Here is why: if we stopped server-side optimistic updates and kept reads on secondaries, then our redis-oplog would hit the DB for mutations and cache the results coming back from redis. Most of the time it's a one-way DB hit. This would be the ideal scenario (we are only mutating on our side because we want to avoid DB hits -- once the DB is sending us the data via oplogtoredis, we can trust its data).

I hope I answered well. Please let me know if I can clarify further.

ramezrafla commented 3 years ago

@SimonSimCity I made quite a few cleanups to the message above, please read from Github directly. Sorry -- went a bit too fast this morning.

SimonSimCity commented 3 years ago

why don't you just use regular mongo oplog?

Well, because each Meteor instance, using the traditional mongo oplog, scans through the full oplog on the database, and our write-heavy workload changes documents which are not always being watched by a user. The load of those operations kept our instances busy at 50% CPU -- which was the reason we opted for redis-oplog. Since the traditional mode of this package does not notify users about updates made by a non-Meteor application, we now let the GO application oplogtoredis read all the changes, process them and forward them to Redis. All application instances subscribe to Redis and only get notified about changes on documents of a collection a user is actually viewing, and they may then have to ask the database again if the change-set doesn't contain all the necessary information (a rough sketch of this flow follows).
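A rough sketch of that flow, under assumed channel naming (one Redis channel per collection) and message shape; the real oplogtoredis / redis-oplog wire format may differ:

```js
// Hedged sketch: each Meteor instance only subscribes to the Redis channels
// of collections that currently have observers, so idle collections cost
// nothing. Channel names and the message payload are assumptions.
const Redis = require('ioredis');
const sub = new Redis();

const activeObservers = new Map(); // collectionName -> Set of observers

function ensureSubscribed(collectionName) {
  if (!activeObservers.has(collectionName)) {
    activeObservers.set(collectionName, new Set());
    sub.subscribe(collectionName); // no observers => no subscription, no work
  }
}

sub.on('message', (collectionName, payload) => {
  const observers = activeObservers.get(collectionName);
  if (!observers || observers.size === 0) return; // cheap no-op when idle
  const change = JSON.parse(payload); // e.g. { _id, fields } forwarded by oplogtoredis
  // If the change-set is incomplete, this is where a DB re-read would happen.
  observers.forEach((observer) => observer.onChange(change));
});
```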

Would be nice to have a chat with you on the Meteor Community Slack group (https://github.com/Meteor-Community-Packages/organization#slack). Just ping me there privately.

evolross commented 3 years ago

Just now seeing this. Been busy with our production app seeing a lot of use.

Something I'm seeing when our Galaxy containers get a lot of users (e.g. four Galaxy Quad containers sharing 3K to 5K users -- which sadly doesn't seem like a lot for four quads) is dropped DDP or Redis messages. A small number of users' apps will just not get the update needed to stay in sync with the presentation.

Could this be related?

jasongrishkoff commented 3 years ago

@evolross been having a similar issue with my chatrooms using redis-oplog. I've got the subscription using a unique namespace, and each insert/update uses that same namespace. It works most of the time, but if someone has had the tab in the background for a little while (as an example) it seems to stop updating the subscription (I can tell I'm still subscribed, but the newest documents haven't come through). The only way to get the fresh info is to refresh or create a new insert from the client side. It's a bit funky when you write a chat message only to see 5 new ones pop up that you hadn't seen because you weren't getting the updates.

evolross commented 3 years ago

@jasongrishkoff Have you tried experimenting with @ramezrafla's fork yet? I'm going to be getting to this here in the next week or so, so I should know if it helps.

jasongrishkoff commented 3 years ago

Yes @evolross it looks very promising, but as soon as I rolled it to production my 3x quad galaxy containers all hit 100% CPU and crashed. I've opened a few issues that @ramezrafla is looking into :)

ramezrafla commented 3 years ago

Thanks @evolross and @jasongrishkoff

Mini-mongo doesn't support positional operators. I am working on an escape hatch to fall back to the old "fetch-update-fetch" cycle for those.

evolross commented 3 years ago

Just a thought (since there's so many here that have experience with Meteor reactivity at scale)... has anyone experimented with going back to plain old oplog-tailing? Perhaps with larger-sized containers of a lesser quantity? I noticed Galaxy now offers 8x and 16x containers. We run on Galaxy, so this thought had crossed my mind when I saw that those were available.

I wondered about performance with oplog-tailing and a lesser number of containers.

jasongrishkoff commented 3 years ago

Honestly, the main reason I've been experimenting with redis-oplog is because my mongodb instance is getting hammered and I've had to scale up considerably. It's now my biggest cost. A significant amount of that demand is coming from subscriptions that I can't replace with methods, and from what I can see, redis-oplog hasn't actually helped much there. So I am indeed debating switching back to regular oplog-tailing to see what happens. Seems way less complex, and allows me to just trust Meteor to do its thing so that I can continue on with the development of the product (which is where the most can be gained from a business standpoint).

I get the idea that @ramezrafla's solution would help considerably take pressure off my database because of the caching involved? But I worry about that cache properly invalidating / propagating changes to users when it needs to.

ramezrafla commented 3 years ago

@jasongrishkoff

I truly feel your frustration; we faced the exact same issue. To be honest, I would never go back to oplog-tailing, as you cannot scale with it. You know the drill: as more Meteor instances come on board, the load of just watching the oplog grows until you hit a breaking point where you are spending more time looking at the oplog than doing actual work.

My solution DOES work and has been in production for a few months now. I personally invested hours in it and it's crafted and tested with care. I almost lost business because of the original redis-oplog; I had to work hard, pull late nights, and beg for another chance.

Getting to what you want is really trivial -- it's 95% there. What you need is an escape hatch (i.e. a bypass) at the beginning of Mutator to check whether the query uses operators not supported by mini-mongo ($slice, positional operators) and, if so, revert to MongoDB. If you are willing to invest a few hours testing it with me, I am in!
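A minimal sketch of such an escape hatch, assuming hypothetical helpers (`modifierNeedsDbFallback`, `applyLocally`) and a `cache` object; the operator list and detection rules are illustrative only:

```js
// Hedged sketch: before mutating the cached doc locally, detect modifiers
// that mini-mongo cannot apply (positional operators, $slice, ...) and fall
// back to the classic update-then-refetch path for those.
const UNSUPPORTED_PATHS = [/\.\$/, /\$\[/];        // positional operators in field paths
const UNSUPPORTED_OPERATORS = new Set(['$slice']); // e.g. inside $push

function modifierNeedsDbFallback(modifier) {
  return Object.entries(modifier).some(([op, fields]) =>
    Object.entries(fields || {}).some(([path, value]) => {
      if (UNSUPPORTED_PATHS.some((re) => re.test(path))) return true;
      if (value && typeof value === 'object') {
        return Object.keys(value).some((k) => UNSUPPORTED_OPERATORS.has(k));
      }
      return false;
    })
  );
}

async function update(collection, cache, selector, modifier) {
  if (modifierNeedsDbFallback(modifier)) {
    // Old behaviour: let MongoDB apply the modifier, then re-fetch the doc.
    await collection.rawCollection().updateOne(selector, modifier);
    const fresh = await collection.rawCollection().findOne(selector);
    cache.set(fresh._id, fresh);
    return fresh;
  }
  // Otherwise: mutate the cached copy locally and dispatch only the diff.
  return applyLocally(collection, cache, selector, modifier); // hypothetical helper
}
```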

ramezrafla commented 3 years ago

> I get the idea that @ramezrafla's solution would help considerably take pressure off my database because of the caching involved? But I worry about that cache properly invalidating / propagating changes to users when it needs to.

There are many mechanisms in place to automatically update the data when it changes, including a race-condition detector which pulls from the DB whenever there is a risk of the data being stale.
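A minimal sketch of what such a staleness check could look like, assuming each cached doc carries a version counter that Redis events also reference; the field names and the exact rule are assumptions, not the fork's actual mechanism:

```js
// Hedged sketch of a race-condition detector: if an incoming Redis event is
// out of sequence, or refers to a doc we have never cached, re-read from the
// DB instead of trusting the event.
async function onRedisUpdate(cache, rawCollection, { _id, version, fields }) {
  const cached = cache.get(_id);
  if (!cached || version !== cached.version + 1) {
    // Possible gap or stale data: fall back to the database as the source of truth.
    const fresh = await rawCollection.findOne({ _id });
    if (fresh) cache.set(_id, fresh);
    return;
  }
  cache.set(_id, { ...cached, ...fields, version });
}
```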

jasongrishkoff commented 3 years ago

@ramezrafla 100% down to test. I'll find some time today to send you a brief overview of how my app works just in case you see areas that might cause concern.

jasongrishkoff commented 3 years ago

Just an update: this morning it dawned on me that I could offload my cron jobs to my MongoDB secondary replica-set members using readPreference=secondaryPreferred, which has made a pretty big difference. It frees up more on the primary!
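For reference, a hedged sketch of the two usual ways to route such reads to secondaries; the host names, credentials and collection are placeholders, not the poster's setup:

```js
// Option 1: connection string (affects all reads on that connection):
// MONGO_URL="mongodb://user:pass@host1,host2,host3/mydb?replicaSet=rs0&readPreference=secondaryPreferred"

// Option 2: per-query via the raw Node driver collection, so only the cron
// job's reads go to a secondary while the rest of the app stays unchanged.
async function runCronReport() {
  return MyCollection.rawCollection() // MyCollection is a placeholder Meteor collection
    .find({ status: 'pending' }, { readPreference: 'secondaryPreferred' })
    .toArray();
}
```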