Best practices for bootstrapping many grains

kehigginbotham commented 7 years ago

Hello,

First, hats off to the Orleans team. Orleans is a fantastic framework that our team feels fits our needs well.

That said, we are curious about the best practices for bootstrapping a large number of grains. In our use case, we have ~30K grains we need to activate before our silo is finished starting. Each of these grains represents a record in a legacy database. We will also have new records/grains added throughout the silo/cluster runtime. For now, assume that updates/deletes are communicated correctly to the legacy system.

We are being forced to maintain backward compatibility (BC) with the legacy system, as our new Orleans-powered system will run alongside the legacy system for a brief period. Our primary problem with this constraint is the dependence on SQL Server and the resulting latency.

For clarification, we intend the term "initialization" to mean "the first time a grain is activated in a given silo/cluster up-time period".

We have tried the following methods for bootstrapping and met mixed results:

1. Placing initialization logic with direct SQL in the grains' OnActivateAsync(): results in 1 SQL request per grain, which puts too much stress on our SQL server and is too slow.
2. Adding a method on the grain to set initial state, then fetching all records and activating all the grains by setting this initial state: only one SQL call, but we can't differentiate between initialization and subsequent re-activations.
3. Calling proxy grain(s) from the grains' OnActivateAsync() (the proxy grain fetches all or a subset of the records on activation): SQL traffic is much lower, but subsequent re-activations could fetch a stale record, and introducing a timer to refresh the records does not eliminate this possibility.

Thus, here is our question: What is the best practice for bootstrapping/initializing these grains?

sergeybykov commented 7 years ago

Thank you for your kind words!

I'm not sure I fully understand the need for

> ~30K grains we need to activate before our silo is finished starting

Are these grains unusable individually unless they all are activated? The general model in Orleans is that grains are mostly independent from each other and have their own activation-deactivation lifecycles. If all these 30K grains need to be activated together, I wonder whether it makes sense to model them as individual grains rather than as a single grain or a small number of grains (for example, 100 grains each holding 300 items).
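To make the coarser-grained alternative concrete, here is a minimal sketch of assigning items to a fixed number of bucket grains. It is plain Python rather than Orleans code, and `BUCKET_COUNT`, `bucket_for`, `bucket_sizes`, and the integer item IDs are all assumptions made for illustration.

```python
from collections import Counter

BUCKET_COUNT = 100  # hypothetical: number of coarse-grained "bucket" grains


def bucket_for(item_id: int, bucket_count: int = BUCKET_COUNT) -> int:
    """Map an item ID to the bucket (grain key) that would own it."""
    return item_id % bucket_count


def bucket_sizes(item_ids, bucket_count: int = BUCKET_COUNT) -> Counter:
    """Count how many items land in each bucket."""
    return Counter(bucket_for(i, bucket_count) for i in item_ids)


# 30K contiguous integer IDs spread over 100 buckets.
sizes = bucket_sizes(range(30_000))
```

With contiguous integer IDs this yields 300 items per bucket. Keys that are not sequential integers would need a stable hash instead of the plain modulo used here.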

If you do need to model these items as 30K grains but with lower load on SQL, I suspect a variation of your method 3 can work. A local proxy grain could batch reads and cache the data in memory. You could eliminate staleness by invalidating records in the local cache before performing an update.
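A minimal sketch of that idea, assuming a batched loader and an in-memory dictionary as the cache. This is plain Python, not Orleans code: `ProxyCache`, `load_batch`, `update_item`, and the `database` dictionary standing in for SQL Server are hypothetical names introduced for the example.

```python
from typing import Callable, Dict, Iterable, List


class ProxyCache:
    """In-memory cache that loads records in batches and can drop entries."""

    def __init__(self, load_batch: Callable[[List[int]], Dict[int, dict]]):
        self._load_batch = load_batch  # hypothetical: one query for many IDs
        self._records: Dict[int, dict] = {}

    def warm(self, item_ids: Iterable[int]) -> None:
        """Fetch many records with a single batched call."""
        self._records.update(self._load_batch(list(item_ids)))

    def get(self, item_id: int) -> dict:
        """Return the cached record, reloading it if it was invalidated."""
        if item_id not in self._records:
            self._records.update(self._load_batch([item_id]))
        return self._records[item_id]

    def invalidate(self, item_id: int) -> None:
        """Drop a record so the next read goes back to the store."""
        self._records.pop(item_id, None)


# Stand-in for the legacy database; `calls` records each round trip.
database = {i: {"id": i, "qty": 10} for i in range(5)}
calls: List[List[int]] = []


def load_batch(ids: List[int]) -> Dict[int, dict]:
    calls.append(list(ids))
    return {i: dict(database[i]) for i in ids}


def update_item(cache: ProxyCache, item_id: int, qty: int) -> None:
    """Invalidate the cached record first, then write to the store."""
    cache.invalidate(item_id)
    database[item_id]["qty"] = qty


cache = ProxyCache(load_batch)
cache.warm(range(5))        # one batched read for all items
update_item(cache, 2, 99)   # invalidate, then write
fresh = cache.get(2)        # cache miss, so the record is reloaded
```

Here the five records cost one batched read, and the update costs one additional single-record read on the next access. The sketch does not guard against a read arriving between the invalidation and the write, which a real implementation would have to consider.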

That being said, more details about the specifics of the requirements should help with suggesting the right pattern.

kehigginbotham commented 7 years ago

Thanks for the response @sergeybykov. I will endeavor to be more specific.

The ~30K grains individually correspond to discrete domain models - specifically, each grain models an "inventory item" in our "inventory". Each "inventory item" is individually mutable, and they are purely independent of each other.

The reason we need to activate them all before our silo is finished starting is exclusively UX. When a user loads the front-end application (which in turn fetches the data from an Orleans client in Web API), they are presented with a list of said "inventory items" according to a "filter" they have control over. This "filter" may include all, none, or some of the "inventory items". After silo startup, if the "inventory items" are not initialized during bootstrap, the first user to load the front-end application is forced to wait ~20 seconds before any data appears in their list.

Method 3 is what we've gravitated toward, especially since the total number of "inventory items" can fluctuate wildly throughout the day. We also intend to spread the load across multiple re-entrant proxy grains, each responsible for a subset of "inventory". Our only concern with this method is that the "inventory items" can be modified at the database level by the legacy system, out of band with the new system (yes, it is as ugly as it sounds). We aim to partially address that issue by refreshing the proxy grain(s) every 5 seconds or so and updating the proxy-cached "inventory item" from SQL prior to any update in the new system.
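The staleness window this leaves can be illustrated with a small sketch. It is plain Python, not Orleans code, and it swaps the timer for a lazy age check on read so that the example stays deterministic; `RefreshingCache`, `load_all`, the injectable `clock`, and the `database` dictionary are hypothetical names.

```python
import time
from typing import Callable, Dict, List, Optional


class RefreshingCache:
    """Cache whose contents are reloaded once they reach a maximum age."""

    def __init__(self, load_all: Callable[[], Dict[int, dict]],
                 max_age_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._load_all = load_all      # hypothetical: one query for the subset
        self._max_age = max_age_seconds
        self._clock = clock
        self._records: Dict[int, dict] = {}
        self._loaded_at: Optional[float] = None  # None means "never loaded"

    def refresh(self) -> None:
        """Reload every record and remember when that happened."""
        self._records = self._load_all()
        self._loaded_at = self._clock()

    def get(self, item_id: int) -> dict:
        """Return a record, reloading first if the cache is too old."""
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._max_age:
            self.refresh()
        return self._records[item_id]


# Fake clock and stand-in database so the behaviour is repeatable.
now = [0.0]
database = {1: {"id": 1, "qty": 10}}
loads: List[float] = []


def load_all() -> Dict[int, dict]:
    loads.append(now[0])
    return {k: dict(v) for k, v in database.items()}


cache = RefreshingCache(load_all, max_age_seconds=5.0, clock=lambda: now[0])
first = cache.get(1)["qty"]   # initial load
database[1]["qty"] = 7        # out-of-band change by the legacy system
now[0] = 3.0
stale = cache.get(1)["qty"]   # cache is 3 s old, so the old value is served
now[0] = 5.0
fresh = cache.get(1)["qty"]   # age reached the limit, so the cache reloads
```

The read at 3 seconds still returns the old quantity, which is why a fresh read from SQL immediately before each update is needed in addition to the periodic refresh.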

Thanks again for the feedback. We appreciate any tips, recommendations, etc. that you all can offer. Hopefully this discourse will help others in the future.

sergeybykov commented 7 years ago

Thanks for the details. The background is more clear to me now. May I ask a stupid question though?

> When a user loads the front-end application (which in turn fetches the data from an Orleans client in Web API), they are presented with a list of said "inventory items" according to a "filter" they have control over. This "filter" may include all, none, or some of the "inventory items". After silo startup, if the "inventory items" are not initialized during bootstrap, the first user to load the front-end application is forced to wait ~20 seconds before any data appears in their list.

Does the UI immediately show data (or its derivatives) for all 30K inventory items, and no pagination is involved?

> Our only concern with this method is that the "inventory items" can be modified at the database level by the legacy system, out of band with the new system (yes, it is as ugly as it sounds). We aim to partially address that issue by refreshing the proxy grain(s) every 5 seconds or so and updating the proxy-cached "inventory item" from SQL prior to any update in the new system.

If the underlying state can be mutated externally and the max staleness you can afford is 5 seconds, then the situation indeed looks tough. I can only think of potential optimizations. For example, make the proxy grains stateless workers to eliminate cross-silo traffic to them, at the cost of duplicating the cached state in every silo (if memory and global consistency aren't concerns).

kehigginbotham commented 7 years ago

Our UX guy has been pre-loading all 30K inventory items so that filtering and similar operations do not introduce any visual latency. This solution obviously does not scale, and our team is working to gently prod our UX guy toward pagination. Even with pagination, we hope to minimize latency on the client by having the next set of requested inventory items already initialized (which they would be if all 30K are initialized).

Indeed, our situation is a tough one. We're currently exploring your stateless worker suggestion, but I fear global consistency will be a necessity. We have had some success in expanding our proxy grains to hold ~1K inventory items each with periodic polling. I'll report back periodically with our findings as we scale up our application.

Thanks!

sergeybykov commented 6 years ago

I'll close this for now, as I don't see anything for us to do here. Feel free to reopen if needed.

dotnet / orleans

Best practices for bootstrapping many grains #3462