Do you need EXACTLY 50K, or 50K on average?
I presume on average. Then the solution is very simple, and you don't need access to any Orleans internals like the ring.
1) On every silo, a bootstrap provider calls into X different locally placed grains (not stateless workers; regular grains with PreferLocalPlacement).
2) Each grain asks never to be deactivated, via DelayDeactivation:
http://dotnet.github.io/orleans/Advanced-Concepts/Activation-Garbage-Collection
3) Each grain starts a regular timer (not a reminder) that ticks every Y milliseconds. Inside the timer callback, do your operation. Use a random start time for the timers so the ticks are jittered (they don't all fire at the same time).
4) Figure out X and Y based on the number of silos and target ops (50K).
That is it. This is a standard, by-the-book Orleans pattern. Supported natively by the combination of the public mechanisms, no need for internals.
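A minimal sketch of steps 2-4 under a few assumptions: the IProbeWorkerGrain interface, its Start method and the probe body are hypothetical, and namespaces follow Orleans 1.x. The X/Y arithmetic is: ops per minute = #silos × X × (60 / Y seconds), so e.g. 200 silos, X = 5 grains per silo and a 50K/minute target give Y ≈ 1.2 seconds. The bootstrap provider from step 1 would call Start on X such grains per silo; that wiring is omitted.

```csharp
using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Placement;

// Hypothetical probe-worker grain; the interface, Start() and the probe body are illustrative.
public interface IProbeWorkerGrain : IGrainWithIntegerKey
{
    Task Start(TimeSpan tickPeriod);
}

[PreferLocalPlacement]
public class ProbeWorkerGrain : Grain, IProbeWorkerGrain
{
    private IDisposable timer;

    public Task Start(TimeSpan tickPeriod)
    {
        // Step 2: ask not to be deactivated (a long, finite delay).
        DelayDeactivation(TimeSpan.FromDays(365));

        // Step 3: a regular grain timer (not a reminder), with a random due time
        // so the X grains on each silo don't all tick at the same moment.
        var jitter = TimeSpan.FromMilliseconds(new Random().Next((int)tickPeriod.TotalMilliseconds));
        timer = RegisterTimer(Tick, null, jitter, tickPeriod);
        return Task.CompletedTask;
    }

    private Task Tick(object state)
    {
        // The O(1) probe (e.g. peek one event hub partition) goes here.
        return Task.CompletedTask;
    }
}
```

Keeping X fixed and tuning only Y later (as discussed below) keeps the grain count per silo stable.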
If you also want to support a changing number of silos and/or a changing number of target ops (which is a more advanced scenario than you originally described), it can still be done with the public abstractions, with a bit more complexity, but not much really. I can provide details on how, if you need to.
I do have a need for a varying number of active silos (elasticity), and the number 50K is not an exact one; it's based on user requests, though it should only change on an hourly basis, and that's if our service is as successful as it could be. 50K is an approximation of the maximum number of event hub partitions that will be in the system for the time to come, and my design has to meet it to keep up with tough SLA requirements. Three notes:
I'd rather use [StatelessWorker] instead of [PreferLocalPlacement], because I don't mind having two silos probing the same event hub partitions for 30 seconds until the stateless workers re-balance the periodic jobs, and, correct me if I'm wrong, [StatelessWorker] placement is more efficient (e.g. it doesn't incur silo-to-silo communication) than [PreferLocalPlacement], which verifies the grain isn't already activated. The downside here is that the periodic jobs won't be distributed based on the consistent hash ring, but on a naive modulus, since via the management grain I can only access the list of active silos, not their projection onto the hash ring.
1) If you want to support a changing number of silos, you can indeed periodically (infrequently, every 5 minutes) ask the management grain for the number of silos and adjust your math for X and Y. Better to adjust only Y, the tick time, and not change the number of grains. Much easier (a sketch follows after these points).
2) If you want to change the target ops - just have another global rendezvous grain which you either periodically ask for the target ops, or subscribe to and let it notify all worker grains when the target ops changes. Can use a stream for that as well, but does not have to. Can also just write it in a known table in Azure Storage and periodically read it.
3) You don't need to worry about the cost of activating a PreferLocalPlacement grain vs. a StatelessWorker grain, since you are going to activate a small number of those (probably on the order of the number of cores per silo) ONLY ONCE, at startup, so the cost is not even a consideration. The benefits are tremendous in your case: a StatelessWorker is not individually addressable, you can't notify them, and it's hard to control the number of them in the silo. It is simply the wrong abstraction in your case.
4) I am of a strong opinion that you don't need the internal ring. Not only is it not needed in your case, it is actually the wrong primitive/abstraction for you. You can calculate X and Y EXACTLY, while with the ring you will only get an approximation.
Just keep it simple. No need to complicate.
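A hedged sketch of point 1): poll the management grain for the number of active silos and recompute Y, keeping X fixed. IManagementGrain and GetHosts are the public Orleans management API; the helper class and parameter names are illustrative.

```csharp
using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Runtime;

public static class TickPeriodCalculator
{
    // Recompute Y (the tick period) from the current number of active silos,
    // keeping X (grains per silo) fixed. Call this from an infrequent timer,
    // e.g. every 5 minutes, in each worker grain or in the bootstrap provider.
    public static async Task<TimeSpan> ComputeAsync(
        IGrainFactory grainFactory, int grainsPerSilo, int targetOpsPerMinute)
    {
        var mgmt = grainFactory.GetGrain<IManagementGrain>(0);
        var hosts = await mgmt.GetHosts(true);   // active silos only: SiloAddress -> SiloStatus
        int siloCount = hosts.Count;

        // targetOpsPerMinute = siloCount * grainsPerSilo * (60 / Y_seconds)
        double ySeconds = 60.0 * siloCount * grainsPerSilo / targetOpsPerMinute;
        return TimeSpan.FromSeconds(ySeconds);
    }
}
```

The target ops value could equally come from the rendezvous grain or Azure table of point 2) instead of being passed in as a constant.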
The only thing that I see that can be improved in this design is for us to support notifications from the management grain when a new silo is added or removed. That would save the cost of pinging it every 5 minutes. That is easy: we can have a global SMS stream that the management grain publishes to.
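And, if the notification route is preferred, a minimal sketch of the subscriber side, assuming a SimpleMessageStreamProvider registered as "SMS" and a well-known stream namespace; the grain class, stream id/namespace and the probe/timer details are illustrative.

```csharp
using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Streams;

public class NotifiedProbeWorkerGrain : Grain, IGrainWithIntegerKey
{
    private const int GrainsPerSilo = 5;          // X, kept fixed
    private const int TargetOpsPerMinute = 50000; // target ops
    private IDisposable probeTimer;

    public override async Task OnActivateAsync()
    {
        // Well-known stream that a publisher grain (one watching membership) writes to.
        var stream = GetStreamProvider("SMS")
            .GetStream<int>(Guid.Empty, "active-silo-count");

        // On every silo-count change, recompute Y and re-register the probe timer.
        await stream.SubscribeAsync((siloCount, token) =>
        {
            var period = TimeSpan.FromSeconds(
                60.0 * siloCount * GrainsPerSilo / TargetOpsPerMinute);
            probeTimer?.Dispose();
            probeTimer = RegisterTimer(_ => Probe(), null, period, period);
            return Task.CompletedTask;
        });

        await base.OnActivateAsync();
    }

    private Task Probe() => Task.CompletedTask;   // the O(1) partition peek goes here
}
```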
Again - you don't need "consistent distribution" (which is what I previously called a ring). You have a much better and exact option: X and Y based on the number of silos and the target ops. A consistent ring is needed in a different case, when the cost of transferring ownership of something is high and you want to minimize it in case of churn. In your case there is no ownership, just simple work distribution.
Oh, maybe what you meant in your problem is that you want to pin Azure Event Hubs to silos/grains? That is not what you wrote initially. You wrote just to do 50K ops. If you need to pin hubs to grains ("the cost of transferring ownership of something is high and you want to minimize it in case of churn"), then it's a different problem.
Gabi, if you have time, I'd appreciate talking over Skype or a WhatsApp call. My email is eldar1205@gmail.com, so contact me if you'd like to do that. If not, please read on.
Can't I use [StatelessWorker] instead of [PreferLocalPlacement] and not incur the extra overhead of ensuring a single activation? Thank you very much!
@Eldar1205
The periodic job is peeking an Azure event hub partition and checking its last message sequence number
So you're not interested in reading the event hub events, you just need to check the last message's sequence number per partition?
It's quite possible there is an Event Hub API that allows for this, but I am unaware of it. The only way I know to acquire an event hub partition's latest sequence number is to receive events from that partition.
Can you please elaborate on what you had in mind for partition management? How does the service know which partitions to read from? If new partitions can show up, something must be notifying the service of this, right?
What is reading from these hubs? If there is something reading from these hubs that can periodically notify your cluster of a partition's sequence number, that would simplify your service greatly.
How much activity do you expect on these partitions?
Each silo runs an always-active stateless worker that every 30 seconds asks...
I like this idea as well, but I'd modify it a bit. If there is something in the system that knows which partitions need to be checked (call it a partition tracker), each silo can have a set of stateless workers (10? 20?) which periodically wake up and request buckets of partitions to check from the partition tracker. These grains request a bucket, check its partitions, then repeat until the partition tracker returns no more partitions. The partition tracker serves buckets of partitions to be checked, then moves those buckets to a list for the next check period. Once all of its partitions have been checked, it returns nothing until it's time for the next window. If we're worried about the tracker dying, we can distribute the partitions among a number of trackers, and workers can round-robin between them until none have work. Basically we don't divide up the work; we just periodically spin up workers and let them consume work until there is no more work to do. (A sketch follows after the walkthroughs below.)
200 silos, 50k partitions, 25 workers per silo, bucket size of 5. Once a minute:
5000 grains wake up.
5000 grains request buckets consisting of a total of 25k partitions; each grain checks 5 partitions.
5000 grains request buckets consisting of another 25k partitions; each grain checks 5 partitions.
5000 grains request buckets and get no more partitions.
5000 grains go to sleep.
200 silos, 50k partitions, 25 workers per silo, bucket size of 5. 10 silos die, 10k more partitions show up: 190 silos, 60k partitions. Once a minute:
4750 grains wake up.
4750 grains request buckets consisting of a total of ~23k partitions; each grain checks 5 partitions.
4750 grains request buckets consisting of another ~23k partitions; each grain checks 5 partitions.
2800 grains request buckets consisting of a total of ~14k partitions, the rest get nothing; 2800 grains check 5 partitions, the rest go to sleep.
2800 grains request buckets and get no more partitions.
2800 grains go to sleep.
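A rough sketch of the tracker/worker shape walked through above; the interface names are hypothetical, and the window-reset timer, the real partition list, the actual sequence-number check and multi-tracker round-robin are omitted.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Orleans;
using Orleans.Concurrency;

// Tracker: hands out buckets of partitions until the current window is exhausted.
public interface IPartitionTrackerGrain : IGrainWithIntegerKey
{
    Task<List<string>> RequestBucket(int bucketSize);
}

// Worker: each silo periodically wakes a set of these and lets them eat.
public interface IPartitionCheckerGrain : IGrainWithIntegerKey
{
    Task RunOnce(int bucketSize);
}

public class PartitionTrackerGrain : Grain, IPartitionTrackerGrain
{
    private readonly Queue<string> pending = new Queue<string>();  // partitions left this window
    private readonly List<string> served = new List<string>();     // already handed out

    public Task<List<string>> RequestBucket(int bucketSize)
    {
        var bucket = new List<string>();
        while (bucket.Count < bucketSize && pending.Count > 0)
        {
            var partition = pending.Dequeue();
            bucket.Add(partition);
            served.Add(partition);
        }
        // An empty bucket tells callers to go back to sleep. A timer (not shown)
        // moves `served` back into `pending` at the start of each check window.
        return Task.FromResult(bucket);
    }
}

[StatelessWorker]
public class PartitionCheckerGrain : Grain, IPartitionCheckerGrain
{
    public async Task RunOnce(int bucketSize)
    {
        var tracker = GrainFactory.GetGrain<IPartitionTrackerGrain>(0);
        List<string> bucket;
        while ((bucket = await tracker.RequestBucket(bucketSize)).Count > 0)
        {
            foreach (var partition in bucket)
            {
                await CheckLastSequenceNumber(partition);
            }
        }
    }

    // Hypothetical O(1) peek of an event hub partition's last sequence number.
    private Task CheckLastSequenceNumber(string partition) => Task.CompletedTask;
}
```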
There is an API to probe an event hub partition for new messages.
Nifty! Thanks!
I planned to have an Orleans client ask a grain to create an event hub... I expect the event hub partitions to be silent most of the time, with bursts of hundreds of messages per second every now and then
Please understand that I've been working with Event Hub since before its release, and its performance characteristics have improved over time, so some of what I've experienced may no longer be the case. Having said that, from my experience, Event Hub behaves quite poorly for sparse data flows. Much of its performance comes from bulk operations, and without a regular data flow I've seen very poor performance, as well as frequent timeouts and other unexpected behaviors. Is there a reason you need a single customer per hub/partition? Event Hub has very high throughput, so at hundreds of events per second per user you should be able to load 30-100 users onto a single partition, and many partitions per hub, depending on message size of course. At 50 users per partition, that would mean you'd only need 1k partitions.
Also, pre-allocating a set of hubs and partitions in advance, rather than creating them on the fly, would allow you to use the streaming infrastructure and avoid partition checks altogether: agents read as long as there is data, and when there is not, they delay checking for more data for a configured period of time. This means you can configure them to delay ~30 seconds whenever they read no data, which is, as I understand it, the behavior you want, correct? See GetQueueMsgsTimerPeriod in PersistentStreamProviderConfig.
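For reference, a tiny sketch of that knob. PersistentStreamProviderConfig and GetQueueMsgsTimerPeriod are the names mentioned above; the namespace, the object-initializer settability and how the config is handed to a concrete stream provider are assumptions that vary by Orleans version.

```csharp
using System;
using Orleans.Streams;   // assumed namespace; may differ across 1.x versions

public static class StreamProbeSettings
{
    // If a pulling agent reads no events from its queue, it waits this long
    // before checking that queue again (~30 seconds here).
    public static PersistentStreamProviderConfig Build()
    {
        return new PersistentStreamProviderConfig
        {
            GetQueueMsgsTimerPeriod = TimeSpan.FromSeconds(30)
        };
    }
}
```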
- Your buckets idea is ...
Distributing data using silo availability creates a dependency on that information, and will only be as reliable as that information. In distributed systems, dumber algorithms that require less information tend to be more reliable. Stupid code breaks less often than smart code. :) The advantage of this type of 'hungry hippo' algorithm is that it requires no coordination. Each grain just eats as long as there is food, sleeps, then does it again.
Orleans is simply awesome! I wish the reminders' true performance characteristics were documented, since they are much more useful than the documentation suggests.
So just to close on my answer from before: it appears that what you are actually asking is about resource allocation/job assignment, and not about work distribution (as it appeared initially in your question). The difference is that in job assignment you need to assign a certain job to a certain worker. Same with resource allocation - it is important where and to whom you allocate the resource. In your case the "resource" you are allocating is "responsibility to pull from event hub i".
As opposed to that, (simple) work distribution is when you have N identical jobs and you just need to do them in parallel and scale up their execution, without any restriction of pinning or assignment, or the allocation being "sticky".
My answer was how to do work distribution, as that is what your initial question implied you were asking about. My answer does not hold for job assignment.
My apologies for not explaining myself better, but I am indeed talking about work distribution. Probing the event hub partitions is simple work distribution: ~50K identical jobs, parallel, scaled, no restrictions, as you said. Once a partition is detected to have messages, it should be consumed using a standard single-activation grain, to benefit from Orleans' interactive workload handling. Your answer did apply to work distribution, but you suggested reminders with a 5 minute period, which means it takes up to 6 minutes to recover from a silo failure, and that isn't acceptable under the imposed SLA; the recovery time should be 60-90 seconds. By reading the reminders service code I deduced it can be used with sub-minute periods and still give all the benefits, as long as the probe grains are stateless workers that are always activated locally by the reminder. That way, by relying on the reminders' balancing and their efficient, reliable implementation, I can reliably balance my probe grains as required.
I did not suggest to use reminders. There is no need for them at all, in this simple pattern.
I saw you suggested reminders with a 5 minute period. To my knowledge that's the Orleans feature for running periodic work on the cluster that survives silo failures.
Scenario: In my would-be Orleans-based Azure cloud service, there would be a worker role with 200 instances that should execute 50K periodic jobs every 1 minute. The periodic job is peeking an Azure event hub partition and checking its last message sequence number - a simple O(1) operation in terms of CPU, memory & networking, and one that can safely run occasionally on two silos at once. There are 50K of those partitions, most of the time silent without new messages.
Thoughts about solution: Reminders can't be used for that, since a 1 minute interval is too short for them, and it's very important that the job executes every 1 minute even if some silos are shut down, which means internal timers alone won't help either. Since those jobs are O(1), the optimal solution would be to find a way to distribute them between all active silos evenly. Orleans does that internally for its own jobs (e.g. reading from stream queues, executing reminders) using consistent hashing, which adapts to silos joining/leaving the cluster. Is there a way to access this functionality? I'd like something like a stateless worker per silo that every 1 minute "queries" the current silo for which 50K/N jobs it should perform, where N is the number of currently active silos, and asks local placement grains to do those jobs, without incurring any additional messaging between different silos, relying on what Orleans already does for me. Think of it as "querying" the consistent hash ring maintained by all active silos, periodically adjusted for the currently active silos at sub-minute intervals. Since it's OK to peek an event hub partition from two silos at once, there are no worries about the hash rings not being consistent between all active silos, as long as everything is eventually consistent and work is eventually distributed evenly.
Notes: