[Question] Recommendation for long running service-style task

Here are some thoughts:

Use Fenzo's tiered queues to define priority tiers for service versus batch tasks. We created two tiers: one for "critical" tasks that need to be launched right away (most service-style tasks fit into this tier), and one for "flex" tasks that can tolerate some delay before they are launched. Fenzo assigns resources to tasks in the critical tier before considering assignments for tasks in the lower tier, flex. Note that tiers in Fenzo are numbered 0 to N-1 for N tiers. For us, critical is tier 0 and flex is tier 1.
This does not, however, prevent the cluster from being saturated with lower-tier tasks. In that case, a new task in the higher tier has to wait until resources become available as running tasks complete (e.g., batch tasks eventually complete).
In the future, we will be introducing preemptions to ensure that the higher tier tasks can get resources immediately by terminating some lower tier tasks.
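
To make the tier ordering and the saturation case concrete, here is a minimal Java sketch. It is not Fenzo's API: the `Task` class, the `assign` method, and the single pool of free CPUs are all hypothetical names and simplifications chosen for illustration. Fenzo's real scheduler also deals with constraints, fitness, and multiple resource types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TierSketch {
    // Hypothetical task model (not a Fenzo class): tier 0 = critical, tier 1 = flex.
    static final class Task {
        final String id;
        final int tier;
        final int cpus;

        Task(String id, int tier, int cpus) {
            this.id = id;
            this.tier = tier;
            this.cpus = cpus;
        }
    }

    // Greedy assignment: tasks in lower-numbered tiers are considered first.
    // Returns the ids of the tasks that fit into the free CPUs, in launch order.
    // There is no preemption: tasks that do not fit simply stay pending.
    static List<String> assign(List<Task> pending, int freeCpus) {
        List<Task> ordered = new ArrayList<>(pending);
        // List.sort is stable, so submission order is kept within a tier.
        ordered.sort(Comparator.comparingInt((Task t) -> t.tier));
        List<String> launched = new ArrayList<>();
        int remaining = freeCpus;
        for (Task t : ordered) {
            if (t.cpus <= remaining) {
                remaining -= t.cpus;
                launched.add(t.id);
            }
        }
        return launched;
    }

    public static void main(String[] args) {
        List<Task> pending = Arrays.asList(
                new Task("batch-1", 1, 2),
                new Task("service-1", 0, 3),
                new Task("batch-2", 1, 2));
        // With 5 free CPUs the critical task is placed first, then one flex task.
        System.out.println(assign(pending, 5)); // prints [service-1, batch-1]
        // With a saturated cluster (0 free CPUs) even the critical task waits.
        System.out.println(assign(pending, 0)); // prints []
    }
}
```

The second call shows the gap that tiers alone leave open: ordering only helps when there are free resources to hand out.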
Currently, we guarantee resources for each tier by giving each tier its own set of agents. So, say, one set of agents is "earmarked" for tier 0 and a different set of agents is earmarked for tier 1. We do this by setting a Constraint that ensures tasks of a given tier go to that tier's agents. This also allows us to ensure there is sufficient capacity for each tier separately.
We create the separate sets of agents using autoscale groups in Fenzo. See the AutoScaleByAttribute settings in the TaskScheduler's Builder class.
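
The earmarking idea can be sketched in a few lines of Java. Again, this is an illustration under assumptions, not Fenzo code: the `Agent` class, the `tier` attribute name, and the `tierConstraint` and `eligibleHosts` methods are hypothetical. In Fenzo itself the check would live in a constraint attached to the task, and the agent attribute would come from how your agents are configured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class EarmarkSketch {
    // Hypothetical agent model: a host name plus string attributes.
    static final class Agent {
        final String host;
        final Map<String, String> attributes;

        Agent(String host, Map<String, String> attributes) {
            this.host = host;
            this.attributes = attributes;
        }
    }

    // Hard constraint: a task of a given tier may only land on agents whose
    // (assumed) "tier" attribute matches that tier.
    static boolean tierConstraint(int taskTier, Agent agent) {
        return String.valueOf(taskTier).equals(agent.attributes.get("tier"));
    }

    // Returns the hosts that a task of the given tier is allowed to use.
    static List<String> eligibleHosts(int taskTier, List<Agent> agents) {
        List<String> hosts = new ArrayList<>();
        for (Agent a : agents) {
            if (tierConstraint(taskTier, a)) {
                hosts.add(a.host);
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        List<Agent> agents = Arrays.asList(
                new Agent("critical-a", Collections.singletonMap("tier", "0")),
                new Agent("critical-b", Collections.singletonMap("tier", "0")),
                new Agent("flex-a", Collections.singletonMap("tier", "1")));
        System.out.println(eligibleHosts(0, agents)); // prints [critical-a, critical-b]
        System.out.println(eligibleHosts(1, agents)); // prints [flex-a]
    }
}
```

Because flex tasks can never land on the tier 0 agents, a burst of batch work cannot consume the capacity set aside for services, and each set of agents can be sized on its own.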

I spoke about capacity guarantees recently at QCon San Francisco. The slides are available here. The video should be available later from QCon.
