andrewvc closed this issue 2 years ago
Paul had asked a question elsewhere:
Should we expose them all then in the UI (browser.limit, tcp.limit, http.limit, icmp.limit, and heartbeat.scheduler), or is that too confusing?
I personally like the granularity, as it makes sense to define more concurrent lightweight checks (although they're individually defined) than browser checks, but this does add UI/UX complexity.
I'm +1 on exposing this all separately
Can I just check the behaviour, based on these two statements:
Agents within the policy, with this integration, would share load, splitting work between them.
and
It does not factor in load balancing
Does this mean that the user has no control over how the load balancing happens, but load balancing does happen (i.e. the monitors are shared across all agents in the policy)? If so, how is this done: half and half, or some other kind of sharding?
@andrewvc has confirmed that with the approach (of passing the policy through to the Agent when the monitors are changed in Uptime), the policy will be applied to any/all agents assigned to that policy, so each would duplicate the workload (i.e. there is no sharing or sharding of the workload, all agents will run all monitors).
In reality, the expectation is that there should be a single Agent configured against the policy to prevent this double testing.
As part of this, the plan is to deprecate the current Synthetics Integration, so we only have one.
Question - what do we do about existing integrations that are set up when we do this (and upgrade the stack)? Are we expecting a migration (to the saved object version)? Assume we'd need the new integration to be installed by default to handle the index patterns (which are currently managed by the existing/legacy Synthetics integration)?
/cc @dominiqueclarke @andrewvc @drewpost
Thanks Paul. Even though the product is in beta, I think that providing an automated migration path for users of the soon-to-be-deprecated version of the synthetics integration is the right thing to do.
In attendance: @andrewvc @paulb-elastic @joshdover @mostlyjason @dominiqueclarke
Synthetics to move forward with the above plan, with the caveat that users of Synthetics Node will require Fleet ALL permissions in the first iteration.
With regards to future iterations, @joshdover to perform a POC for RBAC evolution. This POC was originally focused on the Endpoint use case, but will now include the Synthetics use case as well. POC will explore allowing admins to create users with a "Manage Custom Synthetics Testing Nodes" permission that will provide read and write permissions only for the Elastic Synthetics Node Integration. As a stretch, the POC may also explore auto orchestrating that permission when Uptime ALL is selected. https://github.com/elastic/obs-dc-team/issues/731
@joshdover Also to explore the Endpoint solution which leverages fetching artifacts from Kibana in Fleet Server before packaging the configuration to be sent to Elastic Agent. Josh will help us explore if this implementation could work well for our use case.
Fleet RBAC design docs updated with Uptime use case discussed during the meeting
Discussed with @andrewvc: a `Location Id` field should be added to the integration form to mimic the service location's `id`.
Edit: The requirement is to provide users with the possibility to define the `id` for a custom location, so that it can be specified during the push command or from the CLI, instead of the location's name. The same logic already applies to service locations.
Custom location ids should be prefixed somehow so that they cannot clash with service location ids, e.g. `us-central` vs `US Central Location`.
I've added a list of ACs including what you've mentioned to the top level description @emilioalvap
Also to explore the Endpoint solution which leverages fetching artifacts from Kibana in Fleet Server before packaging the configuration to be sent to Elastic Agent. Josh will help us explore if this implementation could work well for our use case.
Did some digging here and it seems like artifacts as they exist today aren't going to solve many problems here. Specifically, they won't allow a user without access to edit package policies to ship new monitors to custom locations running on Elastic Agent, because they do require a policy change to instruct an agent to download an updated artifact.
We'll need to explore other options to support this use case. Some rough ideas:
- Change how artifacts work to enable artifact updates to be pushed to agents without requiring users to have access to make direct edits to the package policies. Grant uptime users access to monitor artifacts.
- Endpoint (the only current user of artifacts today) may have a similar requirement. We should discuss this more in detail. Let me know who would like to join the next Fleet RBAC WG meeting next week so we can discuss this.
- Introduce an approval workflow that requires users with access to Synthetics package policies to approve monitor updates before deploying them to custom locations
- Make an exception in the RBAC model and allow Synthetics to make updates to package policies even if the end user doesn't have access
Hi @joshdover. Thanks for the updates regarding the use of artifacts. With regards to the other options we have to support this use case, is a finer tuned RBAC model that allows for users to have write access for specific packages only still an option? I see that some further investigation has been done for https://github.com/elastic/obs-dc-team/issues/731.
is a finer tuned RBAC model that allows for users to have write access for specific packages only still an option?
Yes that's definitely an option, but note that we're planning to implement that via Spaces authorization to the parent Agent policy. So if you wanted a user to have access to edit any monitors that should be run in custom locations, they'd also need to be granted Fleet access to all agent policies for a given Space.
@dominiqueclarke @joshdover I think that's reasonable for an initial approach in that we can document that. It also makes sense in that you might want a custom monitor location only available within one space, but not another.
I've added to the ACs:
Remove the restriction that Manage Monitors only appears for Cloud installations
As in 8.2, Manage monitors will only appear for Cloud installations
I think that's reasonable for an initial approach in that we can document that. It also makes sense in that you might want a custom monitor location only available within one space, but not another.
Sounds good. In that case, I think we can table the artifacts discussion for later and address the Synthetics use case when the time comes. I don't see any reason we couldn't provide a mechanism of some sort that supports the type of authz you all need, whether that's artifacts or something else.
I've added a new approach here that gives us load balancing after speaking with @joshdover and @ph today. It's under the section "Fleet / Agent Interaction - Take Two". Please take a look all, it's quite different than the original approach, but should not be much more work.
I've added an AC that Location ID should require a prefix of `custom-` to ensure it does not clash with hosted IDs.
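For illustration, a prefix check along these lines could be enforced when the integration policy is saved. This is just a sketch; the function name, error messages, and exact character rules are my assumptions, not the actual Kibana code:

```ts
// Illustrative only: a possible validation for custom location IDs, assuming
// the agreed "custom-" prefix rule. Function name and messages are hypothetical.
const CUSTOM_LOCATION_PREFIX = 'custom-';

interface LocationIdCheck {
  valid: boolean;
  reason?: string;
}

function validateCustomLocationId(id: string): LocationIdCheck {
  if (!id.startsWith(CUSTOM_LOCATION_PREFIX)) {
    return { valid: false, reason: `id must start with "${CUSTOM_LOCATION_PREFIX}"` };
  }
  // Keep IDs machine-friendly so they can be used from the CLI / push command.
  if (!/^custom-[a-z0-9][a-z0-9_-]*$/.test(id)) {
    return { valid: false, reason: 'id may only contain lowercase letters, digits, "-" and "_"' };
  }
  return { valid: true };
}

// e.g. "custom-us-central" passes; "US Central Location" stays a display name, not an id.
```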
Is there the option to also allow tagging of the agent nodes to specify/filter on capabilities of the specific node?
e.g. tagging with 'saml' if the agent is deployed on a windows endpoint that can support synthetics for websites that require SAML authentication vs a linux endpoint that doesn't support SAML.
This could also assist with delineating agent endpoints that are deployed to different secure zones within a private network without having to potentially leak details via the location name.
A couple of notes on the action-based load balancing approach being discussed:
- The `.fleet-actions` or `.fleet-action-results` indices filling up with too many events after a long period of time. I'm not sure we have another actions use case that would run as frequently as Uptime's need here and want to make sure we're ready to handle the scale.

I've tried to write up more complete thoughts on this before I go on vacation. I hope this is complete enough. @joshdover if you jump to the "Messaging overhead and perf concerns" section you'll find some back-of-the-envelope math that may be helpful, but I'd love your thoughts on this full comment. In short, I'm not too worried about perf. Also, ++ on not defining an API for now; I think we may want to use bulk APIs, and possibly `update_by_query`.
There are a lot of ways we could distribute work to agents. In short, we can build something simple, but less scalable, or complex but more scalable. Not much of a surprise there.
I've got ideas here in two broad categories:
There is a fundamental design tension here around reducing hotspots and balancing work, both across agents and across time, as well as messaging overhead.
The synthetics service opts to reduce messaging/scheduling overhead for lightweight monitors (because the unit of work is so small that spending more effort to coordinate them optimally makes less sense), and to spend more effort on messaging for heavier browser monitors.
Let's examine solutions given that:
The heartbeat scheduler algorithm today is not perfect; please keep in mind that some of the ideas mentioned here are actually superior to that algorithm.
The algorithm today is quite simple: heartbeat starts up, reads its list of monitors, and schedules all monitors to run immediately. It makes no attempt to spread out execution or stagger jobs. Some of the proposals that follow make some provision to improve that, though this is not required for an initial implementation.
Heartbeat does let you cap maximum concurrent jobs by type, but it doesn't stop you from overprovisioning monitors at all. It will do its best to run them, and if you overprovision it, you will see missed monitors etc. This was less apparent with lightweight monitors since they were so hard to overprovision, but is more apparent with browser-based ones.
Imagine we have 3 different jobs, each set to run every 5 minutes, and two agents with two execution slots each. A visualization of their execution might look like the following given a naive approach where we:
This is essentially what heartbeat does today. If a slot is overfilled, we just try to run it ASAP. I should mention that the heartbeat scheduler will run jobs late if execution capacity is met, effectively shifting them forward, helping spread jobs around, which is not captured here.
This works fine so long as we have more slots than jobs. However, it leaves the workers temporally underutilized, in particular if they run on different schedules. This scheme works out to one slot per monitor.
In this approach we would:
This banks on the observed phenomenon that users typically only have 2-3 unique schedules in their infrastructure. It does have drawbacks however in that:
Since some jobs are offset temporally, you face the problem of what to do when a user creates a new job that runs, say, every 15m: when's the first run? In this case it would make sense to send a 'run once' message to run it right away, but after that run it on schedule, which should be within 15m.
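As a rough illustration of the "run once now, then fall into the schedule" idea, here's a sketch; the deterministic hash-based slot offset is purely my assumption about how offsets could be assigned, not part of the proposal:

```ts
// Sketch only: the hash-based slot offset is an assumption about how offsets
// could be assigned, not part of the proposal.
function firstTwoRuns(monitorId: string, intervalSec: number, createdAt: Date): Date[] {
  // Deterministically spread monitors across the interval.
  let hash = 0;
  for (const ch of monitorId) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  const offsetMs = (hash % intervalSec) * 1000;
  const intervalMs = intervalSec * 1000;

  const created = createdAt.getTime();
  // Run once immediately so the user gets a result right away...
  const immediate = new Date(created);
  // ...then align subsequent runs to the shared slot. The aligned run is at most
  // one interval after creation, satisfying the "within 15m" expectation above.
  const nextAligned = Math.ceil((created - offsetMs) / intervalMs) * intervalMs + offsetMs;
  return [immediate, new Date(nextAligned)];
}
```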
In this approach we let kibana coordinate all the work, and send a message for each invocation of the job. This makes more sense for heavier browser jobs that vary quite a bit in execution time. This approach dynamically responds to the available capacity of the agents since it tries to use whatever resources are free at a given time.
This works by running a periodic kibana background job that, on each invocation does the following:
A pseudocode implementation is shown below:
sample_monitor_saved_object = {
  id: "my-monitor",
  schedule: "@every 60s",
  dispatch_next_at: "2022-04-07T04:00:01", # Exact time this monitor should next run
  dispatched_at: "2022-04-07T03:55:01",    # Set to the dispatch time while the monitor is executing, nil otherwise
  dispatched_to_agent: "my-agent-id",      # ID of the agent this monitor is running on
  # ... other monitor fields
}
# Two jobs, one that dispatches stuff, one that reads responses
schedule_kibana_background_job(&:dispatch_jobs, 30.seconds)
schedule_kibana_background_job(&:read_responses, 1.minute)
# Tracks how much usage each agent is seeing
# This is modeled as an implementation non-specific class, but it could be either
# in memory or backed by an ES document
agent_utilization = AgentUtilizationTracker.new
# Dispatches all jobs that are due to be scheduled to the relevant agents
def dispatch_jobs()
  monitors = saved_objects.
    get_all_uptime_monitors().
    # Filter out monitors that are already running. We could do more involved error
    # handling here in the final product; we should probably forcibly terminate monitors
    # running longer than their schedule, and log this if it does happen.
    filter {|m| m.dispatched_at == nil}.
    filter {|m| m.dispatch_next_at <= now()} # Just get the ones that need to run
locations_monitors = monitors.group_by(&:location_name)
locations_monitors.each do |location, monitors|
actions = [] # Actions to be performed in this location
synth_node_integration = fleet.get_integration_for_location(location)
agents = synth_node_integration.fleet_policy.agents.filter(&:status_up)
# Send a message to all agents in this location, blocking while waiting for their responses to see their current utilization
agent_utilization = get_agent_utilization(agents)
monitors.each do |monitor|
# Get the least busy agent and increment its utilization by 1
agent = agent_utilization.get_least_busy_and_increment()
# Could also just be one action for all the grouped monitors, not sure if the tradeoff for fewer bigger docs
# makes sense, but it might
# Note that we set 'overscheduled' to true if we've been forced to send more monitors to this agent
# than it has capacity for, in which case the agent will queue this monitor and run it a little late
actions.push({action: "run_monitor", agent: agent, monitor: monitor, overscheduled: agent.utilization_exceeded ? true : false })
end
# Use the bulk index API to send all the actions for one location at once for efficiency
es.bulk_index_actions(actions)
# Run a bulk update_by_query on saved objects to set dispatched_at, dispatch_next_at and dispatched_to_agent
# If utilization_exceeded, the agent was overscheduled and will queue the monitor locally (see the 'overscheduled' flag above)
es.bulk_update_next_runs(actions)
end
end
# Reads responses from the dispatched agent jobs and handles them (mostly error logging, see below)
def read_responses
responses = es.read_get_and_delete_all_responses()
responses.each do |r|
# I'm actually not sure what to do with responses. On the happy path we can just throw them away
# If the monitor encounters an error executing that's already handled
# The only real use for these is errors maybe within fleet, or if heartbeat can't even create the
# monitor. In that case I'd probably just log them since they should be extremely rare.
end
end
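For illustration, the `es.bulk_update_next_runs(actions)` step above could be expressed as a single `_update_by_query` request. This is only a sketch: the index name, field names, and the idea of hitting the raw REST API (rather than the saved objects client) are assumptions, not a decided implementation.

```ts
// Illustrative sketch of es.bulk_update_next_runs: mark a batch of monitors as
// dispatched in one _update_by_query call. Index name, field names, and the
// assumption that monitor state is directly updatable like this are mine.
interface Dispatch {
  monitorDocId: string;
  agentId: string;
}

async function markDispatched(esUrl: string, apiKey: string, dispatches: Dispatch[]) {
  const assignments: Record<string, string> = {};
  for (const d of dispatches) assignments[d.monitorDocId] = d.agentId;

  await fetch(`${esUrl}/synthetics-monitor-configs/_update_by_query?conflicts=proceed`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Authorization: `ApiKey ${apiKey}` },
    body: JSON.stringify({
      query: { ids: { values: Object.keys(assignments) } },
      script: {
        lang: 'painless',
        source:
          'ctx._source.dispatched_at = params.now;' +
          'ctx._source.dispatched_to_agent = params.assignments[ctx._id];',
        params: { now: new Date().toISOString(), assignments },
      },
    }),
  });
}
```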
This work queue requires one message, or ES document, per job invocation, possibly two if you include the response (though I think we could skip responses for job invocation messages). Let's remember that heartbeat still needs to write the results to the cluster too, a minimum of one doc (usually a much larger doc than the job invocation).
For browser-based jobs the messaging overhead is by definition negligible, dwarfed by the multitude of results each browser job stores in ES. For lightweight jobs the ratio is different: in the worst case this requires a doubling of write capacity, perhaps a tripling if you count deleting the messages. Storage capacity should be unaffected since these fleet actions are not retained.
Let's take the example of a very large scale user with, say, 15000 lightweight jobs and 5000 browser jobs all scheduled to run every 5m; this works out to roughly 67 jobs/second. Regardless of scheduling they need a cluster that can handle at least 50 lightweight result docs per second, plus roughly 17 browser runs per second at, say, ~300 docs per browser run, i.e. ~5000 browser docs per second. In this scenario their baseline is around 5050 docs/second. The overhead of messaging is maybe only ~200 docs/s extra. They already need to have capacity to account for quite a bit of throughput here; the additional requirements are a rounding error.
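The same back-of-the-envelope numbers, written out so they're easy to re-check (the ~3 extra docs per run for messaging is a worst-case assumption):

```ts
// Back-of-the-envelope throughput for the example above.
const lightweightMonitors = 15_000;
const browserMonitors = 5_000;
const intervalSec = 5 * 60; // every 5 minutes

const lightweightRunsPerSec = lightweightMonitors / intervalSec; // 50
const browserRunsPerSec = browserMonitors / intervalSec;         // ~16.7
const docsPerBrowserRun = 300;                                   // rough figure from the text

const baselineDocsPerSec =
  lightweightRunsPerSec + browserRunsPerSec * docsPerBrowserRun; // ~5,050 docs/s

// Worst case: ~3 extra docs per run for messaging (action, response, delete).
const messagingDocsPerSec = (lightweightRunsPerSec + browserRunsPerSec) * 3; // ~200 docs/s

console.log({ baselineDocsPerSec, messagingDocsPerSec });
```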
There may be more concern about kibana perf here, specifically around deserializing monitors, especially encrypted fields. However, we should benchmark that, and see what our true throughput is here. If it's a problem we can likely find a way to only send the full config when seeing an agent for the first time.
For a first pass I suggest we use the work queue. Yes, it introduces messaging overhead for lightweight monitors, but it solves the more serious problem of browser monitor scheduling. Let's remember, most users have double digit numbers of monitors, around 40 or so, where overhead is small. For larger customers I would wager this will still work for most common scales of operation, especially given that users usually mix lightweight and browser monitors, and a single browser monitor is a large multiple of a single lightweight monitor.
Finally, we could look into fancier ways to pre-plan schedules for the lightweight monitors, like the segmented slots proposal or even fancier things, but only if it actually is needed. These methods would severely reduce messaging overhead, at the cost of more code complexity.
Thanks for the detailed writeup @andrewvc. I suspect the message overhead of the basic work queue approach to be negligible as well. Likely the most valuable benefit of the other approaches is that it removes Kibana (and Task Manager) from the critical path of executing monitors. When Kibana is down or Task Manager / Alerting is being heavily used, the task running to schedule Uptime monitors will be backed up as well, which will delay running monitors.
This concern is slightly 'mitigated' by the fact that if alerting is slowed down, users aren't going to receive alerts about their down monitors, regardless of whether or not Uptime monitors are executing fast enough.
Another way to solve this could be to introduce a concept of task priority to the task manager queue in Kibana, which has not yet been solved: https://github.com/elastic/kibana/issues/75041
@andrewvc thanks for such a detailed write-up! Before I go on holidays I thought I'd also outline one more approach which came to mind after our last discussion: create a pull-based mechanism from the Heartbeat side, rather than pushing work through Kibana.
In this comment, I'll explain how this approach would work, and why I think it might be a good one. Then, I'll relate this approach back to the constraints and ACs you outlined previously.
The pull-based mechanism would involve more substantial changes to Heartbeat than to Kibana.
It consists of Heartbeat instances pulling work from the Kibana/ES queue whenever they're free. Such an implementation turns Heartbeat into a "consumer" which actively looks for things to do, rather than being told what to do next.
The item(s) to be executed by Heartbeat are provided by a Kibana API which will always return the due jobs with the highest priority.
➡️ Note that the optimal number of jobs to pull per request is, theoretically, 1. That's because by pulling one job at a time we ensure that the next job will only be pulled as early as possible when an agent is available. Imagine, for example, that Agent A pulls two jobs X and Y. If Agent B becomes free while agent A is executing X, Y would be executed later than it could have been if it was still in the queue when Agent B became available. We can, however, reduce the number of requests if we decide that it's worth pulling more jobs than 1, especially if we can increase the number of parallel monitor runs.
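Purely as an illustration of the consumer shape (not a proposed API), a Heartbeat-side pull loop could look roughly like this; the endpoint path, query parameter, and response shape are all hypothetical:

```ts
// Hypothetical consumer loop; the pull endpoint, its parameters, and the
// response shape do not exist today and are shown only to illustrate the idea.
interface PulledJob {
  monitorId: string;
  config: unknown;
}

async function executeMonitor(job: PulledJob): Promise<void> {
  // Run the monitor through the normal heartbeat execution path, then
  // publish results to Elasticsearch as usual.
}

async function runConsumer(kibanaUrl: string, apiKey: string): Promise<void> {
  while (true) {
    const res = await fetch(`${kibanaUrl}/api/synthetics/jobs/pull?limit=1`, {
      method: 'POST',
      headers: { Authorization: `ApiKey ${apiKey}`, 'kbn-xsrf': 'true' },
    });
    const jobs: PulledJob[] = res.ok ? await res.json() : [];
    if (jobs.length === 0) {
      // Nothing due right now; back off briefly before polling again.
      await new Promise((resolve) => setTimeout(resolve, 5_000));
      continue;
    }
    for (const job of jobs) {
      await executeMonitor(job);
    }
  }
}
```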
To determine which monitors are due to run, we consider a monitor's "acceptable execution window", not its execution time. Whenever a monitor is pulled, its "acceptable execution window" grows by `<SCHEDULE>` seconds.
At the time Heartbeat instances ask for jobs, monitors whose "acceptable execution window" end is greater than the current time are considered "due to run".
Priorities are defined considering both a monitor's schedule and how long it takes to run.
We calculate priorities that way because it's better for an `@every 30m` monitor to be delayed 30s than for an `@every 1m` monitor to be delayed 30s. In other words, the cost of the delay is relative to the monitor's schedule.
For the sake of this example we'll calculate the cost of delay as `1 / schedule (seconds)`, which gives us the cost of delay per second.
Considering monitors with the shortest schedule are the most expensive ones to delay, in order to optimise value per second of execution we run the quickest monitors which have the highest cost of delay first.
As a visual representation of how WSJF reduces cost of delay (CoD) over time, I like this example from one of Donald Reinertsen's books.
Imagine, for example, that you have the following monitors:
- A: `@every 30s` (CoD/s: ~0.033) / Duration: 20s / WSJF = CoD/Duration ≈ 0.00167
- B: `@every 60s` (CoD/s: ~0.017) / Duration: 15s / WSJF = CoD/Duration ≈ 0.00111
- C: `@every 15s` (CoD/s: ~0.067) / Duration: 30s / WSJF = CoD/Duration ≈ 0.00222

In that case, the optimal priority order for these jobs is C, A, B.
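A small sketch that reproduces the ordering above (field names are illustrative, not from any existing code):

```ts
// Sketch of the WSJF-style prioritisation described above: cost of delay is
// 1 / schedule (per second), and priority = CoD / duration. Higher runs first.
interface MonitorStats {
  id: string;
  scheduleSec: number; // e.g. 30 for "@every 30s"
  durationSec: number; // observed from previous runs; unknown => highest priority
}

function wsjf(m: MonitorStats): number {
  const costOfDelayPerSec = 1 / m.scheduleSec;
  return costOfDelayPerSec / m.durationSec;
}

const monitors: MonitorStats[] = [
  { id: 'A', scheduleSec: 30, durationSec: 20 },
  { id: 'B', scheduleSec: 60, durationSec: 15 },
  { id: 'C', scheduleSec: 15, durationSec: 30 },
];

const order = [...monitors].sort((a, b) => wsjf(b) - wsjf(a)).map((m) => m.id);
console.log(order); // ["C", "A", "B"]
```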
Now, if you consider the fact that monitors are only considered due to run at the end of their "acceptable execution window", this mechanism simply ensures we make the most out of the available resources in terms of reducing the cost per delay of a monitor.
👉 Even if we have multiple monitors with the very same schedules, we only guarantee they'll run within the acceptable execution window, not when they'll be executed within that window.
👉 We know the `Duration` of a monitor because we run it immediately once it's created (monitors that have never run get the highest priority). Therefore, I'd recommend we always have at least 2 threads of execution to prevent a very long monitor from delaying others.
👉 Priority is calculated at the time of the request.
Now, revisiting the ACs, I think this approach fulfils all of them:
👉 A job will always run within its window of execution, no matter when. A job could run, for example, right at the end of its window and then right at the start of the next window. Running it twice within the same window would be incorrect IMO, so I consider this AC to be ✅ . Furthermore, with more than 1 execution thread we could easily even start the next run while the first one is still in progress.
This is not necessary for the scheduling algorithm, so I've marked it as yellow because I think it's not directly related, although I do think a "killswitch" at 15m would be useful. I don't think we should use the schedule interval though, because with multiple threads we could have monitors which take longer than their schedule interval and still always run within their acceptable execution windows.
Definitely a yes given this is pull-based, not push-based.
Any algorithm would only be able to achieve this item if capacity > workload. Assuming capacity matches the workload, this algorithm does run monitors consistently.
👉 Heartbeat pulls jobs rather than having Kibana push them.
Also worth noticing that:
Thanks for the thorough writeup @lucasfcosta ! I agree that there's a lot to be said for a pull vs. push architecture, and TBH I'd generally favor that myself. It's how most work queues are implemented.
The key reason for going with the push approach was to simplify network configuration: the push approach can work atop the existing fleet server comms channel. With a pull approach, users need to make sure heartbeat instances have access to Kibana, which they may not. It's another source of potential issues, network ACL configuration, etc.
It does make me wonder if we can work in reverse in fleet. @joshdover @ph , could heartbeat initiate comms with kibana rather than the other way around?
Thanks for the detailed overview @andrewvc! I've gone through the proposal a few times already and I just have a couple of points that I might not have picked up from your comment:
- With the `overprovisioned` flag, in a scenario where total monitor execution time > total CPU available for a given location, how are we going to handle snowballing the local queue? Should we flush it on every message received?

@emilioalvap what is the `overprovisioned` flag? I don't see that word elsewhere here, but I think I understand what you mean. At any rate, we could make sure the queue holds at most one message per monitor ID at a time, so if one message is missed, subsequent ones don't pile up. Since everything in ES is easily indexed this should be easy, and could possibly be done even more optimally by encoding the monitor ID deterministically into the doc ID.
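As a sketch of that dedupe idea only: derive the action document `_id` from the monitor ID and rely on create-if-absent semantics, so a second pending message for the same monitor is rejected rather than piling up. The index name and doc shape are assumptions, and in practice this would presumably go through Fleet's own APIs rather than raw index writes.

```ts
// Sketch only: index name and doc shape are assumptions. The deterministic _id
// plus create-if-absent means a second pending message for the same monitor
// conflicts (HTTP 409) instead of piling up; processed actions are assumed to
// be deleted before the next dispatch.
async function enqueueRunAction(
  esUrl: string,
  apiKey: string,
  monitorId: string,
  action: object
): Promise<boolean> {
  const docId = `run-${monitorId}`; // deterministic per monitor
  const res = await fetch(`${esUrl}/.fleet-actions/_create/${encodeURIComponent(docId)}`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json', Authorization: `ApiKey ${apiKey}` },
    body: JSON.stringify(action),
  });
  // 409 means a message for this monitor is already queued; drop this one.
  return res.status !== 409;
}
```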
WRT the second question about running the task on demand, I think it comes down to benchmarking how efficiently it can run. If we can run it every, say, 5s without any issue (since it'd usually be a noop) maybe we just do that? Otherwise, yes, having a way to just invoke it as a function makes sense, most likely with a global lock (using a doc as a lock).
@andrewvc Sorry about that, I meant `overscheduled`. You answered it anyway.
With this approach, will monitor configuration and source code still be added to fleet integration policy or sent to each individual agent on each iteration? Do these "action" documents need to be encrypted as well?
That's an excellent question about encryption. I'm not entirely sure what the best approach here is. Some thoughts:
We should look into what security guarantees there are on this data at the moment. I assume these docs are already highly restricted; I don't see us needing much more in terms of encryption than the existing saved objects have, and I'd expect those indices to already be highly restricted. @joshdover @ph thoughts?
We wouldn't want to share any secrets kibana uses, we may have to re-encrypt anyway to satisfy encryption for data at rest, but then that raises the question of where to store the key.
The most paranoid thing we could do here would be to use a public key from the intended heartbeat to encrypt stuff one way into the messages. I don't know if it'd be worth it to go even further and implement DH key exchange or one of its successors. It'd be interesting to see what the perf impact of this approach would be, since most crypto protocols use symmetric encryption for perf with an asymmetric key exchange.
Looking at these two options, approach B has the easiest to reason about security surface area. Any thoughts on your end @emilioalvap ?
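For reference, the hybrid pattern hinted at above (wrap a per-message symmetric key with the agent's public key, encrypt the payload symmetrically) might look roughly like the following sketch; this is not a proposed wire format, just an illustration of the perf-friendly shape most protocols use:

```ts
import * as crypto from 'node:crypto';

// Illustrative only: one-way encryption of an action payload to a specific
// Heartbeat instance, using its RSA public key to wrap a per-message AES key.
function encryptForAgent(agentPublicKeyPem: string, payload: object) {
  const aesKey = crypto.randomBytes(32);
  const iv = crypto.randomBytes(12);
  const cipher = crypto.createCipheriv('aes-256-gcm', aesKey, iv);
  const ciphertext = Buffer.concat([
    cipher.update(JSON.stringify(payload), 'utf8'),
    cipher.final(),
  ]);
  const tag = cipher.getAuthTag();

  // Only the holder of the matching private key (the agent) can unwrap the AES key.
  const wrappedKey = crypto.publicEncrypt(
    { key: agentPublicKeyPem, padding: crypto.constants.RSA_PKCS1_OAEP_PADDING },
    aesKey
  );

  return {
    wrappedKey: wrappedKey.toString('base64'),
    iv: iv.toString('base64'),
    tag: tag.toString('base64'),
    ciphertext: ciphertext.toString('base64'),
  };
}
```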
After discussing this extensively with @paulb-elastic and @drewpost we've decided to revert to @dominiqueclarke 's original design.
While it would be nice to do load balancing etc. from the get-go, it's not worth the extra effort at this point; we have other higher priority items on the roadmap to tackle. I've updated the top comment accordingly.
@andrewvc FYI https://github.com/elastic/ingest-dev/issues/1000 concerning security/credentials.
Final decision for this is to call it Private Location
Closing in favor of https://github.com/elastic/uptime/issues/475
We'd like to shift to a new type of synthetics integration termed "Synthetics Node", where each installation of the integration represents a new custom monitor location rather than a discrete monitor.
Jump below the ACs for a narrative description:
ACs
- Location ID should require a prefix of `custom-` to ensure it does not clash with hosted IDs

The fleet UI would look something like:
Users would create an integration policy per custom location, and add multiple agents to each policy.
Note that the actual sites being monitored are stored by the Uptime app in standard Kibana saved objects outside of fleet. Fleet defines locations, Uptime defines sites being monitored.
Approach