andrewvc closed this issue 2 years ago
Paul had asked a question elsewhere:
Should we expose them all then in the UI (browser.limit, tcp.limit, http.limit, icmp.limit, and heartbeat.scheduler), or is that too confusing?
I personally like the granularity, as it makes sense to define more concurrent lightweight checks (although they're individually defined) than browser checks, but this does add UI/UX complexity.
I'm +1 on exposing this all separately
Can I just check the behaviour, based on these two statements:
Agents within the policy, with this integration, would share load, splitting work between them.
and
It does not factor in load balancing
Does this mean that the user has no control over how the load balancing happens, but load balancing does happen (i.e. the monitors are shared across all agents in the policy)? If so, how is this done: half and half, or some other kind of sharding?
@andrewvc has confirmed that with the approach (of passing the policy through to the Agent when the monitors are changed in Uptime), the policy will be applied to any/all agents assigned to that policy, so each would duplicate the workload (i.e. there is no sharing or sharding of the workload, all agents will run all monitors).
In reality, the expectation is that there should be a single Agent configured against the policy to prevent this double testing.
As part of this, the plan is to deprecate the current Synthetics Integration, so we only have one.
Question - what do we do about existing integrations that are set up when we do this (and upgrade the stack)? Are we expecting a migration (to the saved object version)? Assume we'd need the new integration to be installed by default to handle the index patterns (which are currently managed by the existing/legacy Synthetics integration)?
/cc @dominiqueclarke @andrewvc @drewpost
Thanks Paul. Even though the product is in beta, I think that providing an automated migration path for users of the soon-to-be-deprecated version of the synthetics integration is the right thing to do.
In attendance: @andrewvc @paulb-elastic @joshdover @mostlyjason @dominiqueclarke
Synthetics to move forward with the above plan, with the caveat that users of Synthetics Node will require Fleet ALL permissions in the first iteration.
With regards to future iterations, @joshdover to perform a POC for RBAC evolution. This POC was originally focused on the Endpoint use case, but will now include the Synthetics use case as well. POC will explore allowing admins to create users with a "Manage Custom Synthetics Testing Nodes" permission that will provide read and write permissions only for the Elastic Synthetics Node Integration. As a stretch, the POC may also explore auto orchestrating that permission when Uptime ALL is selected. https://github.com/elastic/obs-dc-team/issues/731
@joshdover Also to explore the Endpoint solution which leverages fetching artifacts from Kibana in Fleet Server before packaging the configuration to be sent to Elastic Agent. Josh will help us explore if this implementation could work well for our use case.
Fleet RBAC design docs updated with Uptime use case discussed during the meeting
Discussed with @andrewvc: a `Location Id` field should be added to the integration form to mimic the service location's `id`.
Edit: The requirement is to provide users with the possibility to define the `id` for a custom location, so that it can be specified during the push command or from the CLI, instead of the location's name. The same logic already applies to service locations.
Custom location ids should be prefixed somehow so that they cannot clash with service location ids, e.g. `us-central` vs `US Central Location`.
I've added a list of ACs including what you've mentioned to the top level description @emilioalvap
Also to explore the Endpoint solution which leverages fetching artifacts from Kibana in Fleet Server before packaging the configuration to be sent to Elastic Agent. Josh will help us explore if this implementation could work well for our use case.
Did some digging here and it seems like artifacts as they exist today aren't going to solve many problems here. Specifically, they won't allow a user without access to edit package policies to ship new monitors to custom locations running on Elastic Agent, because they do require a policy change to instruct an agent to download an updated artifact.
We'll need to explore other options to support this use case. Some rough ideas:
- Change how artifacts work to enable artifact updates to be pushed to agents without requiring users to have access to make direct edits to the package policies. Grant uptime users access to monitor artifacts.
- Endpoint (the only current user of artifacts today) may have a similar requirement. We should discuss this more in detail. Let me know who would like to join the next Fleet RBAC WG meeting next week so we can discuss this.
- Introduce an approval workflow that requires users with access to Synthetics package policies to approve monitor updates before deploying them to custom locations
- Make an exception in the RBAC model and allow Synthetics to make updates to package policies even if the end user doesn't have access
Hi @joshdover. Thanks for the updates regarding the use of artifacts. With regards to the other options we have to support this use case, is a finer tuned RBAC model that allows for users to have write access for specific packages only still an option? I see that some further investigation has been done for https://github.com/elastic/obs-dc-team/issues/731.
is a finer tuned RBAC model that allows for users to have write access for specific packages only still an option?
Yes that's definitely an option, but note that we're planning to implement that via Spaces authorization to the parent Agent policy. So if you wanted a user to have access to edit any monitors that should be run in custom locations, they'd also need to be granted Fleet access to all agent policies for a given Space.
@dominiqueclarke @joshdover I think that's reasonable for an initial approach in that we can document that. It also makes sense in that you might want a custom monitor location only available within one space, but not another.
I've added to the ACs:
Remove the restriction that Manage Monitors only appears for Cloud installations
As in 8.2, Manage monitors will only appear for Cloud installations
I think that's reasonable for an initial approach in that we can document that. It also makes sense in that you might want a custom monitor location only available within one space, but not another.
Sounds good. In that case, I think we can table the artifacts discussion for later and address the Synthetics use case when the time comes. I don't see any reason we couldn't provide a mechanism of some sort that supports the type of authz you all need, whether that's artifacts or something else.
I've added a new approach here that gives us load balancing after speaking with @joshdover and @ph today. It's under the section "Fleet / Agent Interaction - Take Two". Please take a look all, it's quite different than the original approach, but should not be much more work.
I've added an AC that Location ID should require a prefix of `custom-` to ensure it does not clash with hosted IDs.
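For illustration, a prefix check along these lines could be enforced when the integration policy is saved. This is just a sketch; the function name, error messages, and exact character rules are my assumptions, not the actual Kibana code:

```ts
// Illustrative only: a possible validation for custom location IDs, assuming
// the agreed "custom-" prefix rule. Function name and messages are hypothetical.
const CUSTOM_LOCATION_PREFIX = 'custom-';

interface LocationIdCheck {
  valid: boolean;
  reason?: string;
}

function validateCustomLocationId(id: string): LocationIdCheck {
  if (!id.startsWith(CUSTOM_LOCATION_PREFIX)) {
    return { valid: false, reason: `id must start with "${CUSTOM_LOCATION_PREFIX}"` };
  }
  // Keep IDs machine-friendly so they can be used from the CLI / push command.
  if (!/^custom-[a-z0-9][a-z0-9_-]*$/.test(id)) {
    return { valid: false, reason: 'id may only contain lowercase letters, digits, "-" and "_"' };
  }
  return { valid: true };
}

// e.g. "custom-us-central" passes; "US Central Location" stays a display name, not an id.
```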
Is there the option to also allow tagging of the agent nodes to specify/filter on capabilities of the specific node?
e.g. tagging with 'saml' if the agent is deployed on a windows endpoint that can support synthetics for websites that require SAML authentication vs a linux endpoint that doesn't support SAML.
This could also assist with delineating agent endpoints that are deployed to different secure zones within a private network without having to potentially leak details via the location name.
A couple of notes on the action-based load balancing approach being discussed:
- The `.fleet-actions` or `.fleet-action-results` indices filling up with too many events after a long period of time. I'm not sure we have another actions use case that would run as frequently as Uptime's need here and want to make sure we're ready to handle the scale.

I've tried to write up more complete thoughts on this before I go on vacation. I hope this is complete enough. @joshdover if you jump to the "Messaging overhead and perf concerns" section you'll find some back-of-the-envelope math that may be helpful, but I'd love your thoughts on this full comment. In short, I'm not too worried about perf. Also, ++ on not defining an API for now; I think we may want to use bulk APIs, and possibly `update_by_query`.
There are a lot of ways we could distribute work to agents. In short, we can build something simple, but less scalable, or complex but more scalable. Not much of a surprise there.
I've got ideas here in two broad categories:
There is a fundamental design tension here around reducing hotspots and balancing work, both across agents and across time, as well as messaging overhead.
The synthetics service opts to reduce messaging/scheduling overhead for lightweight monitors (because the unit of work is so small that spending more effort to coordinate them optimally makes less sense), and to spend more effort on messaging for heavier browser monitors.
Let's examine solutions given that:
The heartbeat scheduler algorithm today is not perfect; please keep in mind that some of the ideas mentioned here are actually superior to that algorithm.
The algorithm today is quite simple: heartbeat starts up, reads its list of monitors, and schedules all monitors to run immediately. It makes no attempt to spread out execution or stagger jobs. Some of the proposals that follow make some provision to improve that, though this is not required for an initial implementation.
Heartbeat does let you cap maximum concurrent jobs by type, but it doesn't stop you from overprovisioning monitors at all. It will do its best to run them, and if you overprovision it, you will see missed monitors etc. This was less apparent with lightweight monitors since they were so hard to overprovision, but is more apparent with browser-based ones.
Imagine we have 3 different jobs, each set to run every 5 minutes, and two agents with two execution slots each. A visualization of their execution might look like the following given a naive approach where we:
This is essentially what heartbeat does today. If a slot is overfilled, we just try to run it ASAP. I should mention that the heartbeat scheduler will run jobs late if execution capacity is met, effectively shifting them forward, helping spread jobs around, which is not captured here.
This works fine so long as we have more slots than jobs. However, it leaves the workers temporally underutilized, in particular if they run on different schedules. This scheme works out to one slot per monitor.
In this approach we would:
This banks on the observed phenomenon that users typically only have 2-3 unique schedules in their infrastructure. It does have drawbacks however in that:
Since some jobs are offset temporally, you face the problem of what to do when a user creates a new job that runs, say, every 15m: when's the first run? In this case it would make sense to send a 'run once' message to run it right away, but after that run it on schedule, which should be within 15m.
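As a rough illustration of the "run once now, then fall into the schedule" idea, here's a sketch; the deterministic hash-based slot offset is purely my assumption about how offsets could be assigned, not part of the proposal:

```ts
// Sketch only: the hash-based slot offset is an assumption about how offsets
// could be assigned, not part of the proposal.
function firstTwoRuns(monitorId: string, intervalSec: number, createdAt: Date): Date[] {
  // Deterministically spread monitors across the interval.
  let hash = 0;
  for (const ch of monitorId) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  const offsetMs = (hash % intervalSec) * 1000;
  const intervalMs = intervalSec * 1000;

  const created = createdAt.getTime();
  // Run once immediately so the user gets a result right away...
  const immediate = new Date(created);
  // ...then align subsequent runs to the shared slot. The aligned run is at most
  // one interval after creation, satisfying the "within 15m" expectation above.
  const nextAligned = Math.ceil((created - offsetMs) / intervalMs) * intervalMs + offsetMs;
  return [immediate, new Date(nextAligned)];
}
```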
In this approach we let kibana coordinate all the work, and send a message for each invocation of the job. This makes more sense for heavier browser jobs that vary quite a bit in execution time. This approach dynamically responds to the available capacity of the agents since it tries to use whatever resources are free at a given time.
This works by running a periodic kibana background job that, on each invocation does the following:
A pseudocode implementation is shown below:
sample_monitor_saved_object = {
  id: "my-monitor",
  schedule: "@every 60s",
  dispatch_next_at: "2022-04-07T04:00:01", # Exact time this monitor should next run
  dispatched_at: "2022-04-07T03:55:01",    # Set to the dispatch time while the monitor is executing, nil otherwise
  dispatched_to_agent: "my-agent-id",      # ID of the agent this monitor is running on
  # ... other monitor fields
}
# Two jobs, one that dispatches stuff, one that reads responses
schedule_kibana_background_job(&:dispatch_jobs, 30.seconds)
schedule_kibana_background_job(&:read_responses, 1.minute)
# Tracks how much usage each agent is seeing
# This is modeled as an implementation non-specific class, but it could be either
# in memory or backed by an ES document
agent_utilization = AgentUtilizationTracker.new
# Dispatches all jobs that are due to be scheduled to the relevant agents
def dispatch_jobs()
  monitors = saved_objects.
    get_all_uptime_monitors().
    # Filter out monitors that are already running. We could do more involved error
    # handling here in the final product; we should probably forcibly terminate monitors
    # running longer than their schedule, and log this if it does happen.
    filter {|m| m.dispatched_at == nil}.
    filter {|m| m.dispatch_next_at <= now()} # Just get the ones that need to run
locations_monitors = monitors.group_by(&:location_name)
locations_monitors.each do |location, monitors|
actions = [] # Actions to be performed in this location
synth_node_integration = fleet.get_integration_for_location(location)
agents = synth_node_integration.fleet_policy.agents.filter(&:status_up)
# Send a message to all agents in this location, blocking while waiting for their responses to see their current utilization
agent_utilization = get_agent_utilization(agents)
monitors.each do |monitor|
# Get the least busy agent and increment its utilization by 1
agent = agent_utilization.get_least_busy_and_increment()
# Could also just be one action for all the grouped monitors, not sure if the tradeoff for fewer bigger docs
# makes sense, but it might
# Note that we set 'overscheduled' to true if we've been forced to send more monitors to this agent
# than it has capacity for, in which case the agent will queue this monitor and run it a little late
actions.push({action: "run_monitor", agent: agent, monitor: monitor, overscheduled: agent.utilization_exceeded ? true : false })
end
# Use the bulk index API to send all the actions for one location at once for efficiency
es.bulk_index_actions(actions)
# Run a bulk update_by_query on saved objects to set dispatched_at, dispatch_next_at and dispatched_to_agent
# If utilization_exceeded, the agent was overscheduled and will queue the monitor locally (see the 'overscheduled' flag above)
es.bulk_update_next_runs(actions)
end
end
# Reads responses from the dispatched agent jobs and handles them (mostly error logging, see below)
def read_responses
responses = es.read_get_and_delete_all_responses()
responses.each do |r|
# I'm actually not sure what to do with responses. On the happy path we can just throw them away
# If the monitor encounters an error executing that's already handled
# The only real use for these is errors maybe within fleet, or if heartbeat can't even create the
# monitor. In that case I'd probably just log them since they should be extremely rare.
end
end
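For illustration, the `es.bulk_update_next_runs(actions)` step above could be expressed as a single `_update_by_query` request. This is only a sketch: the index name, field names, and the idea of hitting the raw REST API (rather than the saved objects client) are assumptions, not a decided implementation.

```ts
// Illustrative sketch of es.bulk_update_next_runs: mark a batch of monitors as
// dispatched in one _update_by_query call. Index name, field names, and the
// assumption that monitor state is directly updatable like this are mine.
interface Dispatch {
  monitorDocId: string;
  agentId: string;
}

async function markDispatched(esUrl: string, apiKey: string, dispatches: Dispatch[]) {
  const assignments: Record<string, string> = {};
  for (const d of dispatches) assignments[d.monitorDocId] = d.agentId;

  await fetch(`${esUrl}/synthetics-monitor-configs/_update_by_query?conflicts=proceed`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Authorization: `ApiKey ${apiKey}` },
    body: JSON.stringify({
      query: { ids: { values: Object.keys(assignments) } },
      script: {
        lang: 'painless',
        source:
          'ctx._source.dispatched_at = params.now;' +
          'ctx._source.dispatched_to_agent = params.assignments[ctx._id];',
        params: { now: new Date().toISOString(), assignments },
      },
    }),
  });
}
```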
This work queue requires one message, or ES document, per job invocation, possibly two if you include the response (though I think we could skip responses for job invocation messages). Let's remember that heartbeat still needs to write the results to the cluster too, a minimum of one doc (usually a much larger doc than the job invocation).
For browser-based jobs the messaging overhead is by definition negligible, dwarfed by the multitude of results each browser job stores in ES. For lightweight jobs the ratio is different: in the worst case this requires a doubling of write capacity, perhaps a tripling if you count deleting the messages. Storage capacity should be unaffected since these fleet actions are not retained.
Let's take the example of a very large scale user with, say, 15000 lightweight jobs and 5000 browser jobs all scheduled to run every 5m; this works out to roughly 67 jobs/second. Regardless of scheduling they need a cluster that can handle at least 50 lightweight result docs per second, plus roughly 17 browser runs per second at, say, ~300 docs per browser run, i.e. ~5000 browser docs per second. In this scenario their baseline is around 5050 docs/second. The overhead of messaging is maybe only ~200 docs/s extra. They already need to have capacity to account for quite a bit of throughput here; the additional requirements are a rounding error.
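The same back-of-the-envelope numbers, written out so they're easy to re-check (the ~3 extra docs per run for messaging is a worst-case assumption):

```ts
// Back-of-the-envelope throughput for the example above.
const lightweightMonitors = 15_000;
const browserMonitors = 5_000;
const intervalSec = 5 * 60; // every 5 minutes

const lightweightRunsPerSec = lightweightMonitors / intervalSec; // 50
const browserRunsPerSec = browserMonitors / intervalSec;         // ~16.7
const docsPerBrowserRun = 300;                                   // rough figure from the text

const baselineDocsPerSec =
  lightweightRunsPerSec + browserRunsPerSec * docsPerBrowserRun; // ~5,050 docs/s

// Worst case: ~3 extra docs per run for messaging (action, response, delete).
const messagingDocsPerSec = (lightweightRunsPerSec + browserRunsPerSec) * 3; // ~200 docs/s

console.log({ baselineDocsPerSec, messagingDocsPerSec });
```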
There may be more concern about kibana perf here, specifically around deserializing monitors, especially encrypted fields. However, we should benchmark that, and see what our true throughput is here. If it's a problem we can likely find a way to only send the full config when seeing an agent for the first time.
For a first pass I suggest we use the work queue. Yes, it introduces messaging overhead for lightweight monitors, but it solves the more serious problem of browser monitor scheduling. Let's remember, most users have double digit numbers of monitors, around 40 or so, where overhead is small. For larger customers I would wager this will still work for most common scales of operation, especially given that users usually mix lightweight and browser monitors, and a single browser monitor is a large multiple of a single lightweight monitor.
Finally, we could look into fancier ways to pre-plan schedules for the lightweight monitors, like the segmented slots proposal or even fancier things, but only if it actually is needed. These methods would severely reduce messaging overhead, at the cost of more code complexity.
Thanks for the detailed writeup @andrewvc. I suspect the message overhead of the basic work queue approach to be negligible as well. Likely the most valuable benefit of the other approaches is that it removes Kibana (and Task Manager) from the critical path of executing monitors. When Kibana is down or Task Manager / Alerting is being heavily used, the task running to schedule Uptime monitors will be backed up as well, which will delay running monitors.
This concern is slightly 'mitigated' by the fact that if alerting is slowed down, users aren't going to receive alerts about their down monitors, regardless of whether or not Uptime monitors are executing fast enough.
Another way to solve this could be to introduce a concept of task priority to the task manager queue in Kibana, which has not yet been solved: https://github.com/elastic/kibana/issues/75041
@andrewvc thanks for such a detailed write-up! Before I go on holidays I thought I'd also outline one more approach which came to mind after our last discussion: create a pull-based mechanism from the Heartbeat side, rather than pushing work through Kibana.
In this comment, I'll explain how this approach would work, and why I think it might be a good one. Then, I'll relate this approach back to the constraints and ACs you outlined previously.
The pull-based mechanism would involve more substantial changes to Heartbeat than to Kibana.
It consists of Heartbeat instances pulling work from the Kibana/ES queue whenever they're free. Such an implementation turns Heartbeat into a "consumer" which actively looks for things to do, rather than being told what to do next.
The item(s) to be executed by Heartbeat are provided by a Kibana API which will always return the due jobs with the highest priority.
➡️ Note that the optimal number of jobs to pull per request is, theoretically, 1. That's because by pulling one job at a time we ensure that the next job will only be pulled as early as possible when an agent is available. Imagine, for example, that Agent A pulls two jobs X and Y. If Agent B becomes free while agent A is executing X, Y would be executed later than it could have been if it was still in the queue when Agent B became available. We can, however, reduce the number of requests if we decide that it's worth pulling more jobs than 1, especially if we can increase the number of parallel monitor runs.
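Purely as an illustration of the consumer shape (not a proposed API), a Heartbeat-side pull loop could look roughly like this; the endpoint path, query parameter, and response shape are all hypothetical:

```ts
// Hypothetical consumer loop; the pull endpoint, its parameters, and the
// response shape do not exist today and are shown only to illustrate the idea.
interface PulledJob {
  monitorId: string;
  config: unknown;
}

async function executeMonitor(job: PulledJob): Promise<void> {
  // Run the monitor through the normal heartbeat execution path, then
  // publish results to Elasticsearch as usual.
}

async function runConsumer(kibanaUrl: string, apiKey: string): Promise<void> {
  while (true) {
    const res = await fetch(`${kibanaUrl}/api/synthetics/jobs/pull?limit=1`, {
      method: 'POST',
      headers: { Authorization: `ApiKey ${apiKey}`, 'kbn-xsrf': 'true' },
    });
    const jobs: PulledJob[] = res.ok ? await res.json() : [];
    if (jobs.length === 0) {
      // Nothing due right now; back off briefly before polling again.
      await new Promise((resolve) => setTimeout(resolve, 5_000));
      continue;
    }
    for (const job of jobs) {
      await executeMonitor(job);
    }
  }
}
```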
To determine which monitors are due to run, we consider a monitor's "acceptable execution window", not its execution time. Whenever a monitor is pulled, its "acceptable execution window" grows by `<SCHEDULE>` seconds.
At the time Heartbeat instances ask for jobs, monitors whose "acceptable execution window" end is greater than the current time are considered "due to run".
Priorities are defined considering both a monitor's schedule and how long it takes to run.
We calculate priorities that way because it's better for an `@every 30m` monitor to be delayed 30s than for an `@every 1m` monitor to be delayed 30s. In other words, the cost of the delay is relative to the monitor's schedule.
For the sake of this example we'll calculate the cost of delay as `1 / schedule (seconds)`, which gives us the cost of delay per second.
Considering monitors with the shortest schedule are the most expensive ones to delay, in order to optimise value per second of execution we run the quickest monitors which have the highest cost of delay first.
As a visual representation of how WSJF reduces cost of delay (CoD) over time, I like this example from one of Donald Reinertsen's books.
Imagine, for example, that you have the following monitors:
- A: `@every 30s` (CoD/s: ~0.033) / Duration: 20s / WSJF = CoD/Duration ≈ 0.00167
- B: `@every 60s` (CoD/s: ~0.017) / Duration: 15s / WSJF = CoD/Duration ≈ 0.00111
- C: `@every 15s` (CoD/s: ~0.067) / Duration: 30s / WSJF = CoD/Duration ≈ 0.00222

In that case, the optimal priority order for these jobs is C, A, B.
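A small sketch that reproduces the ordering above (field names are illustrative, not from any existing code):

```ts
// Sketch of the WSJF-style prioritisation described above: cost of delay is
// 1 / schedule (per second), and priority = CoD / duration. Higher runs first.
interface MonitorStats {
  id: string;
  scheduleSec: number; // e.g. 30 for "@every 30s"
  durationSec: number; // observed from previous runs; unknown => highest priority
}

function wsjf(m: MonitorStats): number {
  const costOfDelayPerSec = 1 / m.scheduleSec;
  return costOfDelayPerSec / m.durationSec;
}

const monitors: MonitorStats[] = [
  { id: 'A', scheduleSec: 30, durationSec: 20 },
  { id: 'B', scheduleSec: 60, durationSec: 15 },
  { id: 'C', scheduleSec: 15, durationSec: 30 },
];

const order = [...monitors].sort((a, b) => wsjf(b) - wsjf(a)).map((m) => m.id);
console.log(order); // ["C", "A", "B"]
```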
Now, if you consider the fact that monitors are only considered due to run at the end of their "acceptable execution window", this mechanism simply ensures we make the most out of the available resources in terms of reducing the cost per delay of a monitor.
👉 Even if we have multiple monitors with the very same schedules, we only guarantee they'll run within the acceptable execution window, not when they'll be executed within that window.
👉 We know the `Duration` of a monitor because we run it immediately once it's created (monitors that have never run get the highest priority). Therefore, I'd recommend we always have at least 2 threads of execution to prevent a very long monitor from delaying others.
👉 Priority is calculated at the time of the request.
Now, revisiting the ACs, I think this approach fulfils all of them:
👉 A job will always run within its window of execution, no matter when. A job could run, for example, right at the end of its window and then right at the start of the next window. Running it twice within the same window would be incorrect IMO, so I consider this AC to be ✅ . Furthermore, with more than 1 execution thread we could easily even start the next run while the first one is still in progress.
This is not necessary for the scheduling algorithm, so I've marked it as yellow because I think it's not directly related, although I do think a "killswitch" at 15m would be useful. I don't think we should use the schedule interval though, because with multiple threads we could have monitors which take longer than their schedule interval and still always run within their acceptable execution windows.
Definitely a yes given this is pull-based, not push-based.
Any algorithm would only be able to achieve this item if capacity > workload. Assuming capacity matches the workload, this algorithm does run monitors consistently.
👉 Heartbeat pulls jobs rather than having Kibana push them.
Also worth noticing that:
Thanks for the thorough writeup @lucasfcosta ! I agree that there's a lot to be said for a pull vs. push architecture, and TBH I'd generally favor that myself. It's how most work queues are implemented.
The key reason for going with the push approach was to simplify network configuration: the push approach can work atop the existing fleet server comms channel. With a pull approach, users need to make sure heartbeat instances have access to Kibana, which they may not. It's another source of potential issues, network ACL configuration, etc.
It does make me wonder if we can work in reverse in fleet. @joshdover @ph , could heartbeat initiate comms with kibana rather than the other way around?
Thanks for the detailed overview @andrewvc! I've gone through the proposal a few times already and I just have a couple of points that I might not have picked up from your comment:
- With the `overprovisioned` flag, in a scenario where total monitor execution time > total CPU available for a given location, how are we going to handle snowballing the local queue? Should we flush it on every message received?

@emilioalvap what is the `overprovisioned` flag? I don't see that word elsewhere here, but I think I understand what you mean. At any rate, we could make sure the queue holds at most one message per monitor ID at a time, so if one message is missed, subsequent ones don't pile up. Since everything in ES is easily indexed this should be easy, and could possibly be done even more optimally by encoding the monitor ID deterministically into the doc ID.
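As a sketch of that dedupe idea only: derive the action document `_id` from the monitor ID and rely on create-if-absent semantics, so a second pending message for the same monitor is rejected rather than piling up. The index name and doc shape are assumptions, and in practice this would presumably go through Fleet's own APIs rather than raw index writes.

```ts
// Sketch only: index name and doc shape are assumptions. The deterministic _id
// plus create-if-absent means a second pending message for the same monitor
// conflicts (HTTP 409) instead of piling up; processed actions are assumed to
// be deleted before the next dispatch.
async function enqueueRunAction(
  esUrl: string,
  apiKey: string,
  monitorId: string,
  action: object
): Promise<boolean> {
  const docId = `run-${monitorId}`; // deterministic per monitor
  const res = await fetch(`${esUrl}/.fleet-actions/_create/${encodeURIComponent(docId)}`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json', Authorization: `ApiKey ${apiKey}` },
    body: JSON.stringify(action),
  });
  // 409 means a message for this monitor is already queued; drop this one.
  return res.status !== 409;
}
```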
WRT the second question about running the task on demand, I think it comes down to benchmarking how efficiently it can run. If we can run it every, say, 5s without any issue (since it'd usually be a noop) maybe we just do that? Otherwise, yes, having a way to just invoke it as a function makes sense, most likely with a global lock (using a doc as a lock).
@andrewvc Sorry about that, I meant `overscheduled`. You answered it anyway.
With this approach, will monitor configuration and source code still be added to fleet integration policy or sent to each individual agent on each iteration? Do these "action" documents need to be encrypted as well?
That's an excellent question about encryption. I'm not entirely sure what the best approach here is. Some thoughts:
We should look into what security guarantees there are on this data at the moment. I assume these docs are already highly restricted; I don't see us needing much more in terms of encryption than the existing saved objects have, and I'd expect those indices to already be highly restricted. @joshdover @ph thoughts?
We wouldn't want to share any secrets kibana uses, we may have to re-encrypt anyway to satisfy encryption for data at rest, but then that raises the question of where to store the key.
The most paranoid thing we could do here would be to use a public key from the intended heartbeat to encrypt stuff one way into the messages. I don't know if it'd be worth it to go even further and implement DH key exchange or one of its successors. It'd be interesting to see what the perf impact of this approach would be, since most crypto protocols use symmetric encryption for perf with an asymmetric key exchange.
Looking at these two options, approach B has the easiest to reason about security surface area. Any thoughts on your end @emilioalvap ?
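For reference, the hybrid pattern hinted at above (wrap a per-message symmetric key with the agent's public key, encrypt the payload symmetrically) might look roughly like the following sketch; this is not a proposed wire format, just an illustration of the perf-friendly shape most protocols use:

```ts
import * as crypto from 'node:crypto';

// Illustrative only: one-way encryption of an action payload to a specific
// Heartbeat instance, using its RSA public key to wrap a per-message AES key.
function encryptForAgent(agentPublicKeyPem: string, payload: object) {
  const aesKey = crypto.randomBytes(32);
  const iv = crypto.randomBytes(12);
  const cipher = crypto.createCipheriv('aes-256-gcm', aesKey, iv);
  const ciphertext = Buffer.concat([
    cipher.update(JSON.stringify(payload), 'utf8'),
    cipher.final(),
  ]);
  const tag = cipher.getAuthTag();

  // Only the holder of the matching private key (the agent) can unwrap the AES key.
  const wrappedKey = crypto.publicEncrypt(
    { key: agentPublicKeyPem, padding: crypto.constants.RSA_PKCS1_OAEP_PADDING },
    aesKey
  );

  return {
    wrappedKey: wrappedKey.toString('base64'),
    iv: iv.toString('base64'),
    tag: tag.toString('base64'),
    ciphertext: ciphertext.toString('base64'),
  };
}
```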
After discussing this extensively with @paulb-elastic and @drewpost we've decided to revert to @dominiqueclarke 's original design.
While it would be nice to do load balancing etc. from the get-go, it's not worth the extra effort at this point; we have other higher priority items on the roadmap to tackle. I've updated the top comment accordingly.
@andrewvc FYI https://github.com/elastic/ingest-dev/issues/1000 concerning security/credentials.
Final decision for this is to call it Private Location
Closing in favor of https://github.com/elastic/uptime/issues/475
We'd like to shift to a new type of synthetics integration termed "Synthetics Node", where each installation of the integration represents a new custom monitor location rather than a discrete monitor.
Jump below the ACs for a narrative description:
ACs
- Location ID should require a prefix of `custom-` to ensure it does not clash with hosted IDs

The fleet UI would look something like:
Users would create an integration policy per custom location, and add multiple agents to each policy.
Note that the actual sites being monitored are stored by the Uptime app in standard Kibana saved objects outside of fleet. Fleet defines locations, Uptime defines sites being monitored.
Approach