edgeless-project / edgeless


Observability metric updates #220

Open ccicconetti opened 3 weeks ago

ccicconetti commented 3 weeks ago

@alvarocurt please add here other suggested changes

alvarocurt commented 3 weeks ago

Sure, sorry for the delay. I'll score each suggested change from 1 to 5 by relevance:

  1. Cosmetic
  2. Nice to have
  3. Facilitates future work
  4. Improvements of what we have
  5. Unblocks functionality development/allows for new functionalities.
alvarocurt commented 1 week ago

From an Anomaly Detection standpoint, this is the expected interaction with the currently available metrics:

Edgeless starts

Anomaly detection sees node: and provider: entries. There are two possible approaches to these metrics:

  1. Pull node:capabilities:* and provider:* once, and node:health:* every 2 seconds. Suitable for predictable tests.
  2. Pull node:* and provider:* every 2 seconds. The number of nodes is never going to be very high, so this is OK; it is just redundant to process static data every time.

Suggestion: provide an alternative way to notify cluster changes, perhaps an entry like node_list. AD would periodically pull only that (plus the health metrics) and thus detect any change in the Orchestration Domain nodes. The resource providers of each node cannot currently change dynamically, so they do not pose any problem.
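For illustration, a minimal polling sketch in Python with redis-py (the node_list key is the hypothetical entry suggested above, not an existing metric, and the metric values are assumed to be plain strings):

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Static metrics, pulled once (approach 1).
capabilities = {k: r.get(k) for k in r.scan_iter("node:capabilities:*")}
providers = {k: r.get(k) for k in r.scan_iter("provider:*")}

known_nodes = set()
while True:
    # Hypothetical node_list entry: a set with the UUIDs of the nodes currently
    # in the orchestration domain, maintained by the orchestrator.
    nodes = r.smembers("node_list")
    if nodes != known_nodes:
        # Cluster membership changed: refresh the static metrics.
        capabilities = {k: r.get(k) for k in r.scan_iter("node:capabilities:*")}
        providers = {k: r.get(k) for k in r.scan_iter("provider:*")}
        known_nodes = nodes

    # Dynamic health metrics, pulled every 2 seconds.
    health = {k: r.get(k) for k in r.scan_iter("node:health:*")}
    time.sleep(2)
```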

Workflow registration

New metrics dependency: and instance: appear. There is no clean way to just check the current workflows in the cluster, so the AD has to pull every dependency:* and instance:* entry every 2 seconds, in case a new one is added.

  1. AD uses the instance: entries to get the functions/resources currently deployed, and the node hosting each of them. There is no clean way to distinguish functions from resources, other than that resources carry a class_type field in their information.
  2. AD uses the dependency: entries to form a map of the sequence of logical function UUIDs (see the sketch after this list), which may look like:
    flowchart LR
    http-ingress["32c6d750-07c8-4dfc-b2e8-7ad47d025f55"] --> external_trigger["5f655923-7c37-4ff9-bbcd-59097aef13ea"]
    external_trigger --> incr["83e08333-4393-443e-992b-fe59f274d221"]
    incr             --> double["baaf9af1-8644-466a-935f-1e284ebf1ecb"]
    double           --> external_sink["cf0b4934-8a6c-49b4-995b-3a7f395a6ea4"]
    external_sink    --> http-egress["8eb6883a-478a-442b-bb17-daa489c31b76"]
  3. AD now forms a second map of the sequence of physical UUIDs that may look like:
    flowchart LR
    http-ingress["917ada61-6022-4ab3-a33e-f604ae330a96"] --> external_trigger["ff5316e1-c91e-455e-b837-5866dcdadf3e"]
    external_trigger --> incr["eb537d95-f605-40c7-8f54-8b84879d8533"]
    incr             --> double["4f0f0562-bc89-4ad4-a892-a2b0179eb3de"]
    double           --> external_sink["88f9fa97-0dea-4c27-a2f0-3af7be1747e8"]
    external_sink    --> http-egress["e363abab-41bc-4507-bb20-62a5c2426c9e"]
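A rough sketch of how AD could build those two maps from the raw entries (Python with redis-py; the value layouts of dependency:* and instance:*, and field names such as physical_uuid, are assumed here for illustration and may differ from the real encoding):

```python
import json
import redis

r = redis.Redis(decode_responses=True)

# Logical map: which logical UUID feeds which.
# Assumption: dependency:<logical_UUID> holds a JSON object mapping output
# names to target logical UUIDs.
logical_edges = {}
for key in r.scan_iter("dependency:*"):
    src = key.split(":", 1)[1]
    logical_edges[src] = list(json.loads(r.get(key)).values())

# Placement map: logical UUID -> instance information.
# Assumption: instance:<logical_UUID> holds a JSON object with the node id and
# the physical UUID; resources can be told apart by the class_type field.
placement = {}
for key in r.scan_iter("instance:*"):
    logical = key.split(":", 1)[1]
    info = json.loads(r.get(key))
    info["is_resource"] = "class_type" in info
    placement[logical] = info

# Physical map: the same edges, rewritten in terms of physical UUIDs.
physical_edges = {
    placement[src]["physical_uuid"]: [placement[dst]["physical_uuid"] for dst in dsts]
    for src, dsts in logical_edges.items()
    if src in placement and all(dst in placement for dst in dsts)
}
```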

Suggestion: same as before, but more urgently: there is no clean way of inferring the number of WFs in the cluster and their associated functions. The number of WFs has to be deduced by building the dependency maps, which is redundant work every 2 seconds. One way to solve this would be to create entries workflow:<UUID> holding all the information about a workflow's instances (functions and resources): logical UUIDs, physical UUIDs, dependencies and annotations/configurations. The only mutable WF information is the physical UUIDs, since functions can be moved across nodes; pulling this information every 2 seconds, without having to rebuild the dependency map, should not be much effort.
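For concreteness, a workflow:<UUID> entry could look roughly like the structure below (purely illustrative: the field names are made up, only the UUIDs are taken from the maps above):

```python
# Hypothetical content of a workflow:<workflow_UUID> entry (illustrative only).
workflow_entry = {
    "instances": {
        "incr": {
            "kind": "function",
            "logical_uuid": "83e08333-4393-443e-992b-fe59f274d221",
            "physical_uuid": "eb537d95-f605-40c7-8f54-8b84879d8533",  # mutable: changes if the function migrates
            "annotations": {},
        },
        "http-ingress": {
            "kind": "resource",
            "class_type": "http-ingress",
            "logical_uuid": "32c6d750-07c8-4dfc-b2e8-7ad47d025f55",
            "physical_uuid": "917ada61-6022-4ab3-a33e-f604ae330a96",
        },
    },
    "dependencies": {
        "http-ingress": ["external_trigger"],
        "incr": ["double"],
    },
}
```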

Workflow execution

Every time a workflow is executed (an external resource is called and a sequence of functions then runs), the performance:<physical_UUID> entry in Redis gains a new line with a timestamp and the execution duration. AD can associate the physical_UUID with the logical_UUID and the node via the previously built maps, and pulls these entries every 2 seconds. However, older performance samples are NEVER deleted. For controlled benchmark scenarios this is not a problem, but long-lasting WFs with decent usage can accumulate so much data that they fill the memory of the Redis host. Performance samples need some kind of garbage collection/periodic removal to avoid this. Also, AD cannot know in advance how many timestamps are enough for inference: "list" entries in Redis allow querying the N latest entries, but not filtering on the values, so we cannot query the durations since timestamp X. This may be another reason to change this entry to another type, such as a sorted set (ZSET):

  Key name: duration:<function_physical_UUID>
  Entry format: timestamp:duration
  Score: timestamp
  Garbage collection: with a command like ZREMRANGEBYSCORE
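A minimal sketch of the proposed sorted-set layout (Python with redis-py; the duration:<physical_UUID> key follows the proposal above and does not exist today):

```python
import time
import redis

r = redis.Redis(decode_responses=True)
key = "duration:eb537d95-f605-40c7-8f54-8b84879d8533"  # hypothetical per-function key

# Producer side: one member per sample, "timestamp:duration", scored by the timestamp.
now = time.time()
r.zadd(key, {f"{now}:12.3": now})

# AD side: fetch only the samples newer than the last pull (2 seconds ago).
new_samples = r.zrangebyscore(key, now - 2.0, "+inf")

# Garbage collection: drop everything older than, e.g., one hour.
r.zremrangebyscore(key, "-inf", time.time() - 3600)
```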

ccicconetti commented 1 week ago

> From an Anomaly Detection standpoint, this is the expected interaction with the currently available metrics:
>
> Edgeless starts
>
> Anomaly detection sees node: and provider: entries. There are two possible approaches to these metrics:
>
>   1. Pull node:capabilities:* and provider:* once, and node:health:* every 2 seconds. Suitable for predictable tests.
>   2. Pull node:* and provider:* every 2 seconds. The number of nodes is never going to be very high, so this is OK; it is just redundant to process static data every time.
>
> Suggestion: provide an alternative way to notify cluster changes, perhaps an entry like node_list. AD would periodically pull only that (plus the health metrics) and thus detect any change in the Orchestration Domain nodes. The resource providers of each node cannot currently change dynamically, so they do not pose any problem.

Alternative: use Redis keyspace notifications.
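A rough sketch of that alternative (Python with redis-py; keyspace notifications are disabled by default, so they must first be enabled in the Redis configuration):

```python
import redis

r = redis.Redis(decode_responses=True)

# Enable keyspace notifications (can also be set in redis.conf).
r.config_set("notify-keyspace-events", "KEA")

p = r.pubsub()
# Be notified whenever a node:* key in database 0 is created, updated or deleted,
# instead of polling every 2 seconds.
p.psubscribe("__keyspace@0__:node:*")

for msg in p.listen():
    if msg["type"] == "pmessage":
        # msg["channel"] is "__keyspace@0__:<key>", msg["data"] is the command (e.g. "set", "del").
        print(msg["channel"], msg["data"])
```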

> Workflow registration
>
> New metrics dependency: and instance: appear. There is no clean way to just check the current workflows in the cluster, so the AD has to pull every dependency:* and instance:* entry every 2 seconds, in case a new one is added. ... Suggestion: same as before, but more urgently: there is no clean way of inferring the number of WFs in the cluster and their associated functions. The number of WFs has to be deduced by building the dependency maps, which is redundant work every 2 seconds. One way to solve this would be to create entries workflow:<UUID> holding all the information about a workflow's instances (functions and resources): logical UUIDs, physical UUIDs, dependencies and annotations/configurations. The only mutable WF information is the physical UUIDs, since functions can be moved across nodes; pulling this information every 2 seconds, without having to rebuild the dependency map, should not be much effort.

By design, the ε-ORC does not know about workflows.

The information about the workflow composition is available only at the ε-CON.

For now, workflow information cannot be made available in the local Redis.

In the (near) future, i.e., when the ε-CON is implemented, workflow information will be made available in the "global observability" platform (possibly another Redis in-memory database).

> Workflow execution
>
> Every time a workflow is executed (an external resource is called and a sequence of functions then runs), the performance:<physical_UUID> entry in Redis gains a new line with a timestamp and the execution duration. AD can associate the physical_UUID with the logical_UUID and the node via the previously built maps, and pulls these entries every 2 seconds. However, older performance samples are NEVER deleted. For controlled benchmark scenarios this is not a problem, but long-lasting WFs with decent usage can accumulate so much data that they fill the memory of the Redis host. Performance samples need some kind of garbage collection/periodic removal to avoid this. Also, AD cannot know in advance how many timestamps are enough for inference: "list" entries in Redis allow querying the N latest entries, but not filtering on the values, so we cannot query the durations since timestamp X. This may be another reason to change this entry to another type, such as a sorted set (ZSET): key name duration:<function_physical_UUID>, entry format timestamp:duration, score timestamp, garbage collection with a command like ZREMRANGEBYSCORE.

The risk of accumulating too many samples: yes, this can be a problem. The proposed method, i.e., using a sorted set instead of a list, coupled with periodically removing old samples, could be a viable solution!