edgeless-project / edgeless


Observability metric updates #220

Open ccicconetti opened 3 weeks ago

ccicconetti commented 3 weeks ago

@alvarocurt please add here other suggested changes

alvarocurt commented 3 weeks ago

Sure, sorry for the delay. I'll score each suggested change from 1 to 5 by relevance:

  1. Cosmetic
  2. Nice to have
  3. Facilitates future work
  4. Improvements of what we have
  5. Unblocks functionality development/allows for new functionalities.
alvarocurt commented 1 week ago

From an Anomaly Detection standpoint, this is the expected interaction with the currently available metrics:

Edgeless starts

Anomaly detection sees node: and provider: entries. There are two possible approaches to these metrics:

  1. Pull node:capabilities:* and provider:* once, and node:health:* every 2 seconds. Suitable for predictable tests.
  2. Pull node:* and provider:* every 2 seconds. The number of nodes is never going to be very high, so this is OK; it is just redundant to process static data every time.

Suggestion: provide an alternative way to notify cluster changes, perhaps an entry like node_list. AD would periodically pull only that (plus the health metrics) and thus detect any change in the Orchestration Domain nodes. The resource providers of each node cannot currently change dynamically, so they do not pose any problem.
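For illustration, a minimal polling sketch in Python with redis-py (the node_list key is the hypothetical entry suggested above, not an existing metric, and the metric values are assumed to be plain strings):

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Static metrics, pulled once (approach 1).
capabilities = {k: r.get(k) for k in r.scan_iter("node:capabilities:*")}
providers = {k: r.get(k) for k in r.scan_iter("provider:*")}

known_nodes = set()
while True:
    # Hypothetical node_list entry: a set with the UUIDs of the nodes currently
    # in the orchestration domain, maintained by the orchestrator.
    nodes = r.smembers("node_list")
    if nodes != known_nodes:
        # Cluster membership changed: refresh the static metrics.
        capabilities = {k: r.get(k) for k in r.scan_iter("node:capabilities:*")}
        providers = {k: r.get(k) for k in r.scan_iter("provider:*")}
        known_nodes = nodes

    # Dynamic health metrics, pulled every 2 seconds.
    health = {k: r.get(k) for k in r.scan_iter("node:health:*")}
    time.sleep(2)
```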

Workflow registration

New metrics dependency: and instance: appear. There is no clean way to just check the current workflows in the cluster, so the AD has to pull every dependency:* and instance:* entry every 2 seconds, in case a new one is added.

  1. AD uses the instance: entries to get the functions/resources currently deployed, and the node hosting each of them. There is no clean way to distinguish functions from resources, other than that resources carry a class_type field in their information.
  2. AD uses the dependency: entries to form a map of the sequence of logical function UUIDs (see the sketch after this list), which may look like:
    flowchart LR
    http-ingress["32c6d750-07c8-4dfc-b2e8-7ad47d025f55"] --> external_trigger["5f655923-7c37-4ff9-bbcd-59097aef13ea"]
    external_trigger --> incr["83e08333-4393-443e-992b-fe59f274d221"]
    incr             --> double["baaf9af1-8644-466a-935f-1e284ebf1ecb"]
    double           --> external_sink["cf0b4934-8a6c-49b4-995b-3a7f395a6ea4"]
    external_sink    --> http-egress["8eb6883a-478a-442b-bb17-daa489c31b76"]
  3. AD now forms a second map of the sequence of physical UUIDs that may look like:
    flowchart LR
    http-ingress["917ada61-6022-4ab3-a33e-f604ae330a96"] --> external_trigger["ff5316e1-c91e-455e-b837-5866dcdadf3e"]
    external_trigger --> incr["eb537d95-f605-40c7-8f54-8b84879d8533"]
    incr             --> double["4f0f0562-bc89-4ad4-a892-a2b0179eb3de"]
    double           --> external_sink["88f9fa97-0dea-4c27-a2f0-3af7be1747e8"]
    external_sink    --> http-egress["e363abab-41bc-4507-bb20-62a5c2426c9e"]
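A rough sketch of how AD could build those two maps from the raw entries (Python with redis-py; the value layouts of dependency:* and instance:*, and field names such as physical_uuid, are assumed here for illustration and may differ from the real encoding):

```python
import json
import redis

r = redis.Redis(decode_responses=True)

# Logical map: which logical UUID feeds which.
# Assumption: dependency:<logical_UUID> holds a JSON object mapping output
# names to target logical UUIDs.
logical_edges = {}
for key in r.scan_iter("dependency:*"):
    src = key.split(":", 1)[1]
    logical_edges[src] = list(json.loads(r.get(key)).values())

# Placement map: logical UUID -> instance information.
# Assumption: instance:<logical_UUID> holds a JSON object with the node id and
# the physical UUID; resources can be told apart by the class_type field.
placement = {}
for key in r.scan_iter("instance:*"):
    logical = key.split(":", 1)[1]
    info = json.loads(r.get(key))
    info["is_resource"] = "class_type" in info
    placement[logical] = info

# Physical map: the same edges, rewritten in terms of physical UUIDs.
physical_edges = {
    placement[src]["physical_uuid"]: [placement[dst]["physical_uuid"] for dst in dsts]
    for src, dsts in logical_edges.items()
    if src in placement and all(dst in placement for dst in dsts)
}
```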

Suggestion: same as before, but more urgently: there is no clean way of inferring the number of WFs in the cluster and their associated functions. The number of WFs has to be deduced by building the dependency maps, which is redundant work every 2 seconds. One way to solve this would be to create entries workflow:<UUID> holding all the information about a workflow's instances (functions and resources): logical UUIDs, physical UUIDs, dependencies and annotations/configurations. The only mutable WF information is the physical UUIDs, since functions can be moved across nodes; pulling this information every 2 seconds, without having to rebuild the dependency map, should not be much effort.
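For concreteness, a workflow:<UUID> entry could look roughly like the structure below (purely illustrative: the field names are made up, only the UUIDs are taken from the maps above):

```python
# Hypothetical content of a workflow:<workflow_UUID> entry (illustrative only).
workflow_entry = {
    "instances": {
        "incr": {
            "kind": "function",
            "logical_uuid": "83e08333-4393-443e-992b-fe59f274d221",
            "physical_uuid": "eb537d95-f605-40c7-8f54-8b84879d8533",  # mutable: changes if the function migrates
            "annotations": {},
        },
        "http-ingress": {
            "kind": "resource",
            "class_type": "http-ingress",
            "logical_uuid": "32c6d750-07c8-4dfc-b2e8-7ad47d025f55",
            "physical_uuid": "917ada61-6022-4ab3-a33e-f604ae330a96",
        },
    },
    "dependencies": {
        "http-ingress": ["external_trigger"],
        "incr": ["double"],
    },
}
```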

Workflow execution

Every time a workflow is executed (an external resource is called and a sequence of functions then runs), the performance:<physical_UUID> entry in Redis gains a new line with a timestamp and the execution duration. AD can associate the physical_UUID with the logical_UUID and the node via the previously built maps, and pulls these entries every 2 seconds. However, older performance samples are NEVER deleted. For controlled benchmark scenarios this is not a problem, but long-lasting WFs with decent usage can accumulate so much data that they fill the memory of the Redis host. Performance samples need some kind of garbage collection/periodic removal to avoid this. Also, AD cannot know in advance how many timestamps are enough for inference: "list" entries in Redis allow querying the N latest entries, but not filtering on the values, so we cannot query the durations since timestamp X. This may be another reason to change this entry to another type, such as a sorted set (ZSET):

  Key name: duration:<function_physical_UUID>
  Entry format: timestamp:duration
  Score: timestamp
  Garbage collection: with a command like ZREMRANGEBYSCORE
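A minimal sketch of the proposed sorted-set layout (Python with redis-py; the duration:<physical_UUID> key follows the proposal above and does not exist today):

```python
import time
import redis

r = redis.Redis(decode_responses=True)
key = "duration:eb537d95-f605-40c7-8f54-8b84879d8533"  # hypothetical per-function key

# Producer side: one member per sample, "timestamp:duration", scored by the timestamp.
now = time.time()
r.zadd(key, {f"{now}:12.3": now})

# AD side: fetch only the samples newer than the last pull (2 seconds ago).
new_samples = r.zrangebyscore(key, now - 2.0, "+inf")

# Garbage collection: drop everything older than, e.g., one hour.
r.zremrangebyscore(key, "-inf", time.time() - 3600)
```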

ccicconetti commented 1 week ago

> From an Anomaly Detection standpoint, this is the expected interaction with the currently available metrics:
>
> Edgeless starts
>
> Anomaly detection sees node: and provider: entries. There are two possible approaches to these metrics:
>
>   1. Pull node:capabilities:* and provider:* once, and node:health:* every 2 seconds. Suitable for predictable tests.
>   2. Pull node:* and provider:* every 2 seconds. The number of nodes is never going to be very high, so this is OK; it is just redundant to process static data every time.
>
> Suggestion: provide an alternative way to notify cluster changes, perhaps an entry like node_list. AD would periodically pull only that (plus the health metrics) and thus detect any change in the Orchestration Domain nodes. The resource providers of each node cannot currently change dynamically, so they do not pose any problem.

Alternative: use Redis keyspace notifications.
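A rough sketch of that alternative (Python with redis-py; keyspace notifications are disabled by default, so they must first be enabled in the Redis configuration):

```python
import redis

r = redis.Redis(decode_responses=True)

# Enable keyspace notifications (can also be set in redis.conf).
r.config_set("notify-keyspace-events", "KEA")

p = r.pubsub()
# Be notified whenever a node:* key in database 0 is created, updated or deleted,
# instead of polling every 2 seconds.
p.psubscribe("__keyspace@0__:node:*")

for msg in p.listen():
    if msg["type"] == "pmessage":
        # msg["channel"] is "__keyspace@0__:<key>", msg["data"] is the command (e.g. "set", "del").
        print(msg["channel"], msg["data"])
```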

> Workflow registration
>
> New metrics dependency: and instance: appear. There is no clean way to just check the current workflows in the cluster, so the AD has to pull every dependency:* and instance:* entry every 2 seconds, in case a new one is added. ... Suggestion: same as before, but more urgently: there is no clean way of inferring the number of WFs in the cluster and their associated functions. The number of WFs has to be deduced by building the dependency maps, which is redundant work every 2 seconds. One way to solve this would be to create entries workflow:<UUID> holding all the information about a workflow's instances (functions and resources): logical UUIDs, physical UUIDs, dependencies and annotations/configurations. The only mutable WF information is the physical UUIDs, since functions can be moved across nodes; pulling this information every 2 seconds, without having to rebuild the dependency map, should not be much effort.

By design, the ε-ORC does not know about workflows.

The information about the workflow composition is available only at the ε-CON.

For now, workflow information cannot be made available in the local Redis.

In the (near) future, i.e., when the ε-CON is implemented, workflow information will be made available in the "global observability" platform (possibly another Redis in-memory database).

> Workflow execution
>
> Every time a workflow is executed (an external resource is called and a sequence of functions then runs), the performance:<physical_UUID> entry in Redis gains a new line with a timestamp and the execution duration. AD can associate the physical_UUID with the logical_UUID and the node via the previously built maps, and pulls these entries every 2 seconds. However, older performance samples are NEVER deleted. For controlled benchmark scenarios this is not a problem, but long-lasting WFs with decent usage can accumulate so much data that they fill the memory of the Redis host. Performance samples need some kind of garbage collection/periodic removal to avoid this. Also, AD cannot know in advance how many timestamps are enough for inference: "list" entries in Redis allow querying the N latest entries, but not filtering on the values, so we cannot query the durations since timestamp X. This may be another reason to change this entry to another type, such as a sorted set (ZSET): key name duration:<function_physical_UUID>, entry format timestamp:duration, score timestamp, garbage collection with a command like ZREMRANGEBYSCORE.

The risk of accumulating too many samples: yes, this can be a problem. The proposed method, i.e., using a sorted set instead of a list, coupled with periodically removing old samples, could be a viable solution!