ccicconetti opened 3 weeks ago
@alvarocurt please add here other suggested changes
Sure, sorry for the delay. I'll rate each change from 1 to 5 in order of relevance:

- `resource_provider:<provider_type>:<provider_ID>` (2)
- `instance:resource:<provider_type>:<instance_logical_UUID>:info`. Suggested parsing is in the slides. (2)
- `instance:function:<function_type>:<instance_logical_UUID>:info`. Suggested parsing is in the slides. (2)
- `instance:resource:<provider_type>:<instance_logical_UUID>:<metric_name>`
- `dependency:` entries have information that could be included in the `instance:...:info` entries. (2)
- `node:health:<node_UUID>:<timestamp>`. This will require adding a TTL to the health status entries, for their automatic removal (see the sketch after this list).
- `node:function_executions:<function_physical_UUID>`. I'll try to think of a better way to correlate physical UUIDs with logical UUIDs with the least pain possible for everybody. (2)
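A minimal sketch of the TTL idea for the health entries, assuming redis-py and the key layout above (the node UUID, status fields, and the 6-second TTL are illustrative):

```python
import json
import time

import redis

r = redis.Redis(decode_responses=True)

def publish_health(node_uuid: str, status: dict, ttl_s: int = 6) -> None:
    """Write a health entry that Redis expires automatically after ttl_s seconds."""
    ts = int(time.time())
    # EX sets the TTL, so stale health entries vanish without manual cleanup.
    r.set(f"node:health:{node_uuid}:{ts}", json.dumps(status), ex=ttl_s)

publish_health("0b7e6c1a-example-node-uuid", {"cpu_load": 0.42, "mem_free_mb": 512})
```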
From an Anomaly Detection standpoint, this is the expected interaction with the currently available metrics:

**Edgeless starts**

Anomaly detection sees `node:` and `provider:` entries. Two approaches to these metrics:

- Pull `node:capabilities:*` and `provider:*` once, and `node:health:*` every 2 seconds. For predictable tests.
- Pull `node:*` and `provider:*` every 2 seconds. The number of nodes is never going to be too high, so this is ok, just redundant to process static data every time.

Suggestion: Provide an alternative way to notify cluster changes. Maybe an entry like `node_list`. AD will periodically pull only that (and the health metrics) and thus check whether there is any change in the Orchestration Domain nodes. The resource providers of each node can't currently change dynamically, so they do not pose any problem. (A polling sketch of the first approach follows.)
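A minimal polling sketch of the first approach, assuming redis-py and that the entries are plain string values (the 2-second period matches the text; error handling is omitted):

```python
import time

import redis

r = redis.Redis(decode_responses=True)

def pull(pattern: str) -> dict:
    """Fetch every key matching the pattern together with its value."""
    return {key: r.get(key) for key in r.scan_iter(match=pattern)}

# Static data: read once at startup.
capabilities = pull("node:capabilities:*")
providers = pull("provider:*")

# Dynamic data: poll the health entries every 2 seconds.
while True:
    health = pull("node:health:*")
    # ... feed `health` to the anomaly detector ...
    time.sleep(2)
```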
**Workflow registration**

New metrics `dependency:` and `instance:` appear. There is no clean way to just check the current workflows in the cluster, so the AD has to periodically pull every `dependency:*` and `instance:*` entry every 2 s, in case a new one is added.

- `instance:` entries give the current functions/resources deployed, and their node. There is no clean way to distinguish between functions and resources, other than that resources have a `class_type` field in the information.
- `dependency:` entries are used to form a map of the sequence of function UUIDs that may look like the following (a reconstruction sketch follows the diagrams):
```mermaid
flowchart LR
    http-ingress["32c6d750-07c8-4dfc-b2e8-7ad47d025f55"] --> external_trigger["5f655923-7c37-4ff9-bbcd-59097aef13ea"]
    external_trigger --> incr["83e08333-4393-443e-992b-fe59f274d221"]
    incr --> double["baaf9af1-8644-466a-935f-1e284ebf1ecb"]
    double --> external_sink["cf0b4934-8a6c-49b4-995b-3a7f395a6ea4"]
    external_sink --> http-egress["8eb6883a-478a-442b-bb17-daa489c31b76"]
```

```mermaid
flowchart LR
    http-ingress["917ada61-6022-4ab3-a33e-f604ae330a96"] --> external_trigger["ff5316e1-c91e-455e-b837-5866dcdadf3e"]
    external_trigger --> incr["eb537d95-f605-40c7-8f54-8b84879d8533"]
    incr --> double["4f0f0562-bc89-4ad4-a892-a2b0179eb3de"]
    double --> external_sink["88f9fa97-0dea-4c27-a2f0-3af7be1747e8"]
    external_sink --> http-egress["e363abab-41bc-4507-bb20-62a5c2426c9e"]
```
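For illustration, a sketch of how the AD could rebuild such a map; note that the value format of `dependency:` entries is an assumption here (a JSON object mapping output channels to the next logical UUIDs), not something confirmed above:

```python
import json

import redis

r = redis.Redis(decode_responses=True)

def build_dependency_map() -> dict:
    """Adjacency map: logical UUID -> list of logical UUIDs it calls next."""
    graph = {}
    for key in r.scan_iter(match="dependency:*"):
        src = key.split(":", 1)[1]
        # ASSUMED value format: {"<output_channel>": "<next_logical_UUID>", ...}
        targets = json.loads(r.get(key))
        graph[src] = list(targets.values())
    return graph

# Each workflow is then a connected component of this graph, which is
# exactly the redundant reconstruction criticized below.
```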
Suggestion: Same as before, but more urgently: there is no pretty way of inferring the number of WFs in the cluster and their associated functions. The number of WFs needs to be deduced by building the dependency maps, which is a redundant process every 2 seconds. A way to solve this would be creating entries `workflow:<UUID>` with all the information of its instances (functions and resources): logical UUIDs, physical UUIDs, dependencies and annotations/configurations (a sketch of what such an entry could hold is given below).
The only WF information that is mutable is the physical UUIDs, as functions can be moved across nodes. But again, pulling this information every 2 seconds without having to rebuild the dependency map shouldn't be much effort.
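A hypothetical shape for such a `workflow:<UUID>` entry, stored as JSON under a single key; every field name here is illustrative, not an existing EDGELESS format:

```python
import json

import redis

r = redis.Redis(decode_responses=True)

# Hypothetical workflow entry: everything the AD needs in a single pull.
workflow = {
    "instances": [
        {
            "logical_uuid": "83e08333-4393-443e-992b-fe59f274d221",
            "physical_uuids": [],   # mutable: functions can move across nodes
            "kind": "function",     # or "resource" (today inferred via class_type)
            "annotations": {},
        },
    ],
    "dependencies": {
        "83e08333-4393-443e-992b-fe59f274d221": ["baaf9af1-8644-466a-935f-1e284ebf1ecb"],
    },
}
r.set("workflow:<UUID>", json.dumps(workflow))
```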
**Workflow execution**

Every time a workflow is executed (when an external resource is called, and a sequence of functions is then executed), the `performance:<physical_UUID>` entry in the Redis gets a new line with timestamp and execution duration. AD can associate the physical_UUID with the logical_UUID and the node via the previously known maps, and pulls these entries every 2 seconds.
However, older performance times are NEVER deleted. For controlled benchmark scenarios this is not a problem, but long-lasting WFs with decent usage can accumulate an excessive amount of information that can fill the memory on the Redis host. Performance samples need some kind of garbage collection/periodic removal to avoid this.
Also, the AD can't just know how many timestamps are enough for doing inference. "List" entries in Redis allow for querying the N latest entries, but not filtering by value, so we can't query the durations since timestamp X.
This may be another reason to change this entry to another type, like sorted sets/ZSET (see the sketch below):

- Key name: `duracion:<function_physical_UUID>`
- Entry format: `timestamp:duration`
- Score: timestamp
- Garbage collection: with a command like `ZREMRANGEBYSCORE`
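A minimal sketch of the proposed sorted-set scheme with redis-py (key and member layout as in the list above; the one-hour retention window is illustrative):

```python
import time

import redis

r = redis.Redis(decode_responses=True)
KEY = "duracion:<function_physical_UUID>"  # placeholder name from the proposal

def record(duration_s: float) -> None:
    """Store one sample as member 'timestamp:duration', scored by timestamp."""
    ts = time.time()
    r.zadd(KEY, {f"{ts}:{duration_s}": ts})

def samples_since(ts: float) -> list:
    """Range query by score: every duration since timestamp ts (lists cannot do this)."""
    return r.zrangebyscore(KEY, ts, "+inf")

def gc(max_age_s: float = 3600.0) -> None:
    """Drop samples older than max_age_s seconds via ZREMRANGEBYSCORE."""
    r.zremrangebyscore(KEY, "-inf", time.time() - max_age_s)
```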
> From an Anomaly Detection standpoint, this is the expected interaction with the currently available metrics: [...] Suggestion: Provide an alternative way to notify cluster changes. Maybe an entry like `node_list`. [...]
Alternative: use Redis keyspace notifications.
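For illustration, a keyspace-notification sketch with redis-py; the channel pattern and configuration follow standard Redis behavior, and the watched key is the `node_list` entry proposed above:

```python
import redis

r = redis.Redis(decode_responses=True)

# Enable keyspace notifications ("KEA" turns on all event classes;
# a production config would typically enable only what is needed).
r.config_set("notify-keyspace-events", "KEA")

p = r.pubsub()
# Fires whenever the node_list key is touched, in any database.
p.psubscribe("__keyspace@*__:node_list")

for msg in p.listen():
    if msg["type"] == "pmessage":
        # msg["data"] is the command that touched the key, e.g. "set".
        print("node_list changed:", msg["data"])
        # ... re-pull node_list and diff against the known cluster nodes ...
```

This would replace the 2-second polling of static data with push-style change events.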
> Workflow registration [...] Suggestion: Same as before, but more urgently [...] A way to solve this would be creating entries `workflow:<UUID>` with all the information of its instances (functions and resources). [...]
By design, the ε-ORC does not know about workflows.
The information about the workflow composition is available only at the ε-CON.
For now, workflow information cannot be made available in the local Redis.
In the (near) future, i.e., when the ε-CON is implemented, workflow information will be made available in the "global observability" platform (possibly another Redis in-memory database).
> Workflow execution [...] However, older performance times are NEVER deleted. [...] This may be another reason to change this entry to another type, like sorted sets/ZSET. [...]
The risk of accumulating too many samples: yes, this can be a problem. The proposed method, i.e., using a sorted set instead of a list, coupled with periodically removing old samples, could be a viable solution!