Background
On the Instance Dashboard we noticed that for some instances, despite there being many RPC failures, there were no recorded metrics for octopus_tentacle_halibut_RPC_active_calls:
Results
Before
Halibut considered "active RPC calls" to cover only the actual RPC portion of the communication, i.e. the point at which we are performing the call itself, excluding connecting etc.
This is why we never saw anything in octopus_tentacle_halibut_RPC_active_calls when the Tentacle did not exist: the call failed during the "connection" phase and therefore never reached the section where the call itself is performed.
For example, here is an extract of the metrics after a failed health check, where the Tentacle was turned off. Note the lack of octopus_tentacle_halibut_RPC_active_calls:
After
We now consider an RPC call to be "active" for the whole time Halibut takes to "perform the entire remote call process". This includes the time it takes to connect.
In the case of a polling tentacle, this includes the time it takes to queue the item, wait for it to be dequeued and processed, and for a result to be set (or for a timeout to occur).
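To illustrate the change in scope, here is a minimal sketch in Python. It is not Halibut's actual code: `Gauge`, `connect`, `perform_call`, `call_before`, `call_after` and `samples_for` are all hypothetical names, and the in-memory gauge only stands in for a real metrics client. It assumes the metric behaves like a gauge that is incremented while a call is in scope.

```python
from contextlib import contextmanager


class Gauge:
    """Hypothetical in-memory gauge standing in for a real metrics client."""

    def __init__(self):
        self.value = 0
        self.samples = []  # each value the gauge took when a call became active

    @contextmanager
    def track(self):
        self.value += 1
        self.samples.append(self.value)
        try:
            yield
        finally:
            # Always decrement, even if the tracked work raised.
            self.value -= 1


def connect(tentacle_online):
    """Hypothetical connection step; fails when the Tentacle is turned off."""
    if not tentacle_online:
        raise ConnectionError("Tentacle is not reachable")


def perform_call():
    """Hypothetical stand-in for the RPC itself."""
    return "ok"


def call_before(gauge, tentacle_online):
    # Old scope: connecting happens outside the tracked section,
    # so a connection failure never touches the gauge.
    connect(tentacle_online)
    with gauge.track():
        return perform_call()


def call_after(gauge, tentacle_online):
    # New scope: the whole remote call process is tracked,
    # including the time spent connecting.
    with gauge.track():
        connect(tentacle_online)
        return perform_call()


def samples_for(call, tentacle_online):
    """Run one call against a fresh gauge and return what it recorded."""
    gauge = Gauge()
    try:
        call(gauge, tentacle_online)
    except ConnectionError:
        pass
    return gauge.samples
```

With the Tentacle turned off, `samples_for(call_before, False)` records nothing, while `samples_for(call_after, False)` records a single sample with a count of 1. With the Tentacle online, both variants record one sample.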
Now, after a failed health check, we will see a metric recorded for octopus_tentacle_halibut_RPC_active_calls:
During a failed attempt, note that the count will be 1, as this call is now "active":
How to review this PR
Quality :heavy_check_mark:
Pre-requisites
[ ] I have read How we use GitHub Issues for help deciding when and where it's appropriate to make an issue.
[ ] I have considered informing or consulting the right people, according to the ownership map.
[ ] I have considered appropriate testing for my change.
[sc-64938]