Opened by pcuq-ads
IVV_CCB_2023_w06 : Accepted Werum, Priority minor
Werum_CCB_2023_w06 : Product backlog
`/actuator/prometheus`. Where is this metric expected to be found?
@pcuq-ads : As Julian pointed out already, the metrics seem to be exposed correctly. There are three parameters that can be set and configured; please check the common configuration: https://github.com/COPRS/processing-sentinel-1/tree/develop/docs/common#prometheus-metrics-configuration
We checked the endpoint in the operational environment today and it seems to expose the data correctly as well. Either the metrics are expected in a different format than initially requested, or we are misunderstanding something, or the scraper (or upstream components) might not process it correctly.
@w-fsi : I checked the configuration and the Prometheus metrics.

AIO:
```
app.preparation-worker.metrics.mission=S1
app.preparation-worker.metrics.level=0
app.preparation-worker.metrics.addonName=l0aiop
```

L0ASP:
```
app.preparation-worker.metrics.mission=S1
app.preparation-worker.metrics.level=0
app.preparation-worker.metrics.addonName=l0asp
```

S3 L0P:
```
app.preparation-worker.metrics.mission=S3
app.preparation-worker.metrics.level=0
app.preparation-worker.metrics.addonName=l0p
```

But there are no metrics for S1 and S3 processing.
@pcuq-ads how are those metrics retrieved? Via a curl on `/actuator/prometheus` the metric `rs_pending_processing_job` is available.
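For reference, the kind of check meant here would be something along the lines of `curl -s http://<pod-ip>:<management-port>/actuator/prometheus | grep rs_pending_processing_job`, where the pod address and port are placeholders for the actual service endpoint.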
@w-jka : our main monitoring tool is Grafana. Grafana retrieves all monitoring information from data sources: PostgreSQL (PI and KPI), Elasticsearch (trend analysis), Loki (logs), Prometheus (metrics).
Our need is to get the metric from Grafana. By opening the Prometheus UI we checked that `rs_pending_processing_job` is missing for S1 and S3. It is OK for S2. As a consequence, the value cannot be retrieved from Grafana.
Some configuration must be missing for this metric to be collected by Prometheus.
Werum_CCB_2023_w07 : @pcuq-ads @Woljtek could you give more details about this issue, especially how Prometheus harvests this metric?
@nleconte-csgroup : do you deploy a specific configuration into Prometheus to enable it to scrape the `rs_pending_processing_job` metric for S2? It works well for S2, but the S1 and S3 implementation is not integrated on the RS platform.
@pcuq-ads - Thanks to @pfabre-csgroup, I can answer you. The implementation was done as agreed during the engineering meeting. If you want to check, the code is here:
And the imported libs :
The implementation is exactly the same for S1 and S3, and the same libraries are used. We still suggest that the error lies in the configuration of Prometheus, as it seems that the metrics of S1 and S3 are not being scraped correctly.
I had a quick look on the OPS side and the metric names are different in the `/actuator/metrics/` endpoint:

- S1 and S3: `rs_pending_processing_job`
- S2: `rs.pending.processing.job`

It looks like Prometheus is scraping metrics with `.` in the name and exposing them with `_`. That looks like normal behaviour because all the other SCDF metrics are like that. I think S1 and S3 should expose the metric with `.` and not with `_` in the name.
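To illustrate the naming behaviour described above, here is a minimal, self-contained Micrometer sketch (this is not the COPRS code; the counter value is made up, and it assumes the standard `micrometer-registry-prometheus` dependency): a meter registered with a dot-separated name is rendered with underscores when scraped through the Prometheus registry.

```java
import io.micrometer.core.instrument.Gauge;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

import java.util.concurrent.atomic.AtomicInteger;

public class MetricNamingDemo {
    public static void main(String[] args) {
        // Standalone Prometheus-flavoured Micrometer registry (the same registry type
        // Spring Boot wires up behind /actuator/prometheus).
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        // Stand-in for the number of pending processing jobs.
        AtomicInteger pendingJobs = new AtomicInteger(42);

        // Registered with the dot-separated Micrometer name, as on the S2 side.
        Gauge.builder("rs.pending.processing.job", pendingJobs, AtomicInteger::get)
                .tag("mission", "S2")
                .register(registry);

        // scrape() produces the Prometheus text exposition format; the dotted name
        // comes out with underscores, e.g. rs_pending_processing_job{mission="S2",} 42.0
        System.out.println(registry.scrape());
    }
}
```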
Okay, at least it seems like this might explain the existing observations and why the exposed metrics are not consumed as expected.
@Woljtek The story #513 and the related ones are asking for the underscore and not for a dot in between. Can you please confirm that the S2 approach is the right one?
From the US #514 "Expose S2 pending processing as gauge metric":

- The metric `rs_pending_processing_job` has a `mission` label (S2)
- The metric `rs_pending_processing_job` has a `level` label (0, 1, 2, ...)
- The metric `rs_pending_processing_job` has an `addonName` label (l0u, l0c, ...)

The expected name was `rs_pending_processing_job`.
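Purely for illustration (a sketch, not the actual preparation-worker implementation), a gauge carrying those three labels could be registered with Micrometer roughly like this, using the dot-separated name that the Prometheus registry then renders as `rs_pending_processing_job`:

```java
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;

import java.util.function.Supplier;

public class PendingJobMetric {

    /**
     * Registers the pending-processing-job gauge with the three labels listed in the US.
     * The supplier is a stand-in for however the preparation worker counts its pending jobs.
     */
    public static void register(MeterRegistry registry,
                                String mission,     // e.g. S1, S2, S3
                                String level,       // e.g. 0, 1, 2
                                String addonName,   // e.g. l0aiop, l0asp, l0p
                                Supplier<Number> pendingJobCount) {
        Gauge.builder("rs.pending.processing.job", pendingJobCount)
                .tag("mission", mission)
                .tag("level", level)
                .tag("addonName", addonName)
                .register(registry);
    }
}
```

Scraped through the Prometheus registry, this would show up as `rs_pending_processing_job{mission="S1",level="0",addonName="l0asp"}` (values taken from the configuration above), matching what the S2 chain already exposes.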
But the name that is kept does not matter. The question is: which side is easier to fix?
I propose to review this issue in the next system CCB to make the choice together.
WEEKLY 17/02: Decision => WERUM is going to change to `rs.pending.processing.job`.
Metric naming changed according to decision.
SYS_CCB_2023_w13 : The issue is still there with the following versions:
Only the S2 chain metrics are retrieved by Prometheus.
First, the fix was provided in February, so 1.6.1 and 1.10 will not contain it, which would explain why it is not available for these versions, as they are still using the same pattern described in the original story.
For the S1 chains that are using the version that does include it, we don't see any issue in the operational environment (currently it seems not possible to attach screenshots): a curl on the `/actuator/metrics` endpoint lists `rs.pending.processing.jobs`, and a curl on the `/actuator/prometheus` endpoint contains the pending jobs as well.
Can you specify what exactly occurs with the S1? We implemented it according to the specification and it is likely that there is still some kind of miscommunication as it is even working in the ops environment.
We use the latest version in operation.
Here is the procedure to see the problem in operation.
As you can see, only Sentinel-2 metrics are available.
I checked the properties for L0ASP.
@pcuq-ads : There is no new input that we can provide. When executing the curl on the services themselves, they provide this data for us as well, so the interface is exposed. The only explanation we have is that it is not harvested correctly or that the described interface is still not correct. Do you have any logs from the harvesting system that might indicate why the data is not there, or what endpoint it expects?
Thank you @w-fsi, I see that you have implemented the recommendation from @nleconte-csgroup, i.e. `rs.pending.processing.job` instead of `rs_pending_processing_job`.
@nleconte-csgroup, can you have a look at this Prometheus integration problem?
Regards
Hey @pcuq-ads @w-fsi I think S1 and S3 may be missing the `prometheus-rsocket-client` dependency. See https://github.com/micrometer-metrics/prometheus-rsocket-proxy and https://github.com/COPRS/processing-sentinel-2/blob/60bdc69cb2e552e3c5e185c07623500548ff9abb/apps/build.gradle#L102
And the client dependency will expose the interface at a different location? We are using Micrometer as discussed in the meeting, and the default location to expose it seems to be `/actuator/prometheus`. This is working fine, as you can see on the screenshot from the ops environment:
We might give it a try to check whether the new dependency exposes it at a different location as well, but this issue might be easily solvable by modifying the endpoint that is polled.
My bad, it's not the `prometheus-rsocket-client` dependency that is missing but `prometheus-rsocket-spring`.
In fact, none of the metrics exposed in `/actuator/prometheus` from S1/S3 are found in Prometheus. For example, for `jvm_memory_max_bytes` we have the metric exposed on the pod:

While on S2:

The problem is not only with the metric `rs_pending_processing_job`.
The goal of `prometheus-rsocket-spring` is to push the exposed metrics to the Prometheus proxy from SCDF.
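For context (an assumption to be verified against the prometheus-rsocket-proxy README, not something confirmed in this thread): once the `prometheus-rsocket-spring` module is on the classpath, it should auto-configure a client that periodically pushes the contents of the `PrometheusMeterRegistry` to the proxy, with the proxy host and port taken from Spring configuration properties; beyond adding the dependency and those properties, no application code change should be required.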
The new dependency had been added on the current build of the development branch. However, I am not sure if this actually changed anything. Is there something we can check to verify if it is working as expected before doing another delivery?
You should see the client trying to connect to the proxy at regular intervals. You may only be able to observe it on a platform with SCDF configured, as it is delivered on RS.
Dear @nleconte-csgroup ,
Can you propose a pull request to help @w-fsi fix the issue? Perhaps something to add to the POM file:
Regards
@pcuq-ads : The dependency is already added; we are just not able to test it in our cluster and don't see any additional log or anything else that might indicate it is working.
SYS_CCB_w17 : the issue is still there.
RSRRv2_SystemCCB : issue minor. The production is not impacted. We live with it.
Delivered in Production-Common v1.13.1 (Refer to https://github.com/COPRS/production-common/releases/tag/1.13.1-rc1)
The issue is still present: no metrics are found in Prometheus.
I dug a little more and the only difference I see is that Werum exposes all actuator endpoints.
On S2, it exposes 7 endpoints by default:
And on S1/S3 it exposes 28 endpoints:
In the logs of the Spring Cloud rsocket proxy, we had a lot of out-of-memory errors:
Maybe because it has too many endpoints and metrics to retrieve, the rsocket proxy is failing and cannot retrieve the metrics. I think it's worth a last try to either remove the exposure of all endpoints, or to temporarily remove the memory limits on the Spring Cloud rsocket proxy pods.
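As a concrete form of the first option (assuming the S1/S3 services use the standard Spring Boot actuator configuration, which is not confirmed here), endpoint exposure could be restricted with a property such as `management.endpoints.web.exposure.include=health,info,prometheus`, so that only the endpoints actually needed for monitoring are published.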
SYS_CCB_w29 : no solution found. The checked tag is present.
Werum_CCB_2023_w30: Implementation was done; the reason why it is still not working was not found. Won't fix by Werum.
Environment:
Current Behavior: Only the S2 chain provides `rs_pending_processing_job`.
Expected Behavior: All chains shall provide the metric `rs_pending_processing_job`.
Steps To Reproduce: Check the `rs_pending_processing_job` metric in Prometheus.
Test execution artefacts (i.e. logs, screenshots…)
Whenever possible, first analysis of the root cause. Hypothesis: the Prometheus metric endpoint is not activated for S1 and S3. We checked that the properties are well defined on the SCDF streams.
Bug Generic Definition of Ready (DoR)
Bug Generic Definition of Done (DoD)