COPRS / rs-issues

This repository contains all the issues of the COPRS project (Scrum tickets, IVV bugs, epics, ...)

[BUG] Metrics "rs_pending_processing_job" are not available for chains S1 and S3. #822

Open pcuq-ads opened 1 year ago

pcuq-ads commented 1 year ago

Environment:

Current Behavior: Only the S2 chain provides rs_pending_processing_job.

Expected Behavior: All chains shall provide the metric rs_pending_processing_job.

Steps To Reproduce: Check the rs_pending_processing_job metric in Prometheus.

Test execution artefacts (i.e. logs, screenshots…)

Request : rs_pending_processing_job
rs_pending_processing_job{addonName="l0c", application="s2-l0c-part2-pw-l0c-processor", application_guid="092c7307-9172-4c43-9cae-f11c77eaea63", application_name="pw-l0c", application_type="processor", container="prometheus-proxy", endpoint="http", instance="10.244.226.94:8080", instance_index="0", job="spring-cloud-dataflow-prometheus-proxy", level="0", mission="S2", namespace="processing", pod="spring-cloud-dataflow-prometheus-proxy-64758754c5-2rrkd", service="spring-cloud-dataflow-prometheus-proxy", stream_name="s2-l0c-part2"}
    0
rs_pending_processing_job{addonName="l0u", application="s2-l0u-part1-pw-l0u-processor", application_guid="c38522d0-9a45-4aaa-bac2-e12b92757c46", application_name="pw-l0u", application_type="processor", container="prometheus-proxy", endpoint="http", instance="10.244.226.94:8080", instance_index="0", job="spring-cloud-dataflow-prometheus-proxy", level="0", mission="S2", namespace="processing", pod="spring-cloud-dataflow-prometheus-proxy-64758754c5-2rrkd", service="spring-cloud-dataflow-prometheus-proxy", stream_name="s2-l0u-part1"}

Whenever possible, first analysis of the root cause: Hypothesis: the Prometheus metric endpoint is not activated for S1 and S3. We checked that the properties are well defined on the SCDF streams.
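For context, a minimal sketch of what activating the endpoint would look like with the standard Spring Boot Actuator properties (whether the chains rely on exactly these properties is an assumption here, not confirmed from the chain configuration):

# expose the /actuator/prometheus endpoint over HTTP (standard Spring Boot Actuator property)
management.endpoints.web.exposure.include=prometheus
# keep the Prometheus actuator endpoint itself enabled (true by default)
management.endpoint.prometheus.enabled=true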

Bug Generic Definition of Ready (DoR)

Bug Generic Definition of Done (DoD)

LAQU156 commented 1 year ago

IVV_CCB_2023_w06 : Accepted Werum, Priority minor

LAQU156 commented 1 year ago

Werum_CCB_2023_w06 : Product backlog

w-jka commented 1 year ago

#513 shows how to scrape this metric from the S1 and S3 components. The metric is available on the REST endpoint /actuator/prometheus. Where is this metric expected to be found?

w-fsi commented 1 year ago

@pcuq-ads : As Julian already pointed out, the metric seems to be exposed correctly. There are three parameters that can be set and configured; please check the common configuration: https://github.com/COPRS/processing-sentinel-1/tree/develop/docs/common#prometheus-metrics-configuration

We checked the endpoint in the operational environment today and it also seems to be exposing the data correctly. Either the metrics are expected in a different format than initially requested, or we are misunderstanding something, or the scraper (or upstream components) might not process them correctly.

pcuq-ads commented 1 year ago

@w-fsi : I checked the configuration and the Prometheus metrics. AIO:

app.preparation-worker.metrics.mission=S1
app.preparation-worker.metrics.level=0
app.preparation-worker.metrics.addonName=l0aiop

L0ASP

app.preparation-worker.metrics.mission=S1
app.preparation-worker.metrics.level=0
app.preparation-worker.metrics.addonName=l0asp

S3 L0P

app.preparation-worker.metrics.mission=S3
app.preparation-worker.metrics.level=0
app.preparation-worker.metrics.addonName=l0p

But there are no metrics for S1 and S3 processing.

image.png

w-jka commented 1 year ago

@pcuq-ads How are those metrics retrieved? Via a curl on /actuator/prometheus, the metric rs_pending_processing_job is available.

pcuq-ads commented 1 year ago

@w-jka : Our main monitoring tool is Grafana. Grafana retrieves all monitoring information from data sources: PostgreSQL (PI and KPI), ES (trend analysis), Loki (logs), Prometheus (metrics).

Our need is to get the metric from Grafana. By opening the Prometheus UI we checked that rs_pending_processing_job is missing for S1 and S3; it is OK for S2. Consequently, the value is not available from Grafana either.

Some configuration must be missing for this metric to be collected by Prometheus.

LAQU156 commented 1 year ago

Werum_CCB_2023_w07 : @pcuq-ads @Woljtek could you give more details about this issue, especially how Prometheus harvests this metric?

pcuq-ads commented 1 year ago

@nleconte-csgroup : Do you deploy a specific configuration in Prometheus to enable it to scrape the rs_pending_processing_job metric for S2? It works well for S2, but the S1 and S3 implementation is not integrated on the RS platform.

nleconte-csgroup commented 1 year ago

@pcuq-ads - Thanks to @pfabre-csgroup , I can answer you. The implementation was done as we agreed during the engineering meeting. If you want to check, the code is here:

And the imported libs:

w-jka commented 1 year ago

The implementation is exactly the same for S1 and S3, and the same libraries are being used. We still suggest that the error lies in the configuration of Prometheus, as it seems that the metrics of S1 and S3 are not being scraped correctly.

nleconte-csgroup commented 1 year ago

I had a quick look on the OPS side and the metric names are different in the /actuator/metrics/ endpoint:
S1 and S3: rs_pending_processing_job
S2: rs.pending.processing.job

It looks like Prometheus is scraping metrics with . in the name and exposing them with _. This seems to be normal behaviour, because all the other SCDF metrics are like that.

I think S1 and S3 should expose the metric with . and not with _ in the name.
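For illustration, a minimal standalone Micrometer sketch of that behaviour (the label values are made up for the example; this is not the actual preparation-worker code): a gauge registered under the dotted name is rendered with an underscored name by the Prometheus registry.

import io.micrometer.core.instrument.Gauge;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

import java.util.concurrent.atomic.AtomicInteger;

public class PendingJobMetricNamingSketch {
    public static void main(String[] args) {
        // Standalone Prometheus registry, standing in for the one auto-configured by Spring Boot.
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
        AtomicInteger pendingJobs = new AtomicInteger(3);

        // Register the gauge under the dotted Micrometer name, with the labels requested in #514.
        Gauge.builder("rs.pending.processing.job", pendingJobs, AtomicInteger::get)
                .tag("mission", "S1")        // example label values only
                .tag("level", "0")
                .tag("addonName", "l0aiop")
                .register(registry);

        // The Prometheus text format replaces the dots with underscores, so the scrape output
        // contains a line like: rs_pending_processing_job{addonName="l0aiop",level="0",mission="S1",} 3.0
        System.out.println(registry.scrape());
    }
}

So whichever side changes, the dotted Micrometer name and the underscored Prometheus name refer to the same meter; only the name queried in Prometheus/Grafana has to match the rendered form.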

w-fsi commented 1 year ago

Okay, at least it seems like this might explain the existing observations and why the exposed metrics are not consumed as expected.

@Woljtek The story #513 and related ones are asking for the underscore and not for a dot in between. Can you please confirm that the S2 approach is the right one?

Woljtek commented 1 year ago

From the US #514 Expose S2 pending processing as gauge metric:

The metric "rs_pending_processing_job" has a "mission" label (S2) The metric "rs_pending_processing_job" has a "level" label (0, 1, 2, ...) The metric "rs_pending_processing_job" has a "addonName " label (l0u, l0c; ...)

The expected name was rs_pending_processing_job.

But the name that is kept does not matter. The question is: on which side is it easier to fix?

I propose to review this issue in the next system CCB to make the choice together.

Woljtek commented 1 year ago

WEEKLY 17/02: Decision => WERUM is going to change to rs.pending.processing.job.

w-fsi commented 1 year ago

Metric naming changed according to decision.

pcuq-ads commented 1 year ago

SYS_CCB_2023_w13 : The issue is still there with the following version:

Only the S2 chain's metrics are retrieved by Prometheus. image.png

w-fsi commented 1 year ago

First, the fix was provided in February, so 1.6.1 and 1.10 will not contain it, which would explain why it is not available for these versions, as they are still using the same pattern described in the original story.

For the S1 components that are using the version that does include it, we don't see any issue in the operational environment. (Currently it seems not possible to attach screenshots.) A curl on the /actuator/metrics endpoint lists "rs.pending.processing.jobs", and a curl on the /actuator/prometheus endpoint contains the pending jobs as well.

Can you specify what exactly occurs with the S1? We implemented it according to the specification, and it is likely that there is still some kind of miscommunication, as it is even working in the ops environment.

pcuq-ads commented 1 year ago

We use the latest version in operation. image.png image.png

Here is the procedure to see the problem in operation.

  1. Start Prometheus
  2. Find the metric: rs_pending_processing_job
  3. Check the result image.png

As you can see, only Sentinel-2 metrics are available.

I checked the properties for L0ASP image.png

w-fsi commented 1 year ago

@pcuq-ads : There is no new input that we can provide. When executing the curl on the services themselves, they provide these data for us as well, so the interface is exposed. The only explanation we have is that it is not harvested correctly or that the described interface is still not correct. Do you have any logs from the harvesting system that might indicate why the data is not there, or which endpoint it expects?

pcuq-ads commented 1 year ago

Thank you @w-fsi , I see that you have implemented the recommendation from @nleconte-csgroup , i.e. rs.pending.processing.job instead of rs_pending_processing_job.

@nleconte-csgroup , can you have a look at this Prometheus integration problem?

Regards

nleconte-csgroup commented 1 year ago

Hey @pcuq-ads @w-fsi I think S1 and S3 may be missing the prometheus-rsocket-client dependency. See https://github.com/micrometer-metrics/prometheus-rsocket-proxy and https://github.com/COPRS/processing-sentinel-2/blob/60bdc69cb2e552e3c5e185c07623500548ff9abb/apps/build.gradle#L102

w-fsi commented 1 year ago

And the client dependency will expose the interface at a different location? We are using Micrometer as discussed in the meeting, and the default location to expose it seems to be /actuator/prometheus. This is working fine, as you can see in the screenshot from the ops environment:

Screenshot_20230404_083748

We might give it a try to check if the new dependency will expose it at a different location as well, but this issue might be easily solvable by modifying the endpoint that is polled.

nleconte-csgroup commented 1 year ago

My bad, it's not the prometheus-rsocket-client dependency but prometheus-rsocket-spring that is missing.

In fact, none of the metrics exposed in /actuator/prometheus from S1/S3 are found in Prometheus. For example, with jvm_memory_max_bytes, we have the metric exposed on the pod: image

image

While on S2 : image

The problem is not only for the metric rs_pending_processing_job.

The goal of prometheus-rsocket-spring is to push the exposed metrics to the Prometheus proxy from SCDF.

w-fsi commented 1 year ago

The new dependency had been added on the current build of the development branch. However, I am not sure if this actually changed anything. Is there something we can check to verify if it is working as expected before doing another delivery?

nleconte-csgroup commented 1 year ago

The new dependency had been added on the current build of the development branch. However, I am not sure if this actually changed anything. Is there something we can check to verify if it is working as expected before doing another delivery?

You should see the client trying to connect to the proxy at regular intervals. You may only be able to observe it on a platform with SCDF configured, as it is delivered on RS.

pcuq-ads commented 1 year ago

Dear @nleconte-csgroup ,

Can you propose a pull request to help @w-fsi fix the issue? Perhaps something to add to the POM file:

Regards

w-fsi commented 1 year ago

@pcuq-ads : The dependency is already added; we are just not able to test it in our cluster and don't see any additional log or anything else that might indicate it is working.

pcuq-ads commented 1 year ago

SYS_CCB_w17 : the issue is still there.

pcuq-ads commented 1 year ago

RSRRv2_SystemCCB : issue minor. The production is not impacted. We live with it.

vgava-ads commented 1 year ago

Delivered in Production-Common v1.13.1 (Refer to https://github.com/COPRS/production-common/releases/tag/1.13.1-rc1)

nleconte-csgroup commented 1 year ago

The issue is still present: no metrics are found in Prometheus.

I dug a little more and the only difference I see is that Werum exposes all endpoints: image.png

On S2, it exposes by default 7 endpoints : image.png

And on S1/S3 it exposes 28 endpoints : image.png

In the logs of the Spring Cloud rsocket proxy, we had a lot of out-of-memory errors: image.png

Maybe because it has too many endpoints and metrics to retrieve, the rsocket proxy is failing with errors and cannot retrieve the metrics. I think it's worth a last try to either remove the exposure of all endpoints, or to temporarily remove the memory limits on the Spring Cloud rsocket proxy pods.
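As an illustration of the first option, a hedged sketch using the standard Spring Boot Actuator exposure property (the endpoint list below is only an example and would have to be aligned with what S2 actually exposes):

# instead of exposing every actuator endpoint:
# management.endpoints.web.exposure.include=*
# expose only the endpoints that are actually needed
management.endpoints.web.exposure.include=health,info,prometheus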

pcuq-ads commented 1 year ago

SYS_CCB_w29 : no solution found. Tag checked is present.

w-fsi commented 1 year ago

Werum_CCB_2023_w30: The implementation was done; the reason why it is still not working was not found. Won't fix by Werum.