ITISFoundation / osparc-simcore

🐼 osparc-simcore simulation framework
https://osparc.io
MIT License

metrics for number of services started is only showing legacy dynamic services #3336

Closed sanderegg closed 1 year ago

sanderegg commented 2 years ago

Only the metrics from director-v0 are fetched; nothing about the new dy-sidecar-powered services is included.

mrnicegyu11 commented 2 years ago

Any news here guys? @sanderegg @GitHK

The metric in question uses the following timeseries in its PromQL query: simcore_simcore_service_director_services_started_total

This comes directly from simcore, namely the director (presumably v0) and webserver.

So I guess this ticket requires changes in the backend. I would unassign myself for now and follow up once the exported metrics contain the new dy-sidecar services ;) Let me know if this is aligned with your plans.

sanderegg commented 2 years ago

@mrnicegyu11 no. following the discussion from yesterday with @elisabettai, if you can get the metrics without using this information it is not needed anymore. Let me know and then we can add these also from the director-v2.

elisabettai commented 2 years ago

Hi @sanderegg, would it be possible to get information (e.g. when services are started) for the new dynamic services? I was trying to play a bit in Graylog, but I don't actually know where to start from.

sanderegg commented 2 years ago

@elisabettai , @mrnicegyu11 was saying he had a method for getting them. Is that still current? otherwise I need to add the call in the dynamic sidecar. This should not take too long (maybe 1 day or we could even ask @GitHK to do it)

GitHK commented 2 years ago

> @elisabettai , @mrnicegyu11 was saying he had a method for getting them. Is that still current? otherwise I need to add the call in the dynamic sidecar. This should not take too long (maybe 1 day or we could even ask @GitHK to do it)

I can give a hand with this. Just let me know how it should work and where I can find something similar.

mrnicegyu11 commented 2 years ago

Hahaha :D @sanderegg true, so us two had a talk about this, and I was very confident I could extract this information because you can visually see when a service was started and when it stopped. You were interested but less optimistic. Turns out it is pretty much impossible, you had the right gut feeling ;)

I thoroughly pursued this, but it turns out that the difference between "no timeseries" (= no service running) and "there exists a timeseries" for a specified time cannot be extracted using PromQL commands.

There are some options, I guess, depending on what precisely is needed. For example, if the total number of hours sim4life was run is desired, I think this can be summed (from the cpu_seconds or so, for example). But "number of times s4l was started/stopped" is considerably harder to extract programmatically from Prometheus, while visually (with my human eye) it is clear to see, since there are timeseries that start when a service spawns and stop when the service stops. Counting this in binary (1 for each time a service is started) is almost impossible according to my research.
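For illustration only, here is a minimal Python sketch of doing that count outside PromQL. It assumes the raw sample timestamps of one service's timeseries have already been exported, and it treats any gap longer than a chosen threshold as "the service stopped and was started again". The function name `count_starts` and the `max_gap` threshold are hypothetical, not part of simcore or Prometheus.

```python
from typing import Iterable


def count_starts(sample_times: Iterable[float], max_gap: float) -> int:
    """Count separate runs in a series of sample timestamps (in seconds).

    Assumption: a new run begins whenever the gap to the previous sample
    is larger than max_gap (e.g. a few scrape intervals).
    """
    starts = 0
    previous = None
    for t in sorted(sample_times):
        if previous is None or t - previous > max_gap:
            starts += 1  # first sample, or first sample after a long gap
        previous = t
    return starts


# Samples every 15 s: one run of 4 samples, a long pause, then a run of 3.
timestamps = [0, 15, 30, 45, 600, 615, 630]
runs = count_starts(timestamps, max_gap=60)
print(runs)
```

The result depends entirely on the threshold: a service that restarts faster than `max_gap` would be counted as a single run, which is one reason an explicit "service started" counter in the backend is more reliable.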

I hope what I was trying to say came across, if truly needed I'll go through my browser history and try to find the references from the time I tried to mess with this

sanderegg commented 2 years ago

@GitHK : that would be in services/director/src/simcore_service_director/producer.py, lines 903 and 1145, for the legacy services, i.e. it sends a message to RabbitMQ for instrumentation. I think this might be all that is needed. Contact me if needed.

sanderegg commented 2 years ago

@mrnicegyu11 , well good that you tried... would have been nice

GitHK commented 2 years ago

@sanderegg so I would not wait to be sure that the service was started successfully? I have several points in time when I can execute this:

Which would be the correct one?

sanderegg commented 2 years ago

well, started means that it is started and healthy.

sanderegg commented 2 years ago

otherwise I don't think we can count it as started @GitHK

elisabettai commented 2 years ago

Thanks @GitHK for having a look at this and @sanderegg, @mrnicegyu11 for your input. @GitHK, let me know how it goes and when we can have a look at the data in master.

The metric they would like to see is: "Average number of times per month a service on the o2S2PARC platform was run, up to the top 10 most popular services averaged over the quarter reported on".
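One possible reading of that metric can be sketched in Python as follows. This is only an interpretation, not the query used in Grafana: it assumes a list of individual "service started" events is available, counts the starts per service within the quarter, divides by three months, and keeps the ten most-started services. The function name and the input format are hypothetical.

```python
from collections import Counter
from datetime import date


def top_services_monthly_average(start_events, quarter_start, quarter_end, top_n=10):
    """Return [(service_key, average starts per month)] for the top services.

    start_events: iterable of (service_key, start_day) pairs (assumed format).
    Only events with quarter_start <= start_day <= quarter_end are counted.
    """
    totals = Counter(
        service
        for service, day in start_events
        if quarter_start <= day <= quarter_end
    )
    months_in_quarter = 3
    return [
        (service, count / months_in_quarter)
        for service, count in totals.most_common(top_n)
    ]


# Made-up events for Q3 2022, plus one sleeper start outside the quarter.
events = (
    [("sleeper", date(2022, 7, 10))] * 6
    + [("3d-viewer-gpu", date(2022, 8, 5))] * 3
    + [("sleeper", date(2022, 10, 1))]
)
result = top_services_monthly_average(events, date(2022, 7, 1), date(2022, 9, 30))
print(result)
```

Writing the definition down this way makes it easier to compare against what the PromQL query behind the graph actually computes.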

As a sanity check, I'm having a look at what we have in Grafana in master. Do the values make sense? I see sleeper:2.0.2 being used ~26K times on average per month, which sounds a bit too high, even considering the e2e and p2e tests running every hour and every day. Maybe I am still reading those values on the y-axis wrong.

image

mrnicegyu11 commented 2 years ago

@elisabettai Without doing the math explicitly I also find this high. But I guess this is a quantitative question so we need to check how many sleepers etc. are actually started per day by the e2e/p2e....

It might also be that the title of the graph is either wrong or makes the underlying PromQL query sound simpler than it is. Maybe consider explicitly checking whether the PromQL query truly calculates the average number of times per month averaged over the quarter (it is unclear to me at first sight what that even means :D )

sanderegg commented 2 years ago

@elisabettai , @mrnicegyu11 so a few things: this graph shows 2 different sleeper 2.0.2? Looks suspicious. Also, in theory:

e2e runs: 1/hour * (5 sleepers (serial test) + 5 sleepers (parallel test)) * 24h * 30 days = 7200 runs per month

p2p runs: not running sleepers as far as I know
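The e2e estimate can be written out as a quick Python check and compared with the ~26K per month read off the Grafana graph earlier in the thread. The constant names are only illustrative, and the observed value is the approximate figure quoted above, not a measured one.

```python
# Figures taken from the estimate in this thread.
E2E_RUNS_PER_HOUR = 1
SLEEPERS_PER_RUN = 5 + 5  # serial test + parallel test
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30

expected_monthly_sleepers = (
    E2E_RUNS_PER_HOUR * SLEEPERS_PER_RUN * HOURS_PER_DAY * DAYS_PER_MONTH
)

# Approximate value read from the graph (~26K per month).
observed_monthly_sleepers = 26_000
ratio = observed_monthly_sleepers / expected_monthly_sleepers

print(expected_monthly_sleepers)  # 7200
print(round(ratio, 2))  # 3.61
```

The graph therefore shows roughly 3.6 times more sleeper starts than the e2e tests alone would explain, which supports the suspicion that something is off in the query or in the data.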

mrnicegyu11 commented 2 years ago

Running the query in prometheus instead of grafana gives a more verbose picture, something seems a bit off. There are multiple "deployments" on master, I guess this is not as intended and needs investigation

image

GitHK commented 2 years ago

> otherwise I don't think we can count it as started @GitHK

@sanderegg We do count them as healthy for the older ones, if you look at the code, but nothing guarantees the service actually started. That's why I'm asking this. We're going to have 2 slightly differently measured metrics. Someone needs to decide when to measure it.

sanderegg commented 2 years ago

@GitHK , ok so then do it the same. a service started is started

elisabettai commented 2 years ago

Just looking with @sanderegg, there seem to be some problems (on AWS prod for the webserver-started services, e.g. this one: a Prometheus graph based on rate(simcore_simcore_service_director_services_started_total{service_key="simcore/services/dynamic/3d-viewer-gpu"}[1d])).

elisabettai commented 1 year ago

@GitHK, any chance you can have a look at this soon-ish? We'd need this for the report we have to finalize in two weeks.