FuelLabs / infrastructure

Shared infrastructure templates for Fuel services

Setup Basic Telemetry (Prometheus support) #28

Closed · Voxelot closed this 1 year ago

Voxelot commented 2 years ago

We use tracing as per our coding standard. However, we need to be able to monitor metrics using systems like Prometheus and Jaeger.

Investigate how we'd integrate our tracing infrastructure into telemetry using tools like tracing-opentelemetry, and set up minimum viable telemetry for fuel-core.
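
For illustration, a minimal sketch (the function and field names are hypothetical, not from fuel-core) of the kind of existing `tracing` instrumentation a bridge like tracing-opentelemetry would need to carry into a telemetry backend without rewriting call sites:

```rust
use tracing::{info, instrument};

// Hypothetical function: spans and events emitted through `tracing` like this
// are what a bridge such as tracing-opentelemetry would forward to a backend
// (Prometheus, Jaeger, ...) without changing the call sites.
#[instrument(skip(tx_bytes))]
fn submit_transaction(tx_bytes: &[u8]) {
    info!(size = tx_bytes.len(), "received transaction");
}

fn main() {
    // Plain stdout subscriber for the sketch; a telemetry layer would be
    // added here once the OpenTelemetry integration is in place.
    tracing_subscriber::fmt::init();
    submit_transaction(&[0u8; 32]);
}
```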

Goals:

Related Tasks:

digorithm commented 2 years ago

Let me know if you need any help here; I've had copious experience operating Prometheus at a large scale.

rfuelsh commented 2 years ago
  1. Jaeger with Istio:

https://istio.io/latest/docs/tasks/observability/distributed-tracing/jaeger/

or

  2. Prometheus Service Monitors

https://observability.thomasriley.co.uk/prometheus/configuring-prometheus/using-service-monitors/

I have looked into tracing with Linkerd recently, which is becoming popular as it's more lightweight and easier to adopt than Istio - https://linkerd.io/2.10/tasks/distributed-tracing/ - but the issue is that Linkerd doesn't have built-in ingress like Istio, which is needed for routing on the backend and load balancing (ingress gateway).

Would love to hear @digorithm's perspective on this.

digorithm commented 2 years ago

I did tons of observability work at previous companies, and by far the best experience I had was with Prometheus + Grafana. So that's the one I've been personally using and advocating ever since.

We can get distributed tracing with Grafana: https://grafana.com/docs/grafana/latest/explore/trace-integration/ (supports Jaeger, Tempo, and more), plus we get all the excellent Prometheus SDKs to add monitoring in our applications, Prometheus' node exporter for system metrics (no manual logging needed), and Grafana offers a lot of out-of-the-box dashboards for all of that.

Then we can opt to use Grafana Cloud or self-host it; both are relatively simple to do.
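
For concreteness, a minimal sketch (the metric name and crate version are assumptions) of the Prometheus Rust SDK usage mentioned above: register a counter in a registry and render it in the text exposition format that a /metrics endpoint would serve:

```rust
use prometheus::{Counter, Encoder, Opts, Registry, TextEncoder};

fn main() {
    let registry = Registry::new();

    // Hypothetical counter for the sketch; real metrics would come from the
    // laundry list of fuel_core metrics discussed later in this issue.
    let blocks = Counter::with_opts(Opts::new(
        "blocks_processed_total",
        "Number of blocks processed by the node",
    ))
    .unwrap();
    registry.register(Box::new(blocks.clone())).unwrap();

    blocks.inc();

    // This is the payload a /metrics endpoint would return to Prometheus.
    let mut buf = Vec::new();
    TextEncoder::new()
        .encode(&registry.gather(), &mut buf)
        .unwrap();
    println!("{}", String::from_utf8(buf).unwrap());
}
```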

rfuelsh commented 2 years ago

From @Voxelot :

" For telemetry, we may want to look at instrumenting something like this in addition to application level traces: https://medium.com/tezedge/halving-the-tezedge-nodes-memory-usage-with-an-ebpf-based-memory-profiler-2bfd32f94f69"

uberscott commented 2 years ago

I've been working on this issue this week.

I'm going to break this ticket into smaller tickets as there have been multiple steps to realizing our goals. So far this is how it shakes down (and some of these tasks are already 'done'):

  1. Creation of a POC Rust application that implements opentelemetry, deployed to local Docker Desktop with a custom Prometheus instance, for taking learnings back to fuel_core
  2. Implement opentelemetry in the POC Rust app
  3. Implement opentelemetry tracing in the POC Rust app (this will allow opentelemetry to consume the existing tracing already throughout the application)
  4. Ensure tracing in tower-http passes to opentelemetry (and new layers can be added). I checked and it looks like fuel_core uses tower_http, so we should trace on that.
  5. Implementation of opentelemetry_prometheus, which converts telemetry data into a format that Prometheus can consume (ironically, this does not seem to offer a server implementation for Prometheus to pull the metrics -- I hope I haven't missed something)
  6. Creation of a very simple HTTP server instance for allowing Prometheus to pull metrics (we have decided to go with metric pulling as per the advice of other team members with experience... very ironically, there doesn't seem to be a crate that serves these metrics from opentelemetry, which is surprising and maybe I have missed something; however, implementing this turned out to be exceptionally easy -- see the sketch after this list)
  7. Writing of Prometheus Kubernetes pod scraping rules
  8. Incorporation of learnings into fuel_core (meaning adding opentelemetry and initializing it through fuel_core's service implementation)
  9. Modify the production Prometheus deployment to scrape fuel_core metrics
  10. Multiple tickets will follow for implementing a laundry list of specific metrics within fuel_core itself (the list will be provided to me by Brandon)
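
For steps 5 and 6, a minimal sketch (the port is illustrative and the opentelemetry ~0.17-era APIs are an assumption; the exporter and instrument APIs have shifted in later releases) of converting metrics with opentelemetry_prometheus and serving them from a bare TcpListener so Prometheus can pull them:

```rust
use std::io::Write;
use std::net::TcpListener;

use prometheus::{Encoder, TextEncoder};

fn main() {
    // The exporter owns a prometheus::Registry that collects the OTel metrics.
    let exporter = opentelemetry_prometheus::exporter().init();

    // Hypothetical counter so the registry has something to expose.
    let meter = opentelemetry::global::meter("fuel-core-poc");
    let requests = meter.u64_counter("metrics_requests_total").init();

    let listener = TcpListener::bind("0.0.0.0:9090").expect("bind metrics port");
    for stream in listener.incoming().flatten() {
        requests.add(1, &[]);

        // Encode everything in the exporter's registry in the Prometheus
        // text exposition format.
        let mut body = Vec::new();
        TextEncoder::new()
            .encode(&exporter.registry().gather(), &mut body)
            .unwrap();

        let mut stream = stream;
        let _ = write!(
            stream,
            "HTTP/1.1 200 OK\r\nContent-Type: text/plain; version=0.0.4\r\nContent-Length: {}\r\n\r\n",
            body.len()
        );
        let _ = stream.write_all(&body);
    }
}
```
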
Voxelot commented 2 years ago

@uberscott for future reference if there are lots of subtasks that don't have a resultant work artifact, I suggest using checklists within a single issue.

https://docs.github.com/en/issues/tracking-your-work-with-issues/quickstart#adding-a-task-list

uberscott commented 2 years ago

@Voxelot You got it. (next time)

uberscott commented 2 years ago

Added Grafana, and now Grafana is also scraping Prometheus (and I can see the counter data, yippee!)

uberscott commented 2 years ago

Bridging tracing with opentelemetry is the holy grail (so devs don't need to recode for opentelemetry; we can just reuse our existing tracing instrumentation). The tracing-opentelemetry crate that is supposed to make this bridging easy does not seem to work with the latest version of opentelemetry...

I am still trying to figure out what needs to be done because I may need to fix the tracing-opentelemetry crate.
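
For reference, a minimal sketch of the bridge in question, assuming a Jaeger exporter and the older pipeline/layer APIs (exact names differ across opentelemetry and tracing-opentelemetry releases, which is the version mismatch described above):

```rust
use tracing::info;
use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::util::SubscriberInitExt;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build an OpenTelemetry tracer. The Jaeger exporter is an assumption
    // here; the pipeline API has moved around between releases.
    let tracer = opentelemetry_jaeger::new_pipeline()
        .with_service_name("fuel-core-poc")
        .install_simple()?;

    // tracing-opentelemetry turns `tracing` spans into OpenTelemetry spans,
    // so existing instrumentation does not need to be rewritten.
    let otel_layer = tracing_opentelemetry::layer().with_tracer(tracer);

    tracing_subscriber::registry()
        .with(tracing_subscriber::fmt::layer())
        .with(otel_layer)
        .init();

    info!("spans from existing tracing code now reach the OTel exporter");

    opentelemetry::global::shutdown_tracer_provider();
    Ok(())
}
```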

Voxelot commented 2 years ago

Feel free to open issues with opentelemetry and submit fixes if you have the context. However, if opentelemetry is too immature we may just need to roll with our own tracing setup.

rfuelsh commented 2 years ago

https://node-beta-1.fuel.network/metrics - @uberscott