BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (this includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3)
Apache License 2.0

Discussion about Service Mesh for Openshift clusters #2973

Open ShellyXueHan opened 2 years ago

ShellyXueHan commented 2 years ago

Describe the issue

Kevin brought up the benefits of having a service mesh on our cluster for better security and monitoring of network traffic. Let's have a discussion with the team to review the options and decide whether we should move forward with it.

What is the Value/Impact?

Provides more detailed network traffic information.

What is the plan? How will this get completed?

Have a meeting to discuss:

Identify any dependencies

Definition of done

StevenBarre commented 2 years ago

Let's make sure to bring along @gnunn1 and Matt, the TAM.

ksummersill2 commented 1 year ago

@StevenBarre - How difficult would it be to add Istio or Linkerd as a Service Mesh to klab?

StevenBarre commented 1 year ago

We've got a backlog ticket for it: https://app.zenhub.com/workspaces/platform-experience-5bb7c5ab4b5806bc2beb9d15/issues/bcdevops/developer-experience/2990

It would take a bit of time to add it to CCM, and then to learn about its operation and ensure our on-call team is up to speed on maintaining it. We also need to make sure it is properly multi-tenant, as we've seen issues with other tools there before.

gnunn1 commented 1 year ago

With respect to service mesh, I would recommend Istio over Linkerd simply because OpenShift provides OOTB support for Istio as part of OpenShift Service Mesh, as well as support for multi-tenancy, which is critical to BC Gov:

https://docs.openshift.com/container-platform/4.10/service_mesh/v2x/ossm-about.html

Keep in mind that a service mesh is something application teams need to actively participate in and configure; it's not an install-only thing from which everyone suddenly benefits. While the mesh does provide some benefits around observability, this observability is restricted to the mesh. Note that there are some improvements coming to core OpenShift with regard to network observability, and it is also possible to run some tools like OpenTracing (e.g. Jaeger) outside of the mesh.

For me, the real benefit of the mesh is effectively managing the routing of traffic at the platform level between various components, particularly when dealing with microservice deployments, as well as externalizing some of the security aspects via mutual TLS. For teams that are running a small number of static deployments, the benefits of the mesh are much lower; teams that run microservices which are constantly being developed and updated will see more ROI, IMHO.

Like anything else, this would be a platform service, so there would need to be processes in place to provision, manage and support Istio control planes on a per-team basis for teams that want to use it.

ksummersill2 commented 1 year ago

@gnunn1 - The reason for running the service mesh is to track USE Method and RED Method/Golden Signals metrics such as latency (for example, the latency between services), plus a few other items. Do you have the link to Jaeger?

gnunn1 commented 1 year ago

You can find the docs on Distributed Tracing here:

https://docs.openshift.com/container-platform/4.10/distr_tracing/distributed-tracing-release-notes.html

Note that without Service Mesh, your application must be configured to use it explicitly. For most modern dev frameworks this is typically a configuration setting that gets you some automatic tracing OOTB (typically ingress/egress from the pod), and then you can add explicit tracing where you need it. With Service Mesh, the ingress/egress is handled automatically via Envoy, but again most apps will want to add some tracing internally as well.
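To make the distinction between automatic and explicit tracing concrete, here is a minimal sketch using only the Python standard library. It is not the Jaeger or OpenTracing client API; the `span` helper, `FINISHED_SPANS`, and the span names are hypothetical and only illustrate the idea of an outer span (roughly what a framework or the Envoy sidecar would record) wrapping inner spans that the application adds itself.

```python
import time
from contextlib import contextmanager

# Collected spans; a real tracer would export these to a backend instead.
FINISHED_SPANS = []
_open_spans = []  # names of currently open spans, innermost last (not thread-safe)


@contextmanager
def span(name):
    """Time a block of work and record it together with its parent span, if any."""
    parent = _open_spans[-1] if _open_spans else None
    _open_spans.append(name)
    start = time.perf_counter()
    try:
        yield
    finally:
        _open_spans.pop()
        FINISHED_SPANS.append({
            "name": name,
            "parent": parent,
            "duration_ms": (time.perf_counter() - start) * 1000.0,
        })


def handle_request():
    # Outer span: the ingress/egress level that automatic tracing typically covers.
    with span("GET /orders"):
        # Inner spans: explicit tracing added by the application team.
        with span("load_orders_from_db"):
            time.sleep(0.01)  # stand-in for real work
        with span("render_response"):
            pass
```

Calling `handle_request()` records three spans, with the two inner ones pointing at `"GET /orders"` as their parent. Without the inner spans, only the total request time would be visible, which is the gap that explicit tracing fills.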

ksummersill2 commented 1 year ago

@gnunn1 I am aware of injecting a service mesh into an application, as well as the ingress and egress. I'm just trying to find tools and services that can meet our needs for tracking availability, latency, errors, saturation, usability, etc. Thank you for responding so quickly.
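For reference, the RED-style signals mentioned in this thread (rate, errors, duration) can be derived from per-request records regardless of which tool collects them. The following is a rough sketch under stated assumptions: the `Request` record shape, the `red_summary` function, and the choice to count HTTP 5xx responses as errors are all hypothetical and not taken from Istio, Jaeger, or OpenShift.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    """One observed call between two services (hypothetical record shape)."""
    source: str
    destination: str
    status: int         # HTTP status code
    duration_ms: float  # observed latency


def red_summary(requests, window_seconds):
    """Return rate (req/s), error ratio, and p50/p95 latency per service pair."""
    by_pair = {}
    for r in requests:
        by_pair.setdefault((r.source, r.destination), []).append(r)

    summary = {}
    for pair, reqs in by_pair.items():
        durations = sorted(r.duration_ms for r in reqs)
        errors = sum(1 for r in reqs if r.status >= 500)  # assumption: 5xx = error
        # quantiles() needs at least two data points; fall back to the single value.
        if len(durations) >= 2:
            cuts = quantiles(durations, n=100, method="inclusive")
            p50, p95 = cuts[49], cuts[94]
        else:
            p50 = p95 = durations[0]
        summary[pair] = {
            "rate_per_s": len(reqs) / window_seconds,
            "error_ratio": errors / len(reqs),
            "p50_ms": p50,
            "p95_ms": p95,
        }
    return summary
```

Grouping by `(source, destination)` gives the latency between services that was asked about above. Saturation (the "S" in USE) is not covered here, since it needs resource metrics rather than request records.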