GoogleCloudPlatform / microservices-demo

Sample cloud-first application with 10 microservices showcasing Kubernetes, Istio, and gRPC.
https://cymbal-shops.retail.cymbal.dev
Apache License 2.0

Increment profiler version at release time #641

Open Shabirmean opened 2 years ago

Shabirmean commented 2 years ago

Describe request or inquiry

The @google-cloud/profiler is used in our services to produce profiling information and to track down issues when something is noticed. The profiler is always initialized with a version. This version needs to be updated and maintained with each release of the application so that, whenever we notice a sudden issue with the latest release, we can compare profiling information against the previous version.

Since the first release of the sample app, the profiler version has never been updated (see currencyservice, paymentservice, and frontend as examples). This version should match the sample application's release version.
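
For context, this is roughly how the Node.js services start the profiler today. This is a simplified sketch, not the exact code in the repo; the hard-coded version string is what this issue proposes keeping in sync with the release:

```js
// Simplified sketch of how a Node.js service (e.g. currencyservice)
// starts Cloud Profiler. The version string below is hard-coded and
// has never been bumped since the first release.
const profiler = require('@google-cloud/profiler');

profiler.start({
  serviceContext: {
    service: 'currencyservice',
    version: '1.0.0', // should track the sample app's release version
  },
});
```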

Thus, we must add a step to the release process documentation instructing the release owner to manually create PRs that update the profiler version on the release branches. Later, we should add automation to update these versions automatically during the release process.

Doing this will ensure that we can filter profiler data by release version and compare releases to see where resource utilization has changed over time. This makes tracking down issues like #538 easier.

What purpose/environment will this feature serve?

ahrarmonsur commented 2 years ago

@NimJay has this been incorporated into the roadmap planning for microservices-demo?

NimJay commented 2 years ago

@ahrarmonsur, good question. And no, this has not been incorporated into the 2022 sample app roadmap.

It would be a nice thing to have though. As Shabir mentioned, it could help us during investigations for bugs like https://github.com/GoogleCloudPlatform/microservices-demo/issues/538.

Versioning Dilemma

With respect to versioning, here are some different options:

  1. Version each microservice independently (e.g., cartservice could be at 0.2.4 and paymentservice could be at 4.3.0). This is more in line with reality (i.e., in practice, different teams own each microservice and create new releases at their own pace). We could mimic this by bumping a microservice's version each time its code changes, but that would add complexity and maintenance. I'm not a fan.
  2. Use the version of this repository itself for all microservices (e.g., today, each microservice would be versioned 0.3.6). This is simple and only requires modifying the release automation (i.e., every time we create a new Online Boutique release, we update the version of each microservice).
  3. Pass the git hash of HEAD to @google-cloud/profiler at runtime. For instance, currencyservice/server.js#L25 would simply read the version from an environment variable, which would have to be set inside the Dockerfile (see the sketch after this list). Using the git hash would allow us to pinpoint exactly which change caused a certain performance change.
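
A minimal sketch of what option 3 could look like. The `VERSION` variable name and the Dockerfile wiring (e.g. `ARG COMMIT_SHA` / `ENV VERSION=$COMMIT_SHA` at image build time) are assumptions for illustration, not existing code:

```js
// Hypothetical sketch for option 3: read the service version from an
// environment variable that the Dockerfile would set to the git commit
// SHA at build time (e.g. ARG COMMIT_SHA; ENV VERSION=$COMMIT_SHA).
const profiler = require('@google-cloud/profiler');

profiler.start({
  serviceContext: {
    service: 'currencyservice',
    // Fall back to a placeholder so local builds without the build arg
    // still start cleanly.
    version: process.env.VERSION || 'unknown',
  },
});
```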

I, personally, prefer option 3. @Shabirmean, did you have any thoughts around this?

Shabirmean commented 2 years ago

@NimJay - thank you for outlining the options above. All great suggestions!

I agree with you on Option-1. It might not be worth the investment to take on all that complexity. Also, since we always release the microservices together, it's better to keep their versioning the same.


Options 2 & 3 are both good, but we would want to pick one based on how it would actually get used in a real scenario. As I see it, this is what happens:

  1. Somebody notices an issue with the app and reports it to us.
  2. We reproduce it on main.
  3. We want to find where it was introduced.
  4. To isolate where things might have gone wrong, we have to compare some previous states to locate the change that caused the issue.

So for us to do (4), ideally we would have a few previous releases in some staging/test cluster that have already generated useful profiler data. We can compare that data against later versions to see where the anomaly starts appearing and what the differences are.

Going by the above thought process, I think Option-2 is sufficient. It is unlikely that we would set up deployments pinned to hashes that are close to each other, and it adds another level of thinking for a developer to decide which hashes are good starting points for debugging. That question is answered by using release hashes (or versions) as the test boundaries. So I think using release versions, as suggested in Option-2, is a good choice in this case.

Also, if we set the versions in our production deployments, we will automatically have profiler data generated between the current release and the next.

minherz commented 1 year ago

@Shabirmean, I have assigned the issue to you. Please log the next steps or close the issue.

Shabirmean commented 1 year ago

@minherz - thank you for checking up on this issue. This is an improvement suggestion for the repo. I think the issue description and the conversation that followed sufficiently summarize the ask. We would only need to add more details if this issue is prioritized to be worked on.

It is up to the repo TL (@NimJay) to decide whether this is top of mind and how to prioritize it. I will remove myself as assignee and leave it to Nim to make the call on this issue.

bourgeoisor commented 1 year ago

I think this is still important for the sake of best practices. I agree with Shabir: option 2 is the best choice.