Closed: @ulucinar closed this issue 2 years ago
Here are the results from two experiments on provider-jet-azure.
Experiment Setup:
On a GKE cluster with the following specs:
- Machine family: General purpose e2-standard-4 (4 vCPU, 16 GB memory)
- 3 worker nodes
- Control plane version: v1.20.11-gke.1300
- 1 worker-count for the two test cases

Note: For previous tests and general context, please see this issue: https://github.com/crossplane/terrajet/issues/55
Case 1: Test provider-jet-azure v0.8.0 (version without shared gRPC implementation)
For this case the following image was used: crossplane/provider-jet-azure:v0.8.0
Firstly, provider-jet-azure v0.8.0 was deployed to the cluster. Then 50 VirtualNetwork and 50 LoadBalancer MRs were created simultaneously (100 MRs in total).
An example invocation of the generator script looks like the following:
$ ./manage-mr.sh create ./loadbalancer.yaml $(seq 1 50)
$ ./manage-mr.sh create ./virtualnetwork.yaml $(seq 1 50)
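The generated MR manifests themselves are not reproduced above. For illustration only, a VirtualNetwork MR for provider-jet-azure might look roughly like the following; the API group/version, field names, and values here are assumptions for the sketch, not copied from the test run (the real templates live in https://github.com/ulucinar/terrajet-scale):

```yaml
# Hypothetical sketch of a generated VirtualNetwork MR; apiVersion and
# spec fields are assumed, not taken from the actual experiment.
apiVersion: network.azure.jet.crossplane.io/v1alpha2
kind: VirtualNetwork
metadata:
  name: example-vnet-1
spec:
  forProvider:
    addressSpace:
      - 10.0.0.0/16
    location: East US
    resourceGroupName: example-rg
  providerConfigRef:
    name: default
```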
The following graphs were captured from the Grafana dashboard:
The chart above shows the MR counts and CPU/memory utilization. As can be seen, CPU usage peaked at the beginning of the resource creation process, reaching close to 40%. Although there are fluctuations afterward, CPU usage averaged roughly 25-30% from the beginning to the end of the test.
The data and histogram above show the time it took for the 100 created resources to become Ready.
Case 2: Test provider-jet-azure with shared gRPC implementation
For this case the following image was used: ulucinar/provider-jet-azure-amd64:shared-grpc
Firstly, provider-jet-azure (with a custom image that contains the shared gRPC implementation) was deployed to the cluster. Then 50 VirtualNetwork and 50 LoadBalancer MRs were created simultaneously (100 MRs in total).
An example invocation of the generator script looks like the following (same as in Case 1):
$ ./manage-mr.sh create ./loadbalancer.yaml $(seq 1 50)
$ ./manage-mr.sh create ./virtualnetwork.yaml $(seq 1 50)
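The generator's behavior can be sketched as follows. This is a minimal hypothetical sketch (it assumes a NAME placeholder in the template file), not the actual manage-mr.sh, which lives in https://github.com/ulucinar/terrajet-scale:

```shell
#!/bin/sh
# Minimal sketch of a manifest generator like manage-mr.sh (hypothetical;
# the real script is in https://github.com/ulucinar/terrajet-scale).
# It stamps each index into a template and emits one manifest per index.
generate() {
  template="$1"
  shift
  for i in "$@"; do
    # Replace a NAME placeholder with an indexed resource name.
    sed "s/NAME/example-$i/g" "$template"
    # Separate the documents in the emitted YAML stream.
    echo "---"
  done
}

# Usage (pipe to kubectl to create the MRs):
#   generate ./virtualnetwork.yaml $(seq 1 50) | kubectl apply -f -
```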
The following graphs were captured from the Grafana dashboard:
The chart above shows the MR counts and CPU/memory utilization. As can be seen, CPU usage peaked at the beginning of the resource creation process, reaching roughly 21-22%. Although there are fluctuations afterward, CPU usage averaged around 15% from the beginning to the end of the test.
Note: No stability issues, such as provider pod restarts, were observed while testing this case.
The data and histogram above show the time it took for the 100 created resources to become Ready.
Result:
When we check the CPU/memory utilization, we see a decrease: both the average and peak values are lower in the gRPC-based implementation case.
For readiness time, all of the statistics show an improvement in the gRPC-based implementation case.
In light of the above results, it is fair to say that the gRPC implementation makes a significant difference both in terms of resource consumption (CPU/memory) and the time it takes for resources to become Ready.
Thank you @sergenyalcin for carrying out these experiments, excellent work! Could you please also record the shared gRPC implementation image you have used in the experiments in your comment?
@sergenyalcin it may also be helpful to record in your comment that we have not observed any stability issues with the shared gRPC server in your experiments, since it depends on a non-production (testing) Terraform configuration. One important aspect of these experiments is to observe the stability of the shared gRPC implementation under load.
@ulucinar thank you for your comments. Both comments have been addressed!
Thanks @sergenyalcin ! I think we can conclude and close this issue and also https://github.com/crossplane/terrajet/issues/38 . The only risk seems to be that we'll be using an undocumented path but it's quite easy to turn on/off with a config so provider maintainers can choose whether they'd like to take the risk or not. @sergenyalcin @ulucinar do you agree?
The next step could be to open an issue targeting implementation of gRPC usage. Once an example usage is in provider-jet-template, we can update the guide and the Jet providers we're maintaining to that method.
@muvaf I think we can close this issue as you suggest.
As a summary of these scale tests: when the gRPC server-based implementation is used, there are significant improvements both in terms of resource consumption (CPU/memory) and the time it takes for managed resources to become Ready.
We can open another issue for tracking the implementation of gRPC usage.
What problem are you facing?
We have produced a shared gRPC server-based implementation for provider-jet-azure in the context of https://github.com/crossplane/terrajet/issues/38. The provider-jet-azure packages ulucinar/provider-jet-azure-arm64:shared-grpc and ulucinar/provider-jet-azure-amd64:shared-grpc are modified to run the terraform-provider-azurerm binary plugin in the background as a shared gRPC server, so that the Terraform CLI does not have to fork the binary plugin for each of its requests.

How could Terrajet help solve your problem?
Similar to what we have done previously in https://github.com/crossplane/terrajet/issues/55, we need to reevaluate the performance of provider-jet-azure@v0.7.0 and also of the shared gRPC implementation using the above provider packages. This will allow us to assess and quantify any performance improvements with the shared gRPC implementation. Some of the previously used scripts for #55 are available in https://github.com/ulucinar/terrajet-scale.
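As background on how a shared provider server avoids per-request forks: Terraform's documented mechanism for attaching the CLI to an already-running provider process is the TF_REATTACH_PROVIDERS environment variable. The sketch below illustrates that general mechanism only; the PID, socket path, and debug flag are placeholders, and the shared-grpc packages' exact wiring may differ.

```shell
# General Terraform mechanism for reusing a running provider process
# (illustrative; the values below are placeholders, not from the experiment).

# 1. Start the provider binary once in debug/server mode; it prints a
#    reattach JSON document describing its gRPC endpoint:
#      ./terraform-provider-azurerm -debug
#    (flag support varies by provider build)

# 2. Point the CLI at the running server via the printed value:
export TF_REATTACH_PROVIDERS='{"registry.terraform.io/hashicorp/azurerm":{"Protocol":"grpc","Pid":12345,"Addr":{"Network":"unix","String":"/tmp/plugin-socket"}}}'

# 3. Subsequent terraform invocations now talk gRPC to the existing
#    process instead of forking a new plugin per request:
#      terraform plan
```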