Closed: @ulucinar closed this issue 2 years ago
Here are the results from two experiments on provider-jet-azure.
Experiment Setup:
On a GKE cluster with the following specs:
- Machine family: General purpose e2-standard-4 (4 vCPU, 16 GB memory)
- 3 worker nodes
- Control plane version: v1.20.11-gke.1300
- 1 worker-count for the two test cases

Note: For previous tests and general context, please see this issue: https://github.com/crossplane/terrajet/issues/55
Case 1: Test provider-jet-azure v0.8.0 (version without shared gRPC implementation)
For this case the following image was used: crossplane/provider-jet-azure:v0.8.0
Firstly, provider-jet-azure v0.8.0 was deployed to the cluster. Then 50 VirtualNetwork and 50 LoadBalancer MRs were created simultaneously (100 MRs in total).
An example invocation of the generator script looks like the following:
$ ./manage-mr.sh create ./loadbalancer.yaml $(seq 1 50)
$ ./manage-mr.sh create ./virtualnetwork.yaml $(seq 1 50)
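The generated MR manifests themselves are not reproduced above. For illustration only, a VirtualNetwork MR for provider-jet-azure might look roughly like the following; the API group/version, field names, and values here are assumptions for the sketch, not copied from the test run (the real templates live in https://github.com/ulucinar/terrajet-scale):

```yaml
# Hypothetical sketch of a generated VirtualNetwork MR; apiVersion and
# spec fields are assumed, not taken from the actual experiment.
apiVersion: network.azure.jet.crossplane.io/v1alpha2
kind: VirtualNetwork
metadata:
  name: example-vnet-1
spec:
  forProvider:
    addressSpace:
      - 10.0.0.0/16
    location: East US
    resourceGroupName: example-rg
  providerConfigRef:
    name: default
```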
The following graphs were captured from the Grafana dashboard:
The chart above shows the MR counts and CPU/memory utilization. As can be seen, CPU usage peaked at the beginning of the resource creation process, reaching close to 40%. Although there are fluctuations afterward, CPU usage averaged roughly 25-30% from the beginning to the end of the test.
The data and histogram above show the time it took for the 100 created resources to become Ready.
Case 2: Test provider-jet-azure with shared gRPC implementation
For this case the following image was used: ulucinar/provider-jet-azure-amd64:shared-grpc
Firstly, provider-jet-azure (with a custom image that contains the shared gRPC implementation) was deployed to the cluster. Then 50 VirtualNetwork and 50 LoadBalancer MRs were created simultaneously (100 MRs in total).
An example invocation of the generator script looks like the following (same as in Case 1):
$ ./manage-mr.sh create ./loadbalancer.yaml $(seq 1 50)
$ ./manage-mr.sh create ./virtualnetwork.yaml $(seq 1 50)
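The generator's behavior can be sketched as follows. This is a minimal hypothetical sketch (it assumes a NAME placeholder in the template file), not the actual manage-mr.sh, which lives in https://github.com/ulucinar/terrajet-scale:

```shell
#!/bin/sh
# Minimal sketch of a manifest generator like manage-mr.sh (hypothetical;
# the real script is in https://github.com/ulucinar/terrajet-scale).
# It stamps each index into a template and emits one manifest per index.
generate() {
  template="$1"
  shift
  for i in "$@"; do
    # Replace a NAME placeholder with an indexed resource name.
    sed "s/NAME/example-$i/g" "$template"
    # Separate the documents in the emitted YAML stream.
    echo "---"
  done
}

# Usage (pipe to kubectl to create the MRs):
#   generate ./virtualnetwork.yaml $(seq 1 50) | kubectl apply -f -
```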
The following graphs were captured from the Grafana dashboard:
The chart above shows the MR counts and CPU/memory utilization. As can be seen, CPU usage peaked at the beginning of the resource creation process, reaching roughly 21-22%. Although there are fluctuations afterward, CPU usage averaged around 15% from the beginning to the end of the test.
Note: No stability issues, such as provider pod restarts, were observed while testing this case.
The data and histogram above show the time it took for the 100 created resources to become Ready.
Result:
When we check the CPU/memory utilization, we see a decrease: both the average and peak values are lower in the gRPC-based implementation case.
For readiness time, all of the statistics show an improvement in the gRPC-based implementation case.
In light of the above results, it is fair to say that the gRPC implementation makes a significant difference both in terms of resource consumption (CPU/memory) and the time it takes for resources to become Ready.
Thank you @sergenyalcin for carrying out these experiments, excellent work! Could you please also record the shared gRPC implementation image you have used in the experiments in your comment?
@sergenyalcin it may also be helpful to record in your comment that we have not observed any stability issues with the shared gRPC server in your experiments, since it depends on a non-production (testing) Terraform configuration. One important aspect of these experiments is to observe the stability of the shared gRPC implementation under load.
@ulucinar thank you for your comments. Both comments have been addressed!
Thanks @sergenyalcin ! I think we can conclude and close this issue and also https://github.com/crossplane/terrajet/issues/38 . The only risk seems to be that we'll be using an undocumented path but it's quite easy to turn on/off with a config so provider maintainers can choose whether they'd like to take the risk or not. @sergenyalcin @ulucinar do you agree?
The next step could be to open an issue targeting implementation of gRPC usage. Once an example usage is in provider-jet-template, we can update the guide and the Jet providers we're maintaining to that method.
@muvaf I think we can close this issue as you suggest.
As a summary of these scale tests: when the gRPC server-based implementation is used, there are significant improvements both in terms of resource consumption (CPU/memory) and the time it takes for managed resources to become Ready.
We can open another issue for tracking the implementation of gRPC usage.
What problem are you facing?
We have produced a shared gRPC server-based implementation for provider-jet-azure in the context of https://github.com/crossplane/terrajet/issues/38. The provider-jet-azure packages ulucinar/provider-jet-azure-arm64:shared-grpc and ulucinar/provider-jet-azure-amd64:shared-grpc are modified to run the terraform-provider-azurerm binary plugin in the background as a shared gRPC server, so that the Terraform CLI does not have to fork the binary plugin for each of its requests.

How could Terrajet help solve your problem?
Similar to what we have done previously in https://github.com/crossplane/terrajet/issues/55, we need to reevaluate the performance of provider-jet-azure@v0.7.0 and also of the shared gRPC implementation using the above provider packages. This will allow us to assess and quantify any performance improvements with the shared gRPC implementation. Some of the previously used scripts for #55 are available in https://github.com/ulucinar/terrajet-scale.
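As background on how a shared provider server avoids per-request forks: Terraform's documented mechanism for attaching the CLI to an already-running provider process is the TF_REATTACH_PROVIDERS environment variable. The sketch below illustrates that general mechanism only; the PID, socket path, and debug flag are placeholders, and the shared-grpc packages' exact wiring may differ.

```shell
# General Terraform mechanism for reusing a running provider process
# (illustrative; the values below are placeholders, not from the experiment).

# 1. Start the provider binary once in debug/server mode; it prints a
#    reattach JSON document describing its gRPC endpoint:
#      ./terraform-provider-azurerm -debug
#    (flag support varies by provider build)

# 2. Point the CLI at the running server via the printed value:
export TF_REATTACH_PROVIDERS='{"registry.terraform.io/hashicorp/azurerm":{"Protocol":"grpc","Pid":12345,"Addr":{"Network":"unix","String":"/tmp/plugin-socket"}}}'

# 3. Subsequent terraform invocations now talk gRPC to the existing
#    process instead of forking a new plugin per request:
#      terraform plan
```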