crossplane-contrib / provider-upjet-aws

Official AWS Provider for Crossplane by Upbound.
https://marketplace.upbound.io/providers/upbound/provider-aws
Apache License 2.0
141 stars 121 forks source link

Re-enable shared TF provider usage #86

Closed muvaf closed 1 year ago

muvaf commented 1 year ago

What problem are you facing?

As per https://github.com/crossplane/terrajet/issues/233 , there is memory and CPU usage improvements when we use a single instance of TF provider to be shared by all terraform CLI calls. However, we had to disable it once we discovered that TF AWS provider uses credentials of another resource in some cases, which causes resource not found or resource already exists errors. First seen in https://github.com/crossplane-contrib/provider-jet-aws/issues/220

How could Official AWS Provider help solve your problem?

We need to fix the problem in TF AWS provider, bump the version and run Uptest to make sure all works. There is currently no issue in https://github.com/hashicorp/terraform-provider-aws that talks about the problem as far as I'm aware since shared mode is used only in tests by TF community but https://github.com/hashicorp/terraform-provider-aws/issues/26130 and https://github.com/hashicorp/go-plugin/issues/215 could be related which are mentioned in https://github.com/crossplane/terrajet/issues/305 .

I think the next steps are as following:

  1. Create a script that will reproduce the resource leak reliably without any Crossplane involvement - only TF provider and TF.
  2. Open an issue in https://github.com/hashicorp/terraform-provider-aws
  3. Dig into the code base and try to fix the problem.
  4. Once the fix is in and a new release is out, update TF AWS provider version used here. Run uptest-all action to test all resources.
  5. Run tests that can reproduce the initial problem and see if they're reproducible once we enable the shared provider mode back.
  6. Once we prove the fix works, enable it and cut a new minor release.
turkenh commented 1 year ago

I could reliably reproduce the underlying issue without getting Crossplane provider involved as follows:

  1. Started Terraform Provider AWS with --debug flag and noted down the TF_REATTACH_PROVIDERS value printed.
  2. On Terminal A, set TF_REATTACH_PROVIDERS value and cd into a directory with the following main.tf:
    
    terraform {
    required_providers {
    aws = {
      source = "hashicorp/aws"
      version = "4.35.0"
    }
    }
    }

provider "aws" { region = "us-west-1" }

resource "aws_vpc" "main" { cidr_block = "10.0.0.0/16" tags = { Name = "HTSharedTest-us-west-1" } }

3. On **Terminal B**, set `TF_REATTACH_PROVIDERS` value and cd into a directory with the following main.tf:
```tf
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      version = "4.35.0"
    }
  }
}

provider "aws" {
    region = "us-west-2"
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  tags = {
    Name = "HTSharedTest-us-west-2"
  }
}
  1. On both Terminal A & B run the following simultaneously:
    terraform apply -auto-approve -input=false -lock=false -json&
    terraform apply -auto-approve -input=false -lock=false -json&
    terraform apply -auto-approve -input=false -lock=false -json&
    terraform apply -auto-approve -input=false -lock=false -json&
    terraform apply -auto-approve -input=false -lock=false -json&

Observed different behaviours on different attempts as follows:

If I run them sequentially, only 1 VPC per region is created as expected.

Digging in further to see what is really going on...

turkenh commented 1 year ago

TL;DR;

This seems to be the expected behavior when you run Terraform in test mode, i.e., shared GRPC mode with --debug. It is not something at the AWS Provider level but is more fundamental, like related to how Terraform plugin mechanism is designed.


The problem here is in this mode of operation, we have a shared provider instance, and ConfigureProvider RPC operates on the same instance when parallel executions are going on. I've found this issue which identifies the same problem with a simple provider implementation. The response there summarizes the root cause as follows:

Normally when Terraform interacts with a provider while executing a command, it spins up individual provider processes per-alias (otherwise called a provider instance) and shuts down the provider processes between commands. In this case, each provider instance receives its own ConfigureProvider RPC and therefore there is no shared process memory between them.

When you run a provider in debugging mode, it becomes a singular, long-running process. From Terraform's perspective, there is a single reattachment address for all provider instances with the same provider address (e.g. registry.terraform.io/example/alias-bug), so Terraform will make all the normal RPCs against that single provider process. In this case, the process will receive the multiple, but differing ConfigureProvider RPCs, which can occur at different times depending on how the graph is constructed and how nodes are executed concurrently. If you run Terraform with TRACE logging enabled, e.g. TF_LOG=TRACE terraform apply, the log output should show multiple ConfigureProvider RPCs being handled.

The provider debugging feature was originally designed for debugging a single provider process, or put differently in this context, a single provider instance/alias. Terraform only supports a single reattach process per provider address currently. It would require some redesign to support multiple instances/aliases for debugging. You would need to spin up providers with differing provider addresses and expose those multiple addresses in the reattach configuration to workaround the current design.

The PR adding support for this mode in Terraform binary mentions the following caveat, which is relevant here:

Unmanaged providers are expected to work in the exact same way as managed providers, with one caveat: Terraform kills provider processes and restarts them once per graph walk, meaning multiple times during most Terraform CLI commands. As unmanaged providers can't be killed by Terraform, and have no visibility into graph walks, unmanaged providers are likely to have differences in how their global mutable state behaves when compared to managed providers. Namely, unmanaged providers are likely to retain global state when managed providers would have reset it. Developers relying on global state should be aware of this.

Another example of this is the expected behavior; this PR adds some warning if an already configured provider is reconfigured during some concurrent runs against a shared provider instance.

Finally a relevant note from the Terraform documentation:

It is important to start a provider in debug mode only when you intend to debug it, as its behavior will change in minor ways from normal operation of providers.

@ulucinar checked your shared provider implementation in Upjet and noticed that you're running the provider with TF_PLUGIN_MAGIC_COOKIE flag, which seems to put it in test mode. Do you think if it is exactly same as running with --debug flag or could there be differences?

turkenh commented 1 year ago

Currently, I am seeing the following options:

A. Keep shared GRPC mode disabled and use terraform just like an end user. B. Taking this limitation into account, consider running the provider in this mode but one per resource/providerconfig. This might improve performance by avoiding the forking provider process multiple times per reconciliation (e.g. a terraform apply run forking ~ 3 times). C. Work on a fix in Terraform provider plugin and propose it to Terraform community ...

For now, my vote would be for option A and focusing on other possible improvements on performance.

@ulucinar, I would really appreciate if you could check and let me know if I am missing something.

ulucinar commented 1 year ago

Hi @turkenh,

Do you think if it is exactly same as running with --debug flag or could there be differences?

IIRC, I had to include the TF_PLUGIN_MAGIC_COOKIE because it's set by the Terraform CLI while forking the native plugin. So my understanding is that you don't set the cookie when running the native plugin, right?

ulucinar commented 1 year ago

One other thing that we need to be careful about is that disabling the shared gRPC server (i.e., running the native plugin as a shared server between multiple clients) seems to have resolved certain duplication issues reported for provider-aws but we are still running provider-{gcp,azure} in that mode. My feeling is that the issues are provider specific (this is also supported by the fact that similar issues have not yet been observed with the other providers, or maybe we are not yet aware or not yet connected the dots), but still, we may decide to act cautiously and consider disabling this mode for others also.

I think we should seriously consider this (as this mode of operation is not what Terraform expects for production use) if we can improve performance by tweaking some other parameters and we find ourselves in a position where we think the performance gain we get from the shared gRPC server mode (on top of some future improvements) is not worth taking the risk.

I believe this is also inline with what you propose in A above (keep shared gRPC mode disabled and use terraform just like an end user).

turkenh commented 1 year ago

IIRC, I had to include the TF_PLUGIN_MAGIC_COOKIE because it's set by the Terraform CLI while forking the native plugin. So my understanding is that you don't set the cookie when running the native plugin, right?

Yes, this is how I was running the plugin:

./terraform-provider-aws_v4.35.0_x5 -debug
{"@level":"debug","@message":"plugin address","@timestamp":"2022-11-03T21:50:02.758995+03:00","address":"/var/folders/_5/jc7399qx6cn6t5535npv9z4c0000gn/T/plugin27659334","network":"unix"}
Provider started. To attach Terraform CLI, set the TF_REATTACH_PROVIDERS environment variable with the following:

    TF_REATTACH_PROVIDERS='{"registry.terraform.io/hashicorp/aws":{"Protocol":"grpc","ProtocolVersion":5,"Pid":2373,"Test":true,"Addr":{"Network":"unix","String":"/var/folders/_5/jc7399qx6cn6t5535npv9z4c0000gn/T/plugin27659334"}}}'

And, IIUC this is how we are running in upjet:

TF_PLUGIN_MAGIC_COOKIE=d602bf8f470bc67ca7faa0386276bbdd4330efaf76d1a219cb4d6991ca9872b2 ./terraform-provider-aws_v4.35.0_x5
{"@level":"debug","@message":"plugin address","@timestamp":"2022-11-03T21:50:14.119454+03:00","address":"/var/folders/_5/jc7399qx6cn6t5535npv9z4c0000gn/T/plugin3440378070","network":"unix"}
1|5|unix|/var/folders/_5/jc7399qx6cn6t5535npv9z4c0000gn/T/plugin3440378070|grpc|
turkenh commented 1 year ago

My feeling is that the issues are provider specific (this is also supported by the fact that similar issues have not yet been observed with the other providers, or maybe we are not yet aware or not yet connected the dots), but still, we may decide to act cautiously and consider disabling this mode for others also.

Based on my findings above, I would expect it to happen in others as well, however, we didn't see them because they were not tested/used under triggering conditions.

turkenh commented 1 year ago

we may decide to act cautiously and consider disabling this mode for others also.

@ulucinar it looks like it is already disabled for provider-azure and provider-gcp as well:

https://github.com/upbound/provider-azure/blob/main/cluster/images/provider-azure/Dockerfile#L45 https://github.com/upbound/provider-gcp/blob/main/cluster/images/provider-gcp/Dockerfile#L45

ulucinar commented 1 year ago

Thanks @turkenh, I was not aware that we had disabled it for Azure and GCP. I wonder whether it was intentional: https://github.com/upbound/provider-azure/pull/4 https://github.com/upbound/provider-gcp/pull/5

It seems like we have disabled shared gRPC server while opensourcing those providers.

muvaf commented 1 year ago

Thank you for digging in @turkenh ! I am also leaning towards dropping shared provider altogether given that there is an inherent risk of a config/creds used in another resource due to the architecture. But at the same time, I received feedback from the community that it becomes too expensive to run these providers when the shared mode is disabled.

I wonder if we'd gain some CPU/memory/time if we spin up a provider instance in Connect and kill it in Disconnect, i.e. provider instance per reconcile instead of per terraform call. In normal times, we make 2 TF calls; one to observe and one to calculate the diff. So, we'd reduce the number of provider invocations by half compared to no-sharing but I'm not sure if that's mostly negligible or not because even when we used single shared provider, we got ~20% gain overall if understand the graphs correctly here. Maybe we can re-use https://github.com/ulucinar/terrajet-scale to assess the differences?

Either way though, I agree that we can start by increasing the poll-interval to something like 10 minutes and making it configurable to alleviate some of the pain.

jeanduplessis commented 1 year ago

Closing this issue as the default poll interval and the ability to configure it has been addressed with https://github.com/upbound/provider-aws/pull/140