busser / tfautomv

Generate Terraform moved blocks automatically for painless refactoring
https://github.com/busser/tfautomv
Apache License 2.0
705 stars 23 forks source link

Ignore replaced resources when creating moved blocks #84

Open untcha opened 9 months ago

untcha commented 9 months ago

We found another issue when using tfautomv.

The following shows automatically created moved blocks for 2 replaced "aws_route" resources. Which were cross-moved one to the other.

...
┌─ Summary
│ tfautomv made 177 comparisons and found 16 moves
│
│ 16 moves within current directory
│ ├─
│ │ from aws_route.dynamic["vpc-shared-services-tgw-next-prod_intra_0.0.0.0/0"]
│ │ to   aws_route.dynamic["vpc-shared-services-tgw-next-prod_web_0.0.0.0/0"]
│ ├─
│ │ from aws_route.dynamic["vpc-shared-services-tgw-next-prod_web_0.0.0.0/0"]
│ │ to   aws_route.dynamic["vpc-shared-services-tgw-next-prod_intra_0.0.0.0/0"]
│ ├─
...

Resulting in this error:

...
┌─ Error
│ failed to get plan for workdir ".": failed to compute Terraform plan: exit status 1
│
│ Error: Moved object still exists
│
│   on moves.tf line 1:
│    1: moved {
│
│ This statement declares a move from
│ aws_route.dynamic["vpc-shared-services-tgw-next-prod_intra_0.0.0.0/0"], but
│ that resource instance is still declared at tgw-attachment.tf:123,14.
│
│ Change your configuration so that this instance will be declared as
│ aws_route.dynamic["vpc-shared-services-tgw-next-prod_web_0.0.0.0/0"] instead.
│
│ Error: Moved object still exists
│
│   on moves.tf line 5:
│    5: moved {
│
│ This statement declares a move from
│ aws_route.dynamic["vpc-shared-services-tgw-next-prod_web_0.0.0.0/0"], but
│ that resource instance is still declared at tgw-attachment.tf:123,14.
│
│ Change your configuration so that this instance will be declared as
│ aws_route.dynamic["vpc-shared-services-tgw-next-prod_intra_0.0.0.0/0"]
│ instead.
│
└─
...

The following is an original terraform plan output log when NOT using tfautomv which clearly proves that terraform already realized that the resource address has not changed.

We believe that tfautomv should never act on replaced resources.

  # aws_route.dynamic["vpc-shared-services-tgw-next-prod_intra_0.0.0.0/0"] must be replaced
-/+ resource "aws_route" "dynamic" {
      ~ id                     = "r-rtb-06b06996b02bf5fd11080289494" -> (known after apply)
      + instance_id            = (known after apply)
      + instance_owner_id      = (known after apply)
      + network_interface_id   = (known after apply)
      ~ origin                 = "CreateRoute" -> (known after apply)
      ~ route_table_id         = "rtb-06b06996b02bf5fd1" -> (known after apply) # forces replacement
      ~ state                  = "active" -> (known after apply)
        # (2 unchanged attributes hidden)
    }

  # aws_route.dynamic["vpc-shared-services-tgw-next-prod_web_0.0.0.0/0"] must be replaced
-/+ resource "aws_route" "dynamic" {
      ~ id                     = "r-rtb-089a8c6633ab16d9a1080289494" -> (known after apply)
      + instance_id            = (known after apply)
      + instance_owner_id      = (known after apply)
      + network_interface_id   = (known after apply)
      ~ origin                 = "CreateRoute" -> (known after apply)
      ~ route_table_id         = "rtb-089a8c6633ab16d9a" -> (known after apply) # forces replacement
      ~ state                  = "active" -> (known after apply)
        # (2 unchanged attributes hidden)
    }

In pkg/engine/plan.go in the SummarizeJSONPlan function we wanted to identify the involved change actions. The "replace" results from the following delete and create actions in combination, which we simply printed out.

fmt.Printf("Change Action: %s for resource address: %s\n", rc.Change.Actions, rc.Address)

Because tfautomv acts on each delete and create individually it tries to find an unnecessary "move partner" resource and only the other replaced one is available.

...
Change Action: [no-op] for resource address: aws_iam_policy.vpc_flow_log_cloudwatch[0]
Change Action: [no-op] for resource address: aws_iam_role.vpc_flow_log_cloudwatch[0]
Change Action: [no-op] for resource address: aws_iam_role_policy_attachment.vpc_flow_log_cloudwatch[0]
Change Action: [delete create] for resource address: aws_route.dynamic["vpc-shared-services-tgw-next-prod_intra_0.0.0.0/0"]
Change Action: [delete create] for resource address: aws_route.dynamic["vpc-shared-services-tgw-next-prod_web_0.0.0.0/0"]
Change Action: [no-op] for resource address: aws_route53_resolver_query_log_config.this[0]
Change Action: [no-op] for resource address: aws_route53_resolver_query_log_config_association.this[0]
Change Action: [no-op] for resource address: aws_route53_resolver_rule.system
...
busser commented 9 months ago

Thanks for reporting this!

I believe this is a consequence of a core assumption that tfautomv makes: if it finds all the correct moves then Terraform's plan will be empty. By that assumption, if Terraform plans to replace a resource it's because the original resource was moved elsewhere and another resource was moved to where the first resource was. My understanding is that in your case this assumption was wrong, and that the aws_route resources were really changing. But a wrong assumption is no excuse to generate broken code.

I'm trying to wrap my head around how tfautomv should handle this sort of situation. I don't think that ignoring resources planned for replacement is the solution, because there are legitimate cases where such resources need to be moved. For instance:

- resource "random_pet" "alpha" {
+ resource "random_pet" "bravo" {
    length = 2
  }

- resource "random_pet" "bravo" {
+ resource "random_pet" "charlie" { 
    length = 3
  }

In this example, Terraform would plan to replace the random_pet.bravo resource but it was in fact moved to random_pet.charlie and replaced by random_pet.alpha. That being said, I'm not sure that tfautomv handles that case perfectly either 😅 In this example, tfautomv would need to generate 2 moves in a specific order, and I don't think it has logic to make sure of that.

The specific issue you describe is a bit more complex, because there is no valid order for the moves. Here's a variation on the previous example:

- resource "random_pet" "alpha" {
+ resource "random_pet" "bravo" {
    length = 2
  }

- resource "random_pet" "bravo" {
+ resource "random_pet" "alpha" { 
    length = 3
  }

In this case, tfautomv would need to swap the two resources. There's no way to do that atomically (Terraform processes moves one by one). I don't know if it's possible to move a resource to a third temporary address, with no matching code, or if this case simply can't be handled and tfautomv should log that it needs the user's help here.

Now back to your specific case. I think the issue can be reproduced this way (I'll need to actually test this later):

  resource "random_integer" "alpha" {
    min = 1
-   max = 2
+   max = 3
  }

  resource "random_pet" "alpha" {
    length = random_integer.alpha.result
  }

  resource "random_integer" "bravo" {
    min = 1
-   max = 2
+   max = 3
  }

  resource "random_pet" "bravo" {
    length = random_integer.bravo.result
  }

In this example, both random_pet.alpha and random_pet.bravo are planned for replacement. The question is: how is tfautomv supposed to know that it shouldn't move these random_pet resources? I think the answer may be: because they have multiple matches. In this case, random_pet.alpha matches both random_pet.bravo and itself. Therefore, tfautomv can't be sure where to move so it shouldn't move it.

I think that the bug here isn't that tfautomv moves resources planned for replacement, but rather that tfautomv doesn't consider the possibility of a resource matching itself.

This is a pretty subtle bug so I may be missing something. What is your take on this? 🤔

untcha commented 8 months ago

Hi Arthur,

first, sorry for the late reply (we were very busy with a big (state) migration - very successful thanks to tfautomv 😃), and thanks a lot for the detailed answer.

After we looked at your examples above, we believe that there are cases where the generation of moved blocks for replace actions is meaningful. But since tfautomv is currently not able to reliably create these moved blocks in the correct order, we believe that the default should be to not act on terraform replace action.

Instead, we propose to add a parameter/flag e.g. --force-replace to allow generation of moved blocks as it is currently implemented in latest version of tfautomv, but without this parameter/flag no moved blocks should be created for replace actions.

What do you think about this proposal?