Kafka connection stalled during plan

MattPumphrey commented 4 years ago

I am in the process of migrating all of our topics from our Chef setup to this provider, but I am having some issues with connections not closing. We have over 500 topics in one environment, so across 3 environments this is going to get a bit sticky.

I had to turn on trace logging to find it, but here is the issue I am running into:

2020/02/05 12:26:46 [TRACE] UpgradeResourceState: schema version of kafka_topic.prt_forecast_legacy_import is still 0; calling provider "registry.terraform.io/-/kafka" for any other minor fixups
2020/02/05 12:26:46 [TRACE] GRPCProvider: UpgradeResourceState
2020/02/05 12:26:46 [TRACE] <root>: eval: *terraform.EvalRefreshDependencies
2020/02/05 12:26:46 [TRACE] <root>: eval: *terraform.EvalRefresh
2020/02/05 12:26:46 [TRACE] GRPCProvider: ReadResource
2020/02/05 12:26:48 [TRACE] dag/walk: vertex "root" is waiting for "provider.kafka (close)"
2020/02/05 12:26:49 [TRACE] dag/walk: vertex "provider.kafka (close)" is waiting for "kafka_topic.comments_notifications"
2020/02/05 12:26:53 [TRACE] dag/walk: vertex "root" is waiting for "provider.kafka (close)"
2020/02/05 12:26:54 [TRACE] dag/walk: vertex "provider.kafka (close)" is waiting for "kafka_topic.comments_notifications"
2020/02/05 12:26:58 [TRACE] dag/walk: vertex "root" is waiting for "provider.kafka (close)"
2020/02/05 12:26:59 [TRACE] dag/walk: vertex "provider.kafka (close)" is waiting for "kafka_topic.comments_notifications"
2020/02/05 12:27:03 [TRACE] dag/walk: vertex "root" is waiting for "provider.kafka (close)"
2020/02/05 12:27:04 [TRACE] dag/walk: vertex "provider.kafka (close)" is waiting for "kafka_topic.comments_notifications"
2020/02/05 12:27:08 [TRACE] dag/walk: vertex "root" is waiting for "provider.kafka (close)"
2020/02/05 12:27:09 [TRACE] dag/walk: vertex "provider.kafka (close)" is waiting for "kafka_topic.comments_notifications"
2020/02/05 12:27:13 [TRACE] dag/walk: vertex "root" is waiting for "provider.kafka (close)"
2020/02/05 12:27:14 [TRACE] dag/walk: vertex "provider.kafka (close)" is waiting for "kafka_topic.comments_notifications"
2020/02/05 12:27:18 [TRACE] dag/walk: vertex "root" is waiting for "provider.kafka (close)"
2020/02/05 12:27:19 [TRACE] dag/walk: vertex "provider.kafka (close)" is waiting for "kafka_topic.comments_notifications"
2020/02/05 12:27:23 [TRACE] dag/walk: vertex "root" is waiting for "provider.kafka (close)"
2020/02/05 12:27:24 [TRACE] dag/walk: vertex "provider.kafka (close)" is waiting for "kafka_topic.comments_notifications"
2020/02/05 12:27:28 [TRACE] dag/walk: vertex "root" is waiting for "provider.kafka (close)"
2020/02/05 12:27:29 [TRACE] dag/walk: vertex "provider.kafka (close)" is waiting for "kafka_topic.comments_notifications"
2020/02/05 12:27:33 [TRACE] dag/walk: vertex "root" is waiting for "provider.kafka (close)"
2020/02/05 12:27:34 [TRACE] dag/walk: vertex "provider.kafka (close)" is waiting for "kafka_topic.comments_notifications"

kafka-log-path.log
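
The repeating dag/walk lines in the trace above form a chain: "root" waits on "provider.kafka (close)", which in turn waits on a single topic resource. As a rough sketch (not part of the provider or of Terraform), a trace file can be summarized to show which vertex sits at the end of that chain. The function names `summarize_waits` and `blocking_vertices` are hypothetical, and the regular expression assumes the line format shown in the log above.

```python
import re
from collections import Counter

# Matches Terraform trace lines such as:
#   ... [TRACE] dag/walk: vertex "root" is waiting for "provider.kafka (close)"
WAIT_RE = re.compile(r'dag/walk: vertex "([^"]+)" is waiting for "([^"]+)"')


def summarize_waits(lines):
    """Count how often each (waiter, dependency) pair appears in the trace."""
    counts = Counter()
    for line in lines:
        match = WAIT_RE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


def blocking_vertices(counts):
    """Return dependencies that are waited on but never wait on anything themselves."""
    waiters = {waiter for waiter, _ in counts}
    return sorted({dep for _, dep in counts if dep not in waiters})


if __name__ == "__main__":
    # Two lines copied from the trace above.
    sample = [
        '2020/02/05 12:26:48 [TRACE] dag/walk: vertex "root" is waiting for "provider.kafka (close)"',
        '2020/02/05 12:26:49 [TRACE] dag/walk: vertex "provider.kafka (close)" is waiting for "kafka_topic.comments_notifications"',
    ]
    print(blocking_vertices(summarize_waits(sample)))
    # prints ['kafka_topic.comments_notifications']
```

For the log in this report, the end of the chain is kafka_topic.comments_notifications, which is the resource whose refresh never returns.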

Mongey commented 4 years ago

Having a bit of trouble seeing what the actual issue is. Given the title, it seems to just sit there doing nothing?

MattPumphrey commented 4 years ago

That's generally correct. On this one cluster we have 95 topics. When running the plan, the provider connects and is then unable to close the connection, which causes every plan to fail under the automation we are using, mainly Atlantis. Even when testing this, I can get through our Dev and Preprod environments, so this might be a broker issue. I am going to test again on Friday to see whether the issue is resolved. Either way, the fact that the provider couldn't close the connection, and didn't time out or error out after a certain period, is a problem.
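
Until the provider itself times out, one workaround for automation is to put a hard limit on the plan from the outside. The following is a minimal sketch, not something the provider or Atlantis offers: `run_with_timeout` is a hypothetical helper, and the 10-minute limit is an arbitrary assumption. `TF_LOG=TRACE` is the standard Terraform environment variable for the trace output shown earlier in this thread.

```python
import os
import subprocess
import sys


def run_with_timeout(cmd, timeout_s, extra_env=None):
    """Run cmd; return its exit code, or None if it exceeded timeout_s seconds."""
    env = dict(os.environ, **(extra_env or {}))
    try:
        completed = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return None
    return completed.returncode


if __name__ == "__main__":
    # Hypothetical 10-minute limit; adjust to the size of the environment.
    code = run_with_timeout(["terraform", "plan"], 600, {"TF_LOG": "TRACE"})
    if code is None:
        sys.exit("terraform plan timed out")
    sys.exit(code)
```

This only turns a silent hang into a visible failure; it does not address why the connection never closes.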

mmajis commented 4 years ago

Looks like I have the same issue with ACLs. We have 455 ACL entries, and the provider (v0.2.2 and v0.2.3) consistently hangs when running terraform plan with Terraform 0.12.16 or 0.12.21 on linux amd64 in a GitLab Docker runner container (alpine-3.9 and alpine-3.11), as well as in Docker on macOS.

Meanwhile, I'm able to run the same plan successfully on native macOS with terraform 0.12.16 and terraform-provider-kafka v0.2.2.

In the alpine container, the output stops like this with trace enabled:

...many repetitions of the same lines for different ACL entries...
2020-02-27T05:33:10.217Z [DEBUG] plugin.terraform-provider-kafka_v0.2.2: 2020/02/27 05:33:10 [INFO] Found ACL &{{2 redacted 4} [0xc000676690 0xc0006766c0 0xc0006766f0 0xc000676720 0xc000676750 0xc000676780 0xc0006767b0 0xc0006767e0]}
module.redacted.module.redacted.kafka_acl.allow_topic_read: Refreshing state... [id=User:redacted|*|Read|Allow|Topic|redacted|Prefixed]
2020-02-27T05:33:10.219Z [DEBUG] plugin.terraform-provider-kafka_v0.2.2: 2020/02/27 05:33:10 [INFO] Reading ACL
2020-02-27T05:33:10.219Z [DEBUG] plugin.terraform-provider-kafka_v0.2.2: 2020/02/27 05:33:10 [INFO] Reading ACL User:redacted|*|Read|Allow|Topic|redacted|Prefixed
2020-02-27T05:33:10.219Z [DEBUG] plugin.terraform-provider-kafka_v0.2.2: 2020/02/27 05:33:10 [INFO] Listing all ACLS

Mongey commented 4 years ago

Thanks @mmajis, I'll try to get this resolved this weekend.

mmajis commented 4 years ago

I wasn't able to reproduce this with a local simple single broker cluster, but it does reproduce with Confluent Cloud.

The initial terraform plan with new resources introduced was successful as was the subsequent terraform apply. But running a plan after that stalls.

Here's a trace log from a run of 500 ACLs as well as the ACL entries I used.

repro.zip

mmajis commented 4 years ago

Looks like #103 fixes the ACL plan issue. Could not reproduce the stall anymore. Thanks @Mongey !

Mongey / terraform-provider-kafka

Kafka connection stalled during plan #97