aws-samples / msk-config-providers

MIT No Attribution

NullPointerException instantiating SecretsManagerConfigProvider #26

Open vtstanescu opened 3 months ago

vtstanescu commented 3 months ago

Hello,

We are encountering NPEs when using the config provider for Secrets Manager; we haven't tried any of the others. They seem to occur randomly, but frequently, when a worker or connector is started for the first time. We eventually had to hardcode values in the connector config to get a connector running across multiple workers.

Kafka version: 3.6.0
Java distribution and version: Amazon Corretto 22
MSK config providers version: 0.2.0

A connector /stop & /resume seemed to resolve the issue in the past when we were using only one worker, but with multiple workers, the issue pops up randomly on one or more workers.

[2024-08-26 06:58:35,156] ERROR [Worker clientId=connect-1, groupId=connect-<redacted>] Couldn't instantiate task <redacted>-5 because it has an invalid task configuration. This task will not execute until reconfigured. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1921)
java.lang.NullPointerException: Cannot invoke "software.amazon.awssdk.utils.AttributeMap$Value.get(software.amazon.awssdk.utils.AttributeMap$LazyValueSource)" because "value" is null
    at software.amazon.awssdk.utils.AttributeMap$Builder.resolveValue(AttributeMap.java:396)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1597)
    at software.amazon.awssdk.utils.AttributeMap$Builder.build(AttributeMap.java:362)
    at software.amazon.awssdk.core.client.config.SdkClientConfiguration$Builder.build(SdkClientConfiguration.java:232)
    at software.amazon.awssdk.core.client.builder.SdkDefaultClientBuilder.syncClientConfiguration(SdkDefaultClientBuilder.java:178)
    at software.amazon.awssdk.services.secretsmanager.DefaultSecretsManagerClientBuilder.buildClient(DefaultSecretsManagerClientBuilder.java:29)
    at software.amazon.awssdk.services.secretsmanager.DefaultSecretsManagerClientBuilder.buildClient(DefaultSecretsManagerClientBuilder.java:22)
    at software.amazon.awssdk.core.client.builder.SdkDefaultClientBuilder.build(SdkDefaultClientBuilder.java:155)
    at com.amazonaws.kafka.config.providers.SecretsManagerConfigProvider.checkOrInitSecretManagerClient(SecretsManagerConfigProvider.java:173)
    at com.amazonaws.kafka.config.providers.SecretsManagerConfigProvider.get(SecretsManagerConfigProvider.java:134)
    at org.apache.kafka.common.config.ConfigTransformer.transform(ConfigTransformer.java:103)
    at org.apache.kafka.connect.runtime.WorkerConfigTransformer.transform(WorkerConfigTransformer.java:58)
    at org.apache.kafka.connect.storage.ClusterConfigState.connectorConfig(ClusterConfigState.java:152)
    at org.apache.kafka.connect.runtime.distributed.DistributedHerder.startTask(DistributedHerder.java:1866)
    at org.apache.kafka.connect.runtime.distributed.DistributedHerder.lambda$getTaskStartingCallable$35(DistributedHerder.java:1919)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1570)
bdesert commented 2 months ago

checking on this

vtstanescu commented 1 month ago

Hi, @bdesert! Any updates on this?

I'm wondering if this can be caused by missing network connectivity between the workers of the Kafka Connect cluster.

We are running Kafka Connect in distributed mode, so one worker is the leader of the connector. I'm not sure about the implementation details in Kafka Connect, but I assume the leader worker of the connector generates the "rendered" config by reading data from the config providers and then passes the "rendered" config to all workers that are running tasks for the connector.
If the workers cannot communicate among themselves, e.g. because EC2 security group rules are missing, that may cause this issue.

We are using a self-hosted Kafka Connect cluster in AWS, and will test the secretsmanager config provider again to see if the issue reoccurs now that we've added the security group rules needed to allow traffic between the Kafka Connect workers. I don't have high expectations, though, as we've seen the same NPEs when running our connector in the Amazon MSK Connect service, but let's see.

bdesert commented 3 weeks ago

Hi @vtstanescu, OK, I got my head around it and I think I have a solution for this. I think the issue is caused by some type of expiration happening because I keep the builder open without rebuilding it. If we catch this exception and rebuild the builder, that should solve the issue. Alternatively, we could create the builder every time, but that would be expensive performance-wise: Kafka Connect re-evaluates the configuration all the time, so using a cached client would be way faster. I'll provide a fix in the next week or two and will tag you on the PR to review.
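
For illustration only, here is a minimal sketch of the catch-and-rebuild idea described in the comment above. It is not the project's actual fix, and it does not confirm the root cause. `CachedClient`, `newBuilder`, and `build` are hypothetical names, and plain Java functional interfaces stand in for the AWS SDK builder so the example has no dependencies. The assumption is that a failed `build` on a long-lived builder can be recovered by creating a fresh builder and retrying once.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of msk-config-providers): caches a client,
 * and if building it from the cached builder fails, discards that builder,
 * creates a fresh one, and retries the build exactly once.
 */
final class CachedClient<B, C> {
    private final Supplier<B> newBuilder; // creates a fresh builder (assumed to be cheap)
    private final Function<B, C> build;   // turns a builder into a client (may throw)
    private B builder;                    // long-lived builder, guarded by "this"
    private C client;                     // cached client, guarded by "this"

    CachedClient(Supplier<B> newBuilder, Function<B, C> build) {
        this.newBuilder = newBuilder;
        this.build = build;
    }

    /** Returns the cached client, building it on first use. */
    synchronized C get() {
        if (client != null) {
            return client; // fast path: reuse the cached client
        }
        if (builder == null) {
            builder = newBuilder.get();
        }
        try {
            client = build.apply(builder);
        } catch (RuntimeException staleBuilder) {
            // Assumption: the failure came from the old builder's state,
            // so a fresh builder is worth one retry. A second failure propagates.
            builder = newBuilder.get();
            client = build.apply(builder);
        }
        return client;
    }
}
```

With the real SDK, `newBuilder` could plausibly be something like `SecretsManagerClient::builder` and `build` a call to the builder's `build()` method, but that wiring is an assumption and is untested here. `get()` is synchronized because Kafka Connect may resolve configs from several task-starting threads at once. The sketch keeps the cached-client fast path that the comment argues for, so the cost of rebuilding is paid only after a failure.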