GoogleCloudPlatform / DataflowTemplates

Cloud Dataflow Google-provided templates for solving in-Cloud data tasks
https://cloud.google.com/dataflow/docs/guides/templates/provided-templates
Apache License 2.0

[Bug]: Kafka integration test log spam #1800

Open Abacn opened 2 months ago

Abacn commented 2 months ago

Related Template(s)

N/A

Template Version

N/A

What happened?

~The Java PR Action has long been flaky, but at least in the past it clearly showed which tests failed. However,~ Recently the test log size has increased substantially, likely due to Kafka template and test development. The log is now larger than 80 MB.

The majority of the log reads:

2024-08-19T15:39:29.6622952Z [kafka-producer-network-thread | producer-7] INFO org.apache.kafka.clients.NetworkClient - [Producer clientId=producer-7] Node 1 disconnected.
2024-08-19T15:39:29.6626237Z [kafka-producer-network-thread | producer-7] WARN org.apache.kafka.clients.NetworkClient - [Producer clientId=producer-7] Connection to node 1 (/10.128.0.40:57609) could not be established. Node may not be available.
2024-08-19T15:39:29.6629592Z [kafka-admin-client-thread | adminclient-3] INFO org.apache.kafka.clients.NetworkClient - [AdminClient clientId=adminclient-3] Node 1 disconnected.
2024-08-19T15:39:29.6633503Z [kafka-admin-client-thread | adminclient-3] WARN org.apache.kafka.clients.NetworkClient - [AdminClient clientId=adminclient-3] Connection to node 1 (/10.128.0.40:57609) could not be established. Node may not be available.

and

2024-08-19T13:31:21.7130918Z [docker-java-stream--996043679] INFO org.apache.beam.it.testcontainers.TestContainerResourceManager - confluentinc/cp-kafka:7.3.1: [2024-08-19 13:23:47,251] INFO [Broker id=1] Handling LeaderAndIsr request correlationId 1 from controller 1 for 5 partitions (state.change.logger)
2024-08-19T13:31:21.7133326Z 
2024-08-19T13:31:21.7143323Z [docker-java-stream--996043679] INFO org.apache.beam.it.testcontainers.TestContainerResourceManager - confluentinc/cp-kafka:7.3.1: [2024-08-19 13:23:47,253] TRACE [Broker id=1] Received LeaderAndIsr request LeaderAndIsrPartitionState(topicName='testkafkatogcsbinaryencoding-20240819-132347-044185', partitionIndex=0, controllerEpoch=1, leader=1, leaderEpoch=0, isr=[1], partitionEpoch=0, replicas=[1], addingReplicas=[], removingReplicas=[], isNew=true, leaderRecoveryState=0) correlation id 1 from controller 1 epoch 1 (state.change.logger)

Relevant log output

No response

AnandInguva commented 2 months ago

I think we need to separate the Kafka IT tests from the Java tests, since there are so many Kafka IT tests. Having a separate GH Action for the Kafka tests would partially solve this.

AnandInguva commented 2 months ago

I can take a look at this later this week and put up a PR that separates the Kafka tests into a different GH actions suite.

Abacn commented 2 months ago

The log spam causes the mvn command to hang for 30 minutes while it writes logs to the console, between the last test run and finalization; see https://github.com/GoogleCloudPlatform/DataflowTemplates/pull/1817#issuecomment-2317640571

We have to disable printing logs until resolved.

AnandInguva commented 2 months ago

@Abacn John mentioned that there is a way to configure Kafka to emit fewer logs.

> We have to disable printing logs until resolved.

What do you mean by this?
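The thread does not say which Kafka setting was meant. One candidate, sketched here as an assumption rather than as the fix that was adopted, is to lower the broker's log level through the environment variables that the Confluent `cp-kafka` image reads (`KAFKA_LOG4J_ROOT_LOGLEVEL` and `KAFKA_LOG4J_LOGGERS`). The class name `QuietBrokerEnv` and the method `forLevel` are hypothetical; `state.change.logger` is the logger name that appears in the broker lines quoted above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: builds environment variables that ask the broker to log less. */
final class QuietBrokerEnv {

  /** Returns env vars setting the root logger and state.change.logger to the given level. */
  static Map<String, String> forLevel(String level) {
    Map<String, String> env = new LinkedHashMap<>();
    // Root log4j level inside the broker container (assumed to be honored by cp-kafka).
    env.put("KAFKA_LOG4J_ROOT_LOGLEVEL", level);
    // Per-logger override for the logger that emits the LeaderAndIsr lines.
    env.put("KAFKA_LOG4J_LOGGERS", "state.change.logger=" + level);
    return env;
  }

  public static void main(String[] args) {
    // Print the variables that would be passed to the container.
    forLevel("WARN").forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
```

With Testcontainers, such a map could be handed to the container via `withEnv(Map<String, String>)` before it starts. This would only reduce broker-side output; the client-side `NetworkClient` messages come from the test JVM and would need separate handling.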

Abacn commented 2 months ago

> @Abacn John mentioned that there is a way to configure Kafka to emit fewer logs.
>
> > We have to disable printing logs until resolved.
>
> What do you mean by this?

For example, replace "-e" with "-q" here: https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/3b59345fd3847ed301395a37e67e394630f4e60f/cicd/internal/op/maven.go#L32 to see whether the workflow still gets stuck processing the log.

Abacn commented 2 months ago

I tried setting the log level, but it did not work:

// Raise the java.util.logging root logger threshold to WARNING.
LogManager logManager = LogManager.getLogManager();
java.util.logging.Logger rootLogger = logManager.getLogger("");
rootLogger.setLevel(Level.WARNING);

or

java.util.logging.Logger logger = java.util.logging.Logger.getLogger("org.apache.beam.it.testcontainers.TestContainerResourceManager");
logger.setLevel(Level.WARNING);

This is likely because the logger is attached to the test container as a log consumer:

https://github.com/apache/beam/blob/6901d7c862388ded58e1cda3286c429edab58c7c/it/testcontainers/src/main/java/org/apache/beam/it/testcontainers/TestContainerResourceManager.java#L75
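If the container output really is forwarded by a log consumer, one possible workaround is to filter lines by level before they are logged. The sketch below is illustrative only and is not the code in `TestContainerResourceManager`. The class name `LevelFilteringConsumer` is hypothetical, and the sketch assumes broker lines start with `[timestamp] LEVEL`, as in the samples quoted in this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical filter: forwards a container log line only at WARN level or above. */
final class LevelFilteringConsumer implements Consumer<String> {
  // Matches the broker prefix, e.g. "[2024-08-19 13:23:47,251] INFO ...".
  private static final Pattern LEVEL =
      Pattern.compile("^\\[[^\\]]+\\]\\s+(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\\b");

  private final Consumer<String> delegate;

  LevelFilteringConsumer(Consumer<String> delegate) {
    this.delegate = delegate;
  }

  /** Decides whether a single line should be passed on to the delegate. */
  static boolean shouldForward(String line) {
    Matcher m = LEVEL.matcher(line);
    if (!m.find()) {
      // No recognizable level (e.g. a stack trace line): keep it unless it is blank.
      return !line.trim().isEmpty();
    }
    String level = m.group(1);
    return level.equals("WARN") || level.equals("ERROR") || level.equals("FATAL");
  }

  @Override
  public void accept(String line) {
    if (shouldForward(line)) {
      delegate.accept(line);
    }
  }

  public static void main(String[] args) {
    List<String> kept = new ArrayList<>();
    LevelFilteringConsumer consumer = new LevelFilteringConsumer(kept::add);
    // Line taken from the log excerpt in this issue: dropped because it is INFO.
    consumer.accept("[2024-08-19 13:23:47,251] INFO [Broker id=1] Handling LeaderAndIsr request "
        + "correlationId 1 from controller 1 for 5 partitions (state.change.logger)");
    // Made-up WARN line for illustration: forwarded.
    consumer.accept("[2024-08-19 13:23:47,260] WARN example warning (made up for this sketch)");
    kept.forEach(System.out::println);
  }
}
```

With Testcontainers, a filter like this could be wired in through `withLogConsumer`, for example `frame -> consumer.accept(frame.getUtf8String())`. Whether the resource manager exposes a hook for that is not confirmed here.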

Abacn commented 1 month ago

The original issue title covered two separate issues (flaky Java tests and Kafka log spam). The Kafka PR action has now been separated from the Java PR action, though the log spam issue is still present. I changed the title and assigned the issue to the Kafka test owner.