Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org
Other
7.33k stars 1.05k forks source link

Startup in Full Backend Tests sometimes fails #15997

Closed janheise closed 12 months ago

janheise commented 1 year ago

Current Behavior

Sometimes a timeout occurs running the full backend tests. It looks as if all the containers start but index creation is not finished until the timeout occurs. The log output (see below, rather large) indicates, that the graylog container is up and running. We do not output the logs of the OpenSearch container with this type of error, so it's unclear if there is a hangup here or if it's just bad performance.

Possible Solution

The final parts of the GL logs: the "doesn't exist" repeats until the timeout

17:41:17  2023-07-19 15:31:27,464 INFO : org.graylog2.bootstrap.ServerBootstrap - Services started, startup times in ms: {LocalKafkaMessageQueueReader [RUNNING]=0, GracefulShutdownService [RUNNING]=0, UserSessionTerminationService [RUNNING]=0, LocalKafkaMessageQueueWriter [RUNNING]=0, UrlWhitelistService [RUNNING]=0, FailureHandlingService [RUNNING]=0, InputSetupService [RUNNING]=0, PrometheusExporter [RUNNING]=0, BufferSynchronizerService [RUNNING]=1, OutputSetupService [RUNNING]=1, MongoDBProcessingStatusRecorderService [RUNNING]=1, EtagService [RUNNING]=2, GeoIpDbFileChangeMonitorService [RUNNING]=3, StreamCacheService [RUNNING]=4, LookupTableService [RUNNING]=5, LocalKafkaJournal [RUNNING]=7, JobSchedulerService [RUNNING]=13, ConfigurationEtagService [RUNNING]=15, PeriodicalsService [RUNNING]=17, JerseyService [RUNNING]=1082}
17:41:17  2023-07-19 15:31:27,464 INFO : org.graylog2.shared.initializers.InputSetupService - Triggering launching persisted inputs, node transitioned from Uninitialized [LB:DEAD] to Running [LB:ALIVE]
17:41:17  2023-07-19 15:31:27,466 INFO : org.graylog2.bootstrap.ServerBootstrap - Graylog server up and running.
17:41:17  2023-07-19 15:31:34,396 WARN : org.graylog2.indexer.fieldtypes.IndexFieldTypePollerPeriodical - Active write index for index set "Default index set" (64b801cc4aa9d946dec5a8c5) doesn't exist yet
17:41:17  2023-07-19 15:31:34,396 WARN : org.graylog2.indexer.fieldtypes.IndexFieldTypePollerPeriodical - Active write index for index set "Graylog Events" (64b801cd4aa9d946dec5a97f) doesn't exist yet
17:41:17  2023-07-19 15:31:34,400 WARN : org.graylog2.indexer.fieldtypes.IndexFieldTypePollerPeriodical - Active write index for index set "Graylog System Events" (64b801cd4aa9d946dec5a982) doesn't exist yet
17:41:17  2023-07-19 15:31:40,396 WARN : org.graylog2.indexer.fieldtypes.IndexFieldTypePollerPeriodical - Active write index for index set "Default index set" (64b801cc4aa9d946dec5a8c5) doesn't exist yet
17:41:17  2023-07-19 15:31:40,396 WARN : org.graylog2.indexer.fieldtypes.IndexFieldTypePollerPeriodical - Active write index for index set "Graylog Events" (64b801cd4aa9d946dec5a97f) doesn't exist yet
17:41:17  2023-07-19 15:31:40,396 WARN : org.graylog2.indexer.fieldtypes.IndexFieldTypePollerPeriodical - Active write index for index set "Graylog System Events" (64b801cd4aa9d946dec5a982) doesn't exist yet
17:41:17  2023-07-19 15:31:41,399 INFO : org.graylog2.periodical.IndexRangesCleanupPeriodical - Skipping index range cleanup because the Elasticsearch cluster is unreachable or unhealthy
17:41:17  2023-07-19 15:31:46,396 WARN : org.graylog2.indexer.fieldtypes.IndexFieldTypePollerPeriodical - Active write index for index set "Graylog Events" (64b801cd4aa9d946dec5a97f) doesn't exist yet

Stacktrace:


17:41:17  org.testcontainers.containers.ContainerLaunchException: Container startup failed
17:41:17    at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:349) ~[testcontainers-1.17.6.jar:1.17.6]
17:41:17    at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:322) ~[testcontainers-1.17.6.jar:1.17.6]
17:41:17    at org.graylog.testing.graylognode.NodeContainerFactory.createRunningContainer(NodeContainerFactory.java:186) ~[graylog-tests.jar:?]
17:41:17    at org.graylog.testing.graylognode.NodeContainerFactory.buildContainer(NodeContainerFactory.java:68) ~[graylog-tests.jar:?]
17:41:17    at org.graylog.testing.graylognode.NodeInstance.createStarted(NodeInstance.java:45) ~[graylog-tests.jar:?]
17:41:17    at org.graylog.testing.completebackend.ContainerizedGraylogBackend.create(ContainerizedGraylogBackend.java:86) ~[graylog-tests.jar:?]
17:41:17    at org.graylog.testing.completebackend.ContainerizedGraylogBackend.createStarted(ContainerizedGraylogBackend.java:59) ~[graylog-tests.jar:?]
17:41:17    at org.junit.platform.engine.support.hierarchical.ContainerMatrixHierarchicalTestEngine.lambda$execute$0(ContainerMatrixHierarchicalTestEngine.java:90) ~[graylog-tests.jar:?]
17:41:17    at java.lang.Iterable.forEach(Iterable.java:75) ~[?:?]
17:41:17    at java.util.Collections$SynchronizedCollection.forEach(Collections.java:2131) ~[?:?]
17:41:17    at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1092) ~[?:?]
17:41:17    at org.junit.platform.engine.support.hierarchical.ContainerMatrixHierarchicalTestEngine.execute(ContainerMatrixHierarchicalTestEngine.java:59) ~[graylog-tests.jar:?]
17:41:17    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:147) ~[junit-platform-launcher-1.9.1.jar:1.9.1]
17:41:17    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:127) ~[junit-platform-launcher-1.9.1.jar:1.9.1]
17:41:17    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:90) ~[junit-platform-launcher-1.9.1.jar:1.9.1]
17:41:17    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.lambda$execute$0(EngineExecutionOrchestrator.java:55) ~[junit-platform-launcher-1.9.1.jar:1.9.1]
17:41:17    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.withInterceptedStreams(EngineExecutionOrchestrator.java:102) [junit-platform-launcher-1.9.1.jar:1.9.1]
17:41:17    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:54) [junit-platform-launcher-1.9.1.jar:1.9.1]
17:41:17    at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:114) [junit-platform-launcher-1.9.1.jar:1.9.1]
17:41:17    at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:86) [junit-platform-launcher-1.9.1.jar:1.9.1]
17:41:17    at org.junit.platform.launcher.core.DefaultLauncherSession$DelegatingLauncher.execute(DefaultLauncherSession.java:86) [junit-platform-launcher-1.9.1.jar:1.9.1]
17:41:17    at org.apache.maven.surefire.junitplatform.LazyLauncher.execute(LazyLauncher.java:50) [surefire-junit-platform-3.0.0.jar:3.0.0]
17:41:17    at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.execute(JUnitPlatformProvider.java:184) [surefire-junit-platform-3.0.0.jar:3.0.0]
17:41:17    at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:148) [surefire-junit-platform-3.0.0.jar:3.0.0]
17:41:17    at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:122) [surefire-junit-platform-3.0.0.jar:3.0.0]
17:41:17    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:385) [surefire-booter-3.0.0.jar:3.0.0]
17:41:17    at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162) [surefire-booter-3.0.0.jar:3.0.0]
17:41:17    at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:507) [surefire-booter-3.0.0.jar:3.0.0]
17:41:17    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:495) [surefire-booter-3.0.0.jar:3.0.0]
17:41:17  Caused by: org.rnorth.ducttape.RetryCountExceededException: Retry limit hit with exception
17:41:17    at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:88) ~[duct-tape-1.0.8.jar:?]
17:41:17    at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:334) ~[testcontainers-1.17.6.jar:1.17.6]
17:41:17    ... 28 more
17:41:17  Caused by: org.testcontainers.containers.ContainerLaunchException: Could not create/start container
17:41:17    at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:542) ~[testcontainers-1.17.6.jar:1.17.6]
17:41:17    at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:344) ~[testcontainers-1.17.6.jar:1.17.6]
17:41:17    at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81) ~[duct-tape-1.0.8.jar:?]
17:41:17    at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:334) ~[testcontainers-1.17.6.jar:1.17.6]
17:41:17    ... 28 more
17:41:17  Caused by: org.rnorth.ducttape.TimeoutException: java.util.concurrent.TimeoutException
17:41:17    at org.rnorth.ducttape.timeouts.Timeouts.callFuture(Timeouts.java:70) ~[duct-tape-1.0.8.jar:?]
17:41:17    at org.rnorth.ducttape.timeouts.Timeouts.doWithTimeout(Timeouts.java:60) ~[duct-tape-1.0.8.jar:?]
17:41:17    at org.testcontainers.containers.wait.strategy.WaitAllStrategy.waitUntilReady(WaitAllStrategy.java:54) ~[testcontainers-1.17.6.jar:1.17.6]
17:41:17    at org.testcontainers.containers.GenericContainer.waitUntilContainerStarted(GenericContainer.java:953) ~[testcontainers-1.17.6.jar:1.17.6]
17:41:17    at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:485) ~[testcontainers-1.17.6.jar:1.17.6]
17:41:17    at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:344) ~[testcontainers-1.17.6.jar:1.17.6]
17:41:17    at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81) ~[duct-tape-1.0.8.jar:?]
17:41:17    at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:334) ~[testcontainers-1.17.6.jar:1.17.6]
17:41:17    ... 28 more
17:41:17  Caused by: java.util.concurrent.TimeoutException
17:41:17    at java.util.concurrent.FutureTask.get(FutureTask.java:204) ~[?:?]
17:41:17    at org.rnorth.ducttape.timeouts.Timeouts.callFuture(Timeouts.java:65) ~[duct-tape-1.0.8.jar:?]
17:41:17    at org.rnorth.ducttape.timeouts.Timeouts.doWithTimeout(Timeouts.java:60) ~[duct-tape-1.0.8.jar:?]
17:41:17    at org.testcontainers.containers.wait.strategy.WaitAllStrategy.waitUntilReady(WaitAllStrategy.java:54) ~[testcontainers-1.17.6.jar:1.17.6]
17:41:17    at org.testcontainers.containers.GenericContainer.waitUntilContainerStarted(GenericContainer.java:953) ~[testcontainers-1.17.6.jar:1.17.6]
17:41:17    at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:485) ~[testcontainers-1.17.6.jar:1.17.6]
17:41:17    at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:344) ~[testcontainers-1.17.6.jar:1.17.6]
17:41:17    at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81) ~[duct-tape-1.0.8.jar:?]
17:41:17    at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:334) ~[testcontainers-1.17.6.jar:1.17.6]
17:41:17    ... 28 more
17:41:17```
janheise commented 1 year ago

After Bernd reduced the number of parallel jobs on the Jenkins workers, a lot of sporadic occurring container startup problems were not seen again for a couple of days. I keep it open as a reminder but am hopeful that we don't see this error again.

janheise commented 1 year ago

https://github.com/Graylog2/graylog2-server/pull/16133 could be another try to resolve issues that sometimes seem to occur with the performance plugin on CI.