[CI] GeoIpDownloaderCliIT testInvalidTimestamp failing

mark-vieira commented 1 year ago

All the ingest-geoip module internal cluster tests are failing on Windows. This looks to be due to a recent refactoring where because we removed the reliance on docker, these tests can now run on Windows. My hunch is these never would have worked on Windows due to any number of reasons, and so they are just failing now because we're running them on the platform.

Build scan: https://gradle-enterprise.elastic.co/s/c4vwmewipm256/tests/:modules:ingest-geoip:internalClusterTest/org.elasticsearch.ingest.geoip.GeoIpDownloaderCliIT/testInvalidTimestamp

Reproduction line:

gradlew ':modules:ingest-geoip:internalClusterTest' --tests "org.elasticsearch.ingest.geoip.GeoIpDownloaderCliIT.testInvalidTimestamp" -Dtests.seed=C865186684FA4C7E -Dtests.locale=ga -Dtests.timezone=Africa/Asmara -Druntime.java=19

Applicable branches: main, 8.7, 8.6

Reproduces locally?: Yes

Failure history: https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.ingest.geoip.GeoIpDownloaderCliIT&tests.test=testInvalidTimestamp

Failure excerpt:

java.lang.AssertionError: 
Expected: not a collection containing "GeoLite2-ASN.mmdb"
     but: was <[GeoLite2-ASN.mmdb_LICENSE.txt, GeoLite2-ASN.mmdb_README.txt, GeoLite2-ASN.mmdb_COPYRIGHT.txt, GeoLite2-Country.mmdb, GeoLite2-ASN.mmdb, GeoLite2-City.mmdb]>

  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.junit.Assert.assertThat(Assert.java:956)
  at org.junit.Assert.assertThat(Assert.java:923)
  at org.elasticsearch.ingest.geoip.GeoIpDownloaderIT.lambda$testInvalidTimestamp$7(GeoIpDownloaderIT.java:181)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1112)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1085)
  at org.elasticsearch.ingest.geoip.GeoIpDownloaderIT.testInvalidTimestamp(GeoIpDownloaderIT.java:177)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
  at java.lang.reflect.Method.invoke(Method.java:578)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1589)

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-analytics-geo (Team:Analytics)

mark-vieira commented 1 year ago

@joegallo since you helped with the original PR #93488 do you have thoughts here? My initial reaction is to simply mute these tests on Windows for now, since we weren't running them on Windows before. I think it's something specific to the implemetnation of the internal cluster tests and Windows filesystem wonkiness. The YAML rest test do run and pass on Windows so at least we have more coverage than we used to here.

mark-vieira commented 1 year ago

I've decided to mute these for now until we decide whether we want to go through the effort to get these tests running on Windows.

It hadn't really occurred to me though that this highlights that any tests we currently have that rely on docker-based fixtures have zero coverage on Windows.

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-data-management (Team:Data Management)

masseyke commented 1 year ago

This looks similar to other windows tests I've seen. where the root cause was that that we have a race condition where the file handle is still open. Linux is happy to delete a file even if something has it open, and Windows is not. So I think the bug is us, not Windows. Here's the stack trace in this one:

  1> java.nio.file.FileSystemException: C:\Users\jenkins\workspace\platform-support\199\modules\ingest-geoip\build\testrun\internalClusterTest\temp\geoip-databases\Z78vw9CVT7KwL3hwRQPZQA\GeoLite2-Country.mmdb: The process cannot access the file because it is being used by another process
  1>    at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:92) ~[?:?]
  1>    at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103) ~[?:?]
  1>    at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:108) ~[?:?]
  1>    at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:275) ~[?:?]
  1>    at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105) ~[?:?]
  1>    at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.delete(FilterFileSystemProvider.java:134) ~[lucene-test-framework-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
  1>    at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.delete(FilterFileSystemProvider.java:134) ~[lucene-test-framework-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
  1>    at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.delete(FilterFileSystemProvider.java:134) ~[lucene-test-framework-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
  1>    at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.delete(FilterFileSystemProvider.java:134) ~[lucene-test-framework-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
  1>    at java.nio.file.Files.delete(Files.java:1152) ~[?:?]
  1>    at org.elasticsearch.ingest.geoip.DatabaseReaderLazyLoader.doClose(DatabaseReaderLazyLoader.java:235) ~[main/:?]
  1>    at org.elasticsearch.ingest.geoip.DatabaseReaderLazyLoader.close(DatabaseReaderLazyLoader.java:224) ~[main/:?]
  1>    at org.elasticsearch.ingest.geoip.DatabaseReaderLazyLoader.close(DatabaseReaderLazyLoader.java:218) ~[main/:?]
  1>    at org.elasticsearch.ingest.geoip.DatabaseNodeService.removeStaleEntries(DatabaseNodeService.java:385) ~[main/:?]
  1>    at org.elasticsearch.ingest.geoip.DatabaseNodeService.checkDatabases(DatabaseNodeService.java:258) ~[main/:?]
  1>    at org.elasticsearch.ingest.geoip.DatabaseNodeService.lambda$initialize$0(DatabaseNodeService.java:161) ~[main/:?]
  1>    at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:559) ~[elasticsearch-8.6.3-SNAPSHOT.jar:8.6.3-SNAPSHOT]
  1>    at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:545) ~[elasticsearch-8.6.3-SNAPSHOT.jar:8.6.3-SNAPSHOT]
  1>    at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:504) ~[elasticsearch-8.6.3-SNAPSHOT.jar:8.6.3-SNAPSHOT]
  1>    at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:428) ~[elasticsearch-8.6.3-SNAPSHOT.jar:8.6.3-SNAPSHOT]
  1>    at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:154) ~[elasticsearch-8.6.3-SNAPSHOT.jar:8.6.3-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:850) ~[elasticsearch-8.6.3-SNAPSHOT.jar:8.6.3-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:257) ~[elasticsearch-8.6.3-SNAPSHOT.jar:8.6.3-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:223) ~[elasticsearch-8.6.3-SNAPSHOT.jar:8.6.3-SNAPSHOT]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
  1>    at java.lang.Thread.run(Thread.java:1589) ~[?:?]

Here's where I posted about what I think causes a failure that's similar: https://github.com/elastic/elasticsearch/issues/92183#issuecomment-1382581670

And here are some others that might have the same problem (they've got similar symptoms anyway): https://github.com/elastic/elasticsearch/issues/91609 https://github.com/elastic/elasticsearch/issues/90444 https://github.com/elastic/elasticsearch/issues/77490 https://github.com/elastic/elasticsearch/issues/65047 https://github.com/elastic/elasticsearch/issues/92866

masseyke commented 1 year ago

Adding some notes about my debugging failures on this one: Since I don't have access to a Windows box, I hacked the code so that it always usesorg.apache.lucene.tests.mockfile.WindowsFS as the filesystem (that one works like windows in that it will blow up if you try to delete a file that is open). I ran it hundreds of times, with no failures. I put some artificial Thread.sleep() calls in before closing the file handles in DatabaseReaderLazyLoader:getDatabaseType (the try/finally block that calls databaseInputStream()). No luck. The only way I could reproduce this problem is to change WindowsFS to ignore when a file handle was released. Then I get the exact same error. But I have no idea how that could actually happen.

masseyke commented 1 year ago

Oh, I also logged out everywhere we accessed GeoLite2-Country.mmdb, just in case there was some place with a risk that we weren't closing a file handle. No luck. We only do it from these two places:

mark-vieira commented 1 year ago

It's somewhat worth noting that these tests running on Windows at all is relatively new. This only started a couple of months ago when the geo-ip fixture was refactored to remove the Docker requirement. So I'm not sure how much heartache we want to go through here to get these running, given we've only recently started running them on Windows, before we just mute. Your call.

DaveCTurner commented 1 year ago

I'm not sure if https://github.com/elastic/elasticsearch/issues/97082 is the same as this or not. It looks kinda related, but also doesn't seem to involve Windows. I'm leaving it as a separate issue but feel free to close-as-duplicate.

elastic / elasticsearch

[CI] GeoIpDownloaderCliIT testInvalidTimestamp failing #94098