Closed alpar-t closed 5 years ago
Pinging @elastic/es-core-infra
@jasontedor @bleskes any ideas here? I looked but it's really puzzling.
Pinging @elastic/es-distributed
Relates #30962
It's failed again.
I'm going to mute it on Windows unless someone says they need to capture more failure logs.
I've re-enabled the test on master (db327818dd0) and added DEBUG logging for org.elasticsearch.index.store,
which will hopefully tell us whether the move failed because a previous delete in renameTempFilesSafe
(where we catch the exception and only log it at DEBUG) had itself failed.
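For reference, the delete-then-rename sequence described here looks roughly like this (a hypothetical sketch, not the actual Store code; class and method names are illustrative):

```java
import java.io.IOException;
import java.nio.file.*;

public class RenameTempFileSketch {
    // Sketch of the sequence: delete any existing target, swallowing the failure
    // the way the real code logs it at DEBUG, then rename the temp file into
    // place. If the delete silently failed on Windows, the move below is the
    // call that would surface as AccessDeniedException.
    public static void renameTempFileSafe(Path tempFile, Path target) throws IOException {
        try {
            Files.deleteIfExists(target);
        } catch (IOException e) {
            // only logged at DEBUG in the real code, so a failure here is easy to miss
        }
        Files.move(tempFile, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rename-sketch");
        Path temp = Files.write(dir.resolve("recovery.tmp._0.cfs"), "new".getBytes());
        Path target = Files.write(dir.resolve("_0.cfs"), "old".getBytes());
        renameTempFileSafe(temp, target);
        System.out.println(new String(Files.readAllBytes(target)));
    }
}
```

When nothing else holds the target open, the move succeeds and the target ends up with the temp file's contents; the interesting case is when the hidden delete failure leaves the old file in the way.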
For the person on test triage: If this test fails again on CI, please link the latest failure here.
Unfortunately that was on 5.6 (which does not have the extended logging, and which we forgot to mute). I've muted the test on 5.6 for now (a6aa7738d59).
Now there's a master failure :)
PartitionedRoutingIT.testShrinking
seems to have a similar issue:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-windows-compatibility/16/console
Similar failures on 5.6, 6.x and master in the last 12 hours.
gradlew :core:integTest \
-Dtests.seed=41732671CC43CCD8 \
-Dtests.class=org.elasticsearch.routing.PartitionedRoutingIT \
-Dtests.method="testShrinking" \
-Dtests.security.manager=true \
-Dtests.locale=ar-MA \
-Dtests.timezone=Europe/Tirane
It seems to be caused by an AccessDeniedException:
Caused by: java.nio.file.AccessDeniedException: C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\server\build\testrun\integTest\J3\temp\org.elasticsearch.routing.PartitionedRoutingIT_22FF5B1F0934DA81-001\tempDir-002\data\nodes\0\indices\nCj_PB_1SvemMgYmkmpqNw\1\index\recovery.XyXJa1iLT_eQAR39YSXIKg._0.cfs -> C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\server\build\testrun\integTest\J3\temp\org.elasticsearch.routing.PartitionedRoutingIT_22FF5B1F0934DA81-001\tempDir-002\data\nodes\0\indices\nCj_PB_1SvemMgYmkmpqNw\1\index\_0.cfs
FYI, the most recent master change built was https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-windows-compatibility/19/console. I'm trying to get onto the machine and take a look at the logs now.
I think this also affects SplitIndexIT.testSplitFromOneToN.
These two build logs contain the extra debugging info:
I spent some time trying to reproduce this by hand on one of the Windows CI workers. Although I never saw the test itself fail, I did see tests reporting recovery failures for the reason quoted in the OP:
1> org.elasticsearch.indices.recovery.RecoveryFailedException: [second_shrink][0]: Recovery failed from {node_sd2}{TO36W-4uR0GLz8RwgnmWXg}{vtq1HlXqTy-5ut_hiS2-TQ}{127.0.0.1}{127.0.0.1:52206} into {node_sd1}{VwtnN_EuS8aKva_oys857Q}{OCu18ksZRdKzsOChRaIFkw}{127.0.0.1}{127.0.0.1:52177}
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:283) [main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:81) [main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:628) [main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
1> at java.lang.Thread.run(Thread.java:844) [?:?]
1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_sd2][127.0.0.1:52206][internal:index/shard/recovery/start_recovery]
1> Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:174) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:99) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:51) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:108) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:105) ~[main/:?]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[main/:?]
1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1448) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) ~[?:?]
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
1> at java.lang.Thread.run(Thread.java:844) ~[?:?]
1> Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [7] files with total size of [8kb]
1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:422) ~[main/:?]
1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:172) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:99) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:51) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:108) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:105) ~[main/:?]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[main/:?]
1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1448) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) ~[?:?]
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
1> at java.lang.Thread.run(Thread.java:844) ~[?:?]
1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_sd1][127.0.0.1:52177][internal:index/shard/recovery/clean_files]
1> Caused by: java.nio.file.AccessDeniedException: C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\server\build\testrun\integTest\J0\temp\org.elasticsearch.action.admin.indices.create.ShrinkIndexIT_3825ACCB7B6BF7DB-001\tempDir-002\data\nodes\1\indices\4EzGadc6Rpa77FYUbJVhvA\0\index\recovery.uKRsdtvIQbmCSaJo9IlMzA._0.cfs -> C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\server\build\testrun\integTest\J0\temp\org.elasticsearch.action.admin.indices.create.ShrinkIndexIT_3825ACCB7B6BF7DB-001\tempDir-002\data\nodes\1\indices\4EzGadc6Rpa77FYUbJVhvA\0\index\_0.cfs
1> at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:89) ~[?:?]
1> at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103) ~[?:?]
1> at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:298) ~[?:?]
1> at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:288) ~[?:?]
1> at org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147) ~[lucene-test-framework-8.0.0-snapshot-7d0a7782fa.jar:8.0.0-snapshot-7d0a7782fa 7d0a7782fa7e99250bccfb4d3e995485c3f0ca19 - jimczi - 2018-09-18 14:02:40]
1> at org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147) ~[lucene-test-framework-8.0.0-snapshot-7d0a7782fa.jar:8.0.0-snapshot-7d0a7782fa 7d0a7782fa7e99250bccfb4d3e995485c3f0ca19 - jimczi - 2018-09-18 14:02:40]
1> at java.nio.file.Files.move(Files.java:1413) ~[?:?]
1> at org.apache.lucene.store.FSDirectory.rename(FSDirectory.java:303) ~[lucene-core-8.0.0-snapshot-7d0a7782fa.jar:8.0.0-snapshot-7d0a7782fa 7d0a7782fa7e99250bccfb4d3e995485c3f0ca19 - jimczi - 2018-09-18 14:02:34]
1> at org.apache.lucene.store.MockDirectoryWrapper.rename(MockDirectoryWrapper.java:231) ~[lucene-test-framework-8.0.0-snapshot-7d0a7782fa.jar:8.0.0-snapshot-7d0a7782fa 7d0a7782fa7e99250bccfb4d3e995485c3f0ca19 - jimczi - 2018-09-18 14:02:34]
1> at org.apache.lucene.store.FilterDirectory.rename(FilterDirectory.java:89) ~[lucene-core-8.0.0-snapshot-7d0a7782fa.jar:8.0.0-snapshot-7d0a7782fa 7d0a7782fa7e99250bccfb4d3e995485c3f0ca19 - jimczi - 2018-09-18 14:02:34]
1> at org.apache.lucene.store.FilterDirectory.rename(FilterDirectory.java:89) ~[lucene-core-8.0.0-snapshot-7d0a7782fa.jar:8.0.0-snapshot-7d0a7782fa 7d0a7782fa7e99250bccfb4d3e995485c3f0ca19 - jimczi - 2018-09-18 14:02:34]
1> at org.elasticsearch.index.store.Store.renameTempFilesSafe(Store.java:335) ~[main/:?]
1> at org.elasticsearch.indices.recovery.RecoveryTarget.renameAllTempFiles(RecoveryTarget.java:188) ~[main/:?]
1> at org.elasticsearch.indices.recovery.RecoveryTarget.cleanFiles(RecoveryTarget.java:444) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$CleanFilesRequestHandler.messageReceived(PeerRecoveryTargetService.java:561) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$CleanFilesRequestHandler.messageReceived(PeerRecoveryTargetService.java:555) ~[main/:?]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[main/:?]
1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1448) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) ~[?:?]
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
1> at java.lang.Thread.run(Thread.java:844) ~[?:?]
Here is a copy of the console output, including repeated test runs (of which the stack trace above is the last). This tarball also includes a Process Monitor log from this run, filtered to just the relevant paths: 33857.tar.gz
Interesting things to note from the process monitor log are the operations on _0.cfs which result in CANNOT DELETE and ACCESS DENIED. I do not yet have a good idea for a next step on this.
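One way to picture the CANNOT DELETE / ACCESS DENIED interplay: if something still holds a handle on the target file, Windows can refuse both the delete and the subsequent rename, whereas POSIX simply replaces the directory entry and lets the open handle keep the old file alive. A hypothetical sketch (class and method names are mine, and the "denied" branch is the Windows-only outcome):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class OpenHandleRenameDemo {
    // Try to rename a recovery temp file over a target that a "reader" still
    // holds open. Depending on the platform and sharing flags, the move either
    // replaces the target or fails with the AccessDeniedException seen in this
    // issue.
    public static String attemptReplace(Path temp, Path target) throws IOException {
        try (FileChannel reader = FileChannel.open(target, StandardOpenOption.READ)) {
            try {
                // same ATOMIC_MOVE that FSDirectory.rename ends up using
                Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
                return "replaced"; // POSIX behaviour: rename swaps the entry
            } catch (AccessDeniedException e) {
                return "denied";   // the Windows failure mode in this issue
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("open-handle-demo");
        Path temp = Files.write(dir.resolve("recovery._0.cfs"), "new".getBytes());
        Path target = Files.write(dir.resolve("_0.cfs"), "old".getBytes());
        System.out.println(attemptReplace(temp, target));
    }
}
```

This doesn't prove that an open handle is what the CI workers are hitting, but it is the simplest mechanism consistent with the process monitor output.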
On repeated runs I do see occasional failures as per the OP (maybe 20% of the time; I've not done the stats in detail yet). I tried writing a test case that repeatedly hits the filesystem directly and could not reproduce this AccessDeniedException, so I decided to turn my attention to all the Lucene-provided filesystem machinery that's in play here, to see if I could rule it out as a source. The filesystem provider in use in a failing test is
HandleLimitFS(LeakFS(ShuffleFS(DisableFsyncFS(sun.nio.fs.WindowsFileSystemProvider@2d1a5acc))))
I was suspicious about that DisableFsyncFS in there (the rest look benign), so I added @SuppressFileSystems(value = "DisableFsyncFS") to the ShrinkIndexIT fixture, but still got failures. I then tried @SuppressFileSystems(value = "*") and it still fails; the filesystem provider in this case is confirmed to be a plain sun.nio.fs.WindowsFileSystemProvider. The story continues...
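The kind of direct-filesystem test case mentioned above might look like the following (a hypothetical sketch; the actual test is not shown in this thread). It loops over the same write temp file / delete target / atomic rename / briefly open-and-close cycle that a recovery performs, which is exactly the pattern that never reproduced the failure in isolation:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class RenameStressRepro {
    // Repeatedly exercise the write/delete/rename/read cycle against one target
    // path. Returns the final contents of the target so a caller can verify the
    // last rename actually landed.
    public static String run(Path dir, int iterations) throws IOException {
        Path target = dir.resolve("_0.cfs");
        for (int i = 0; i < iterations; i++) {
            Path temp = Files.write(dir.resolve("recovery." + i + "._0.cfs"),
                    ("gen-" + i).getBytes());
            Files.deleteIfExists(target);
            Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
            // simulate a reader briefly touching the freshly renamed file
            try (FileChannel ch = FileChannel.open(target, StandardOpenOption.READ)) {
                ch.size();
            }
        }
        return new String(Files.readAllBytes(target));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(run(Files.createTempDirectory("rename-stress"), 1_000));
    }
}
```

That this sort of single-process loop passes cleanly suggests the failure needs some extra ingredient — concurrent access, a lingering handle, or an external scanner — beyond plain filesystem traffic.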
We have more test failures on Windows with the same cause:
org.elasticsearch.routing.PartitionedRoutingIT#testShrinking()
[2018-09-27T18:57:01,933][WARN ][o.e.c.a.s.ShardStateAction] [node_s0] [index_2][0] received shard failed for shard id [[index_2][0]], allocation id [0Jyeijx5QoeBLzJ85Ym4uw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[index_2][0]: Recovery failed from {node_s1}{UFRWgSweS6uUDivYiTC1-w}{4v-Ak5QCQ62QpeiZyzwK8w}{local}{local[368]} into {node_s0}{Povm-pMfSAaqPh8OcDkjHg}{3UCCMKflQTOz9cF4-uejCA}{local}{local[366]}]; nested: RemoteTransportException[[node_s0][local[366]][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [25] files with total size of [36.4kb]]; nested: RemoteTransportException[[node_s1][local[368]][internal:index/shard/recovery/clean_files]]; nested: AccessDeniedException[C:\Users\jenkins\workspace\elastic+elasticsearch+5.6+multijob-windows-compatibility\core\build\testrun\integTest\J0\temp\org.elasticsearch.routing.PartitionedRoutingIT_D736F1C7B3CED870-001\tempDir-002\data\nodes\0\indices\qvLxi615Saym8cpW_c-H4Q\0\index\recovery.AWYcZDdo0i__cHMcxiHN._0.cfs -> C:\Users\jenkins\workspace\elastic+elasticsearch+5.6+multijob-windows-compatibility\core\build\testrun\integTest\J0\temp\org.elasticsearch.routing.PartitionedRoutingIT_D736F1C7B3CED870-001\tempDir-002\data\nodes\0\indices\qvLxi615Saym8cpW_c-H4Q\0\index\_0.cfs]; ] [...] 
Caused by: java.nio.file.AccessDeniedException: C:\Users\jenkins\workspace\elastic+elasticsearch+5.6+multijob-windows-compatibility\core\build\testrun\integTest\J0\temp\org.elasticsearch.routing.PartitionedRoutingIT_D736F1C7B3CED870-001\tempDir-002\data\nodes\0\indices\qvLxi615Saym8cpW_c-H4Q\0\index\recovery.AWYcZDdo0i__cHMcxiHN._0.cfs -> C:\Users\jenkins\workspace\elastic+elasticsearch+5.6+multijob-windows-compatibility\core\build\testrun\integTest\J0\temp\org.elasticsearch.routing.PartitionedRoutingIT_D736F1C7B3CED870-001\tempDir-002\data\nodes\0\indices\qvLxi615Saym8cpW_c-H4Q\0\index\_0.cfs
    at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:83) ~[?:?]
    at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97) ~[?:?]
    at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301) ~[?:?]
    at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287) ~[?:?]
    at org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147) ~[lucene-test-framework-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:40]
    at org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147) ~[lucene-test-framework-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:40]
    at java.nio.file.Files.move(Files.java:1395) ~[?:1.8.0_181]
    at org.apache.lucene.store.FSDirectory.rename(FSDirectory.java:297) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
    at org.apache.lucene.store.FilterDirectory.rename(FilterDirectory.java:88) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
    at org.apache.lucene.store.MockDirectoryWrapper.rename(MockDirectoryWrapper.java:229) ~[lucene-test-framework-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
    at org.apache.lucene.store.FilterDirectory.rename(FilterDirectory.java:88) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
    at org.elasticsearch.index.store.Store.renameTempFilesSafe(Store.java:319) ~[main/:?]
    at org.elasticsearch.indices.recovery.RecoveryTarget.renameAllTempFiles(RecoveryTarget.java:181) ~[main/:?]
    at org.elasticsearch.indices.recovery.RecoveryTarget.cleanFiles(RecoveryTarget.java:406) ~[main/:?]
    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$CleanFilesRequestHandler.messageReceived(PeerRecoveryTargetService.java:486) ~[main/:?]
    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$CleanFilesRequestHandler.messageReceived(PeerRecoveryTargetService.java:480) ~[main/:?]
    at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[main/:?]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[main/:?]
    at org.elasticsearch.transport.local.LocalTransport$2.doRun(LocalTransport.java:390) ~[main/:?]
org.elasticsearch.routing.PartitionedRoutingIT#testShrinking()
org.elasticsearch.indices.recovery.RecoveryFailedException: [index_4][1]: Recovery failed from {node_s0}{7C-fb4R9RK-WS4Jk1WYXuQ}{BijPdN92Q32zPDZ9t_ldiA}{127.0.0.1}{127.0.0.1:55085} into {node_s1}{RqyKMgp3QZ-gu0IVK_U0AA}{I5djfftGTtmtr6bQbOAUJA}{127.0.0.1}{127.0.0.1:55086}
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) [main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:80) [main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:624) [main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
1> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_s0][127.0.0.1:55089][internal:index/shard/recovery/start_recovery]
1> Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:174) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[main/:?]
1> at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?]
1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1350) ~[main/:?]
1> ... 5 more
1> Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [13] files with total size of [19.3kb]
1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:424) ~[main/:?]
1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:172) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[main/:?]
1> at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?]
1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1350) ~[main/:?]
1> ... 5 more
1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_s1][127.0.0.1:55090][internal:index/shard/recovery/clean_files]
1> Caused by: java.nio.file.AccessDeniedException: C:\Users\jenkins\workspace\elastic+elasticsearch+6.x+multijob-windows-compatibility\server\build\testrun\integTest\J2\temp\org.elasticsearch.routing.PartitionedRoutingIT_CAE88D19276CD2B2-001\tempDir-002\data\nodes\1\indices\H6TUqJpRTRedJr1z26jubw\1\index\recovery.fshzJJ1CQCelF3mD6b587Q._0.cfs -> C:\Users\jenkins\workspace\elastic+elasticsearch+6.x+multijob-windows-compatibility\server\build\testrun\integTest\J2\temp\org.elasticsearch.routing.PartitionedRoutingIT_CAE88D19276CD2B2-001\tempDir-002\data\nodes\1\indices\H6TUqJpRTRedJr1z26jubw\1\index\_0.cfs
1> at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:83) ~[?:?]
1> at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97) ~[?:?]
1> at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301) ~[?:?]
1> at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287) ~[?:?]
1> at org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147) ~[lucene-test-framework-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:16]
1> at java.nio.file.Files.move(Files.java:1395) ~[?:1.8.0_181]
1> at org.apache.lucene.store.FSDirectory.rename(FSDirectory.java:303) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
1> at org.apache.lucene.store.MockDirectoryWrapper.rename(MockDirectoryWrapper.java:231) ~[lucene-test-framework-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
1> at org.apache.lucene.store.FilterDirectory.rename(FilterDirectory.java:89) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
1> at org.apache.lucene.store.FilterDirectory.rename(FilterDirectory.java:89) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
1> at org.elasticsearch.index.store.Store.renameTempFilesSafe(Store.java:334) ~[main/:?]
1> at org.elasticsearch.indices.recovery.RecoveryTarget.renameAllTempFiles(RecoveryTarget.java:188) ~[main/:?]
1> at org.elasticsearch.indices.recovery.RecoveryTarget.cleanFiles(RecoveryTarget.java:452) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$CleanFilesRequestHandler.messageReceived(PeerRecoveryTargetService.java:557) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$CleanFilesRequestHandler.messageReceived(PeerRecoveryTargetService.java:551) ~[main/:?]
1> at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?]
1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1350) ~[main/:?]
org.elasticsearch.routing.PartitionedRoutingIT#testShrinking()
org.elasticsearch.indices.recovery.RecoveryFailedException: [index_4][3]: Recovery failed from {node_sd1}{O8l8bMuJTv2FfmOwRxXWgg}{1m7utzhqQjWidDUawrikfQ}{127.0.0.1}{127.0.0.1:63944} into {node_sd2}{IBcohRX6RB2qHr4S6wDBUQ}{Fa3f-veRRFOWmrrADglJ9A}{127.0.0.1}{127.0.0.1:63945}
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) [main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:80) [main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:623) [main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
1> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_sd1][127.0.0.1:63944][internal:index/shard/recovery/start_recovery]
1> Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:174) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[main/:?]
1> at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?]
1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1605) ~[main/:?]
1> ... 5 more
1> Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [4] files with total size of [4.5kb]
1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:446) ~[main/:?]
1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:172) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[main/:?]
1> at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?]
1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1605) ~[main/:?]
1> ... 5 more
1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_sd2][127.0.0.1:63945][internal:index/shard/recovery/clean_files]
1> Caused by: java.nio.file.AccessDeniedException: C:\Users\jenkins\workspace\elastic+elasticsearch+6.4+multijob-windows-compatibility\server\build\testrun\integTest\J2\temp\org.elasticsearch.routing.PartitionedRoutingIT_1C4D221BAD62999F-001\tempDir-002\data\nodes\2\indices\SI6IoZJ5R_Wb4Nqx06oDmA\3\index\recovery.CB9P345NRmmuu-tPzwjksg._0.cfs -> C:\Users\jenkins\workspace\elastic+elasticsearch+6.4+multijob-windows-compatibility\server\build\testrun\integTest\J2\temp\org.elasticsearch.routing.PartitionedRoutingIT_1C4D221BAD62999F-001\tempDir-002\data\nodes\2\indices\SI6IoZJ5R_Wb4Nqx06oDmA\3\index\_0.cfs
org.elasticsearch.action.admin.indices.create.SplitIndexIT#testCreateSplitIndexToN()
org.elasticsearch.indices.recovery.RecoveryFailedException: [second_split][0]: Recovery failed from {node_sd2}{hcRrVZwPQ7e8MTakDgz_GQ}{wfXHyRV6RPW0sKTXdamhdA}{127.0.0.1}{127.0.0.1:58177} into {node_sd3}{VlMAJBDSTemm3IJZgPciLA}{zPecWbBiSWaagZwl7Q1l6g}{127.0.0.1}{127.0.0.1:58171}
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) [main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:80) [main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:623) [main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
1> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_sd2][127.0.0.1:58177][internal:index/shard/recovery/start_recovery]
1> Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:174) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[main/:?]
1> at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?]
1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1605) ~[main/:?]
1> ... 5 more
1> Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [5] files with total size of [5.6kb]
1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:446) ~[main/:?]
1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:172) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[main/:?]
1> at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?]
1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1605) ~[main/:?]
1> ... 5 more
1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_sd3][127.0.0.1:58171][internal:index/shard/recovery/clean_files]
1> Caused by: java.nio.file.AccessDeniedException: C:\Users\jenkins\workspace\elastic+elasticsearch+6.4+multijob-windows-compatibility\server\build\testrun\integTest\J2\temp\org.elasticsearch.action.admin.indices.create.SplitIndexIT_1C4D221BAD62999F-001\tempDir-002\data\nodes\3\indices\xKk-zXSXQsGpeMmq5ysVpA\0\index\recovery._dztPVO1RBKeP7ZpRHIbiA._0.cfs -> C:\Users\jenkins\workspace\elastic+elasticsearch+6.4+multijob-windows-compatibility\server\build\testrun\integTest\J2\temp\org.elasticsearch.action.admin.indices.create.SplitIndexIT_1C4D221BAD62999F-001\tempDir-002\data\nodes\3\indices\xKk-zXSXQsGpeMmq5ysVpA\0\index\_0.cfs
1> at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:83) ~[?:?]
1> at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97) ~[?:?]
1> at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301) ~[?:?]
1> at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287) ~[?:?]
1> at org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147) ~[lucene-test-framework-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:47]
1> at org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147) ~[lucene-test-framework-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:47]
1> at org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147) ~[lucene-test-framework-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:47]
1> at org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147) ~[lucene-test-framework-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:47]
1> at java.nio.file.Files.move(Files.java:1395) ~[?:1.8.0_181]
1> at org.apache.lucene.store.FSDirectory.rename(FSDirectory.java:303) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
1> at org.apache.lucene.store.FilterDirectory.rename(FilterDirectory.java:89) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
1> at org.apache.lucene.store.FilterDirectory.rename(FilterDirectory.java:89) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
1> at org.elasticsearch.index.store.Store.renameTempFilesSafe(Store.java:337) ~[main/:?]
1> at org.elasticsearch.indices.recovery.RecoveryTarget.renameAllTempFiles(RecoveryTarget.java:188) ~[main/:?]
1> at org.elasticsearch.indices.recovery.RecoveryTarget.cleanFiles(RecoveryTarget.java:439) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$CleanFilesRequestHandler.messageReceived(PeerRecoveryTargetService.java:556) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$CleanFilesRequestHandler.messageReceived(PeerRecoveryTargetService.java:550) ~[main/:?] 1> at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?] 1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?] 1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1605) ~[main/:?]
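Stripped of the Lucene and recovery machinery, the failing step in the stack trace is a plain delete-then-rename of a temporary recovery file onto its final segment name. A minimal stdlib-only sketch of that shape (file names are illustrative, not the actual recovery code, which lives in Store.renameTempFilesSafe):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class RenameTempFileSketch {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("recovery-sketch");
        Path temp = dir.resolve("recovery.tmp._0.cfs"); // temp file written during recovery
        Path dest = dir.resolve("_0.cfs");              // final segment name
        Files.write(temp, new byte[]{0});
        Files.write(dest, new byte[]{0});               // stale copy from a previous attempt
        // The recovery code first deletes any existing file with the final name...
        Files.deleteIfExists(dest);
        // ...then renames the temp file onto it; this is the move that fails
        // with AccessDeniedException on Windows in the trace above.
        Files.move(temp, dest, StandardCopyOption.ATOMIC_MOVE);
        System.out.println(Files.exists(dest) && Files.notExists(temp));
    }
}
```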
Looking harder at the logs, what happens is: all shards of second_shrink end up on one node. However, we failed to delete the unassigned shards from the first node after the rebalance (at least twice; maybe it sometimes succeeds, we're still looking). For some reason, when we try to allocate a shard on top of the old directory, it then fails with the AccessDeniedException.
Note that we're using MMapDirectory here. Setting index.store.type: simplefs yielded no failure in 10 iterations.
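A plausible reason the store type matters: with MMapDirectory the segment files stay memory-mapped, and a mapping keeps the underlying file open even after its channel is closed, until the buffer is unmapped or garbage-collected. On Windows an open file blocks deletes and renames; on POSIX it does not. A small stdlib-only sketch of that lifetime (hypothetical names, not Elasticsearch code):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedFileLifetime {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("segment", ".cfs");
        Files.write(file, new byte[]{42});
        MappedByteBuffer mapped;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, 1);
        }
        // The channel is closed, but the mapping is still valid: the OS keeps
        // the file open until the buffer goes away. On Windows this is what
        // blocks deletes/renames of the file; SimpleFSDirectory holds no mapping.
        System.out.println(mapped.get(0));
    }
}
```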
I have tried to find a way to reproduce this with fewer moving parts, but so far this has failed. The following test passes.
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.action.admin.indices.create;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexOutput;
import org.elasticsearch.Version;
import org.elasticsearch.cluster.metadata.IndexMetaData;
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.shard.ShardId;
import org.elasticsearch.index.store.Store;
import org.elasticsearch.test.DummyShardLock;
import org.elasticsearch.test.ESTestCase;
import org.elasticsearch.test.junit.annotations.TestLogging;
import org.elasticsearch.threadpool.TestThreadPool;
import org.elasticsearch.threadpool.ThreadPool;
import java.io.IOException;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import static org.elasticsearch.cluster.metadata.IndexMetaData.INDEX_NUMBER_OF_REPLICAS_SETTING;
import static org.elasticsearch.cluster.metadata.IndexMetaData.INDEX_NUMBER_OF_SHARDS_SETTING;
import static org.elasticsearch.cluster.metadata.IndexMetaData.SETTING_INDEX_VERSION_CREATED;
import static org.elasticsearch.cluster.metadata.IndexMetaData.SETTING_NUMBER_OF_SHARDS;
import static org.elasticsearch.index.store.Store.INDEX_STORE_STATS_REFRESH_INTERVAL_SETTING;
@TestLogging("org.elasticsearch:TRACE,index.store.deletes:TRACE")
public class WindowsFilesystemTests extends ESTestCase {
    public void testWindowsFilesystem() {
        ThreadPool threadPool = new TestThreadPool("test");
        try {
            final Path tempDir1 = createTempDir();
            final IOContext ioContext = new IOContext();
            final int attemptCount = 1000;
            final CountDownLatch countDownLatch = new CountDownLatch(attemptCount);
            try (Directory rawDirectory = FSDirectory.open(tempDir1)) {
                logger.info("rawDirectory = {}", rawDirectory);
                final ShardId shardId = new ShardId("foo", "bar", 0);
                final Store store = new Store(shardId,
                    new IndexSettings(IndexMetaData.builder(shardId.getIndexName())
                        .settings(Settings.builder()
                            .put(INDEX_NUMBER_OF_SHARDS_SETTING.getKey(), 1)
                            .put(INDEX_NUMBER_OF_REPLICAS_SETTING.getKey(), 0)
                            .put(SETTING_INDEX_VERSION_CREATED.getKey(), Version.CURRENT)
                            .put(INDEX_STORE_STATS_REFRESH_INTERVAL_SETTING.getKey(), TimeValue.ZERO)).build(), Settings.EMPTY),
                    rawDirectory, new DummyShardLock(shardId));
                final AtomicBoolean queryStatsThreadCancelled = new AtomicBoolean();
                final Thread queryStatsThread = new Thread(() -> {
                    while (queryStatsThreadCancelled.get() == false) {
                        try {
                            store.stats();
                        } catch (IOException e) {
                            throw new AssertionError("unexpected", e);
                        }
                    }
                });
                queryStatsThread.start();
                final Directory directory = store.directory();
                logger.info("directory = {}", directory);
                for (int i = 0; i < attemptCount; i++) {
                    for (int j = 1; j <= 2; j++) {
                        try (IndexOutput indexOutput = directory.createOutput("test-" + i + "-" + j + ".dat", ioContext)) {
                            int longCount = randomIntBetween(100, 10000);
                            for (int k = 0; k < longCount; k++) {
                                indexOutput.writeLong(randomLong());
                            }
                        }
                    }
                }
                for (int i = 0; i < attemptCount; i++) {
                    final String fileName1 = "test-" + i + "-1.dat";
                    final String fileName2 = "test-" + i + "-2.dat";
                    threadPool.generic().execute(() -> {
                        try {
                            directory.deleteFile(fileName1);
                            Thread.sleep(randomIntBetween(1, 50));
                        } catch (IOException | InterruptedException e) {
                            countDownLatch.countDown();
                            throw new AssertionError("unexpected", e);
                        }
                        threadPool.generic().execute(() -> {
                            try {
                                directory.rename(fileName2, fileName1);
                            } catch (IOException e) {
                                countDownLatch.countDown();
                                throw new AssertionError("unexpected", e);
                            }
                            threadPool.generic().execute(() -> {
                                try {
                                    directory.deleteFile(fileName1);
                                } catch (IOException e) {
                                    throw new AssertionError("unexpected", e);
                                } finally {
                                    countDownLatch.countDown();
                                }
                            });
                        });
                    });
                }
                countDownLatch.await();
                queryStatsThreadCancelled.set(true);
                queryStatsThread.join();
            }
        } catch (IOException | InterruptedException e) {
            throw new AssertionError("exception", e);
        } finally {
            threadPool.shutdown();
        }
    }
}
One interesting side-question is why this happens more on tests to do with shrinking. I think this is the answer:
In other cases, I guess an allocation that fails like this will retry (up to 5 times) and subsequent attempts will succeed.
Correction: subsequent attempts to allocate a copy of the shrunk index onto the node containing the source index will, I think, all fail, because of https://github.com/elastic/elasticsearch/issues/33857#issuecomment-425438171 (unless either the source index is removed or closed, or else all the shared segments are merged away, neither of which seems very likely in this test).
To test this hypothesis, I tried writing a test that repeatedly allocates a shard to a node and then removes it, hoping to hit an allocation failure, with max retries set to 0:
@ClusterScope(scope = TEST, numDataNodes = 2)
@TestLogging("org.elasticsearch:TRACE,index.store.deletes:TRACE")
public class ReliableRecoveryIT extends ESIntegTestCase {
    public void testReliableRecovery() {
        prepareCreate("index").setSettings(Settings.builder().put(indexSettings())
                .put(INDEX_NUMBER_OF_SHARDS_SETTING.getKey(), 1)
                .put(INDEX_NUMBER_OF_REPLICAS_SETTING.getKey(), 1)
                .put(SETTING_ALLOCATION_MAX_RETRY.getKey(), 0))
            .get();
        ensureGreen("index");
        for (int i = 0; i < 20; i++) {
            internalCluster().client().prepareBulk()
                .add(new IndexRequest("index", "_doc", null).source("{}", XContentType.JSON))
                .add(new IndexRequest("index", "_doc", null).source("{}", XContentType.JSON))
                .add(new IndexRequest("index", "_doc", null).source("{}", XContentType.JSON))
                .add(new IndexRequest("index", "_doc", null).source("{}", XContentType.JSON))
                .add(new IndexRequest("index", "_doc", null).source("{}", XContentType.JSON))
                .setRefreshPolicy(IMMEDIATE)
                .get();
        }
        for (int i = 0; i < 10; i++) {
            logger.info("--> removing replica [{}]", i);
            internalCluster().client().admin().indices().prepareUpdateSettings("index").setSettings(
                Settings.builder().put(INDEX_NUMBER_OF_REPLICAS_SETTING.getKey(), 0)).get();
            logger.info("--> adding replica [{}]", i);
            internalCluster().client().admin().indices().prepareUpdateSettings("index").setSettings(
                Settings.builder().put(INDEX_NUMBER_OF_REPLICAS_SETTING.getKey(), 1)).get();
            logger.info("--> awaiting green [{}]", i);
            ensureGreen("index");
        }
    }
}
This passes. I'm mystified.
One interesting side-question is why this happens more on tests to do with shrinking. I think this is the answer:
Good catch. We should clean these up IMO.
Ok, I've got it 🎉
The following test passes on Windows (i.e. throws an AccessDeniedException on the rename). It seems that if you create a hard link to a file, and then delete the link, but the original file is still open, then you can't rename something else into the place where the hard link used to be.
public void testWindowsFilesystem() throws IOException {
    final Path directoryPath = createTempDir();
    final IOContext ioContext = new IOContext();
    try (Directory directory = FSDirectory.open(directoryPath)) {
        logger.info("directory = {}", directory);
        final String fileName1 = "test-1.dat";
        final String fileName2 = "test-2.dat";
        final String fileName3 = "test-3.dat";
        Files.write(directoryPath.resolve(fileName1), new byte[]{0}); // must be nonempty
        Files.write(directoryPath.resolve(fileName2), new byte[0]); // empty file
        try (IndexInput ignored = directory.openInput(fileName1, ioContext)) {
            Files.createLink(directoryPath.resolve(fileName3), directoryPath.resolve(fileName1));
            directory.deleteFile(fileName3);
            expectThrows(AccessDeniedException.class, () -> directory.rename(fileName2, fileName3));
        }
    }
}
More precisely, attempts to delete fileName3 while fileName1 is still open will fail with AccessDeniedException, but we suppress these inside FSDirectory. This only happens when using MMapDirectory and not SimpleFSDirectory.
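For what it's worth, the same shape can be reproduced without Lucene at all, using only java.nio. A sketch under the caveat that the AccessDeniedException only appears on Windows; on POSIX filesystems the move simply succeeds:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.AccessDeniedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class HardLinkRenameRepro {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("hardlink-repro");
        Path original = dir.resolve("test-1.dat");
        Path empty = dir.resolve("test-2.dat");
        Path link = dir.resolve("test-3.dat");
        Files.write(original, new byte[]{0}); // must be non-empty
        Files.write(empty, new byte[0]);
        // Keep the original open, standing in for an open/mapped segment file.
        try (FileChannel ignored = FileChannel.open(original, StandardOpenOption.READ)) {
            Files.createLink(link, original); // hard link to the open file
            Files.delete(link);               // delete the link while the file is open
            try {
                Files.move(empty, link);      // rename onto the link's old name
                System.out.println("move succeeded"); // observed on POSIX
            } catch (AccessDeniedException e) {
                System.out.println("move denied");    // observed on Windows
            }
        }
    }
}
```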
Still experiencing failure today on 6.4
REPRODUCE WITH: gradlew :server:integTest \
-Dtests.seed=544A239B8224F7C5 \
-Dtests.class=org.elasticsearch.routing.PartitionedRoutingIT \
-Dtests.method="testShrinking" \
-Dtests.security.manager=true \
-Dtests.locale=en-CA \
-Dtests.timezone=SystemV/YST9YDT
and on 6.x. Log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-windows-compatibility/35/console
07:44:40 FAILURE 33.3s J2 | PartitionedRoutingIT.testShrinking <<< FAILURES!
07:44:40 > Throwable #1: java.lang.AssertionError: timed out waiting for green state
07:44:40 > at __randomizedtesting.SeedInfo.seed([544A239B8224F7C5:AE69E6F0A5B2602E]:0)
07:44:40 > at org.elasticsearch.test.ESIntegTestCase.ensureColor(ESIntegTestCase.java:967)
07:44:40 > at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:923)
07:44:40 > at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:912)
07:44:40 > at org.elasticsearch.routing.PartitionedRoutingIT.testShrinking(PartitionedRoutingIT.java:96)
07:44:40 > at java.lang.Thread.run(Thread.java:748)
I have muted the PartitionedRoutingIT.testShrinking test on 6.4 and 6.x on Windows.
master is still failing similarly:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-windows-compatibility/42/console
REPRODUCE WITH: gradlew :server:integTest \
-Dtests.seed=7379D1534EE3AFC8 \
-Dtests.class=org.elasticsearch.routing.PartitionedRoutingIT \
-Dtests.method="testShrinking" \
-Dtests.security.manager=true \
-Dtests.locale=pl-PL \
-Dtests.timezone=US/Arizona \
-Dcompiler.java=11 \
-Druntime.java=8
We've worked out what's going on here so I've unassigned myself from this issue. Similar failures on Windows should be muted until we work out what to do with it - no need to report them here.
This is a Windows limitation with MMapDirectory that we cannot do anything about -> Closing
It also happened on 5.6, 6.4, 6.x and master.
I was not able to reproduce it on a Windows 2012 VM. There's no evidence of similar failures in the past, nor recent changes to the 5.6 branch that are obviously related.