elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[CI] FullClusterRestartIT.testSnapshotRestore breaks other tests by wiping out persistent tasks #36816

Open · tvernum opened this issue 5 years ago

tvernum commented 5 years ago

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/867/console

Reproduces for me with a variety of seeds.

./gradlew :x-pack:qa:full-cluster-restart:with-system-key:v6.6.0#upgradedClusterTestRunner \
-Dtests.seed=A29E5A089272CB0C \
-Dtests.class=org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT \
-Dtests.method="testMigration" \
-Dtests.security.manager=true -Dtests.locale=ms -Dtests.timezone=America/Araguaina \
-Dcompiler.java=11 -Druntime.java=8
MlMigrationFullClusterRestartIT.testMigration <<< FAILURES!
   > Throwable #1: java.lang.AssertionError
   >    at __randomizedtesting.SeedInfo.seed([655FBD85FBD8974E:A6E3BD79B4BC8D2D]:0)
   >    at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.lambda$waitForMigration$2(MlMigrationFullClusterRestartIT.java:167)
   >    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:847)
   >    at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.waitForMigration(MlMigrationFullClusterRestartIT.java:159)
   >    at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.upgradedClusterTests(MlMigrationFullClusterRestartIT.java:109)
   >    at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.testMigration(MlMigrationFullClusterRestartIT.java:77)
   >    at java.lang.Thread.run(Thread.java:745)
   >    Suppressed: java.lang.AssertionError
   >            at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.lambda$waitForMigration$2(MlMigrationFullClusterRestartIT.java:167)
   >            at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   >            ... 40 more
   >    Suppressed: java.lang.AssertionError
   >            at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.lambda$waitForMigration$2(MlMigrationFullClusterRestartIT.java:167)
   >            at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   >            ... 40 more

I'm going to mute the test.

elasticmachine commented 5 years ago

Pinging @elastic/ml-core

tvernum commented 5 years ago

Muted in 06b175d

pgomulka commented 5 years ago

I have this one failing on 6.6 and a few times on 6.x (build-stats query for MlMigrationFullClusterRestartIT).

I wonder if we should backport this change?

./gradlew :x-pack:qa:full-cluster-restart:with-system-key:v6.6.0#upgradedClusterTestRunner \
  -Dtests.seed=D791D0C47DC7EAF4 \
  -Dtests.class=org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT \
  -Dtests.method="testMigration" \
  -Dtests.security.manager=true \
  -Dtests.locale=mt \
  -Dtests.timezone=America/Nipigon \
  -Dcompiler.java=11 \
  -Druntime.java=8

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+bwc-tests/125/console

droberts195 commented 5 years ago

I backported #37077 to 6.6 to fix the reproducible failure that occurs when testing upgrade from 6.6.0 to 6.6.1.

The 6.x failure must be something different as that already had #37077.

droberts195 commented 5 years ago

One of the 6.x failures is https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/376/consoleText

The error is:

   > Throwable #1: org.junit.ComparisonFailure: expected:<[open]ed> but was:<[clos]ed>
   >    at __randomizedtesting.SeedInfo.seed([3C4D9B41A53DB36A:FFF19BBDEA59A909]:0)
   >    at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.lambda$waitForJobToBeAssigned$0(MlMigrationFullClusterRestartIT.java:181)
   >    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:848)
   >    at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.waitForJobToBeAssigned(MlMigrationFullClusterRestartIT.java:172)
   >    at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.upgradedClusterTests(MlMigrationFullClusterRestartIT.java:146)
   >    at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.testMigration(MlMigrationFullClusterRestartIT.java:87)
   >    at java.lang.Thread.run(Thread.java:748)

The root cause seems to be a race. From v5.6.10#upgradedClusterTestCluster node0/elasticsearch-6.7.0-SNAPSHOT/logs/full-cluster-restart.log:

[2019-01-30T22:06:25,714][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [migration-old-cluster-open-job] [autodetect/21203] [CResourceMonitor.cc@68] Setting model memory limit to 4096 MB

...

[2019-01-30T22:06:27,641][INFO ][o.e.x.m.j.p.a.ProcessContext] [node-0] Killing job [migration-old-cluster-open-job], because [task has been removed, cancelling locally]
[2019-01-30T22:06:27,649][WARN ][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [node-0] [migration-old-cluster-open-job] some results not processed due to the process being killed
[2019-01-30T22:06:27,653][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [controller/20292] [CDetachedProcessSpawner.cc@177] Child process with PID 21203 was terminated by signal 15
[2019-01-30T22:06:27,653][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [node-0] Received ban for the parent [lCoBrFUwS42T5jSfuPOm4A:890] on the node [lCoBrFUwS42T5jSfuPOm4A], reason: [task has been removed, cancelling locally]

...

[2019-01-30T22:06:29,935][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [migration-old-cluster-open-job] [autodetect/21319] [CResourceMonitor.cc@68] Setting model memory limit to 4096 MB

...

[2019-01-30T22:06:30,137][ERROR][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Could not set job state to [opened] for job [migration-old-cluster-open-job]
org.elasticsearch.ResourceNotFoundException: the task with id job-migration-old-cluster-open-job and allocation id 7 doesn't exist
    at org.elasticsearch.persistent.PersistentTasksClusterService$4.execute(PersistentTasksClusterService.java:235) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:642) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:271) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:201) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:136) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]

Or in plain English:

  1. The node is upgraded and notices that it is supposed to have migration-old-cluster-open-job running on it, so it starts a process for migration-old-cluster-open-job.
  2. The node notices that there is no persistent task for migration-old-cluster-open-job, or that the persistent task is not assigned to this node, so it kills the process started in step 1.
  3. The node once again notices that it is supposed to have migration-old-cluster-open-job running on it, so it starts another process for migration-old-cluster-open-job.
  4. Meanwhile, the update to the task state caused by step 2 has gone through.
  5. The update to the task state triggered by step 3 fails because the task has been updated since the cluster state version that step was working from (modelled in the sketch below).
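
To make step 5 concrete, here is a rough, self-contained model of the allocation-id check that throws the `ResourceNotFoundException` shown above. It deliberately does not use the real `PersistentTasksClusterService` API; the task id and allocation id 7 are taken from the log, while the "newer" allocation id 8 and everything else are illustrative.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Objects;

// Simplified model, not Elasticsearch source: the master only applies a task state
// update if the task still exists in the cluster state under the allocation id the
// requester observed when it started working.
public class PersistentTaskUpdateModel {

    /** taskId -> allocation id currently recorded in the (model) cluster state. */
    private final Map<String, Long> tasks;

    PersistentTaskUpdateModel(Map<String, Long> tasks) {
        this.tasks = tasks;
    }

    void setTaskState(String taskId, long expectedAllocationId, String newState) {
        Long currentAllocationId = tasks.get(taskId);
        if (!Objects.equals(currentAllocationId, expectedAllocationId)) {
            // Mirrors the wording of the real error in the log above.
            throw new IllegalStateException("the task with id " + taskId
                    + " and allocation id " + expectedAllocationId + " doesn't exist");
        }
        // ...apply the state update ("opened", "closed", ...) to the cluster state...
    }

    public static void main(String[] args) {
        // By the time the "set state to opened" update from step 3 arrives, the task has
        // been removed or reassigned (steps 2 and 4), so the (task id, allocation id)
        // pair the update carries no longer exists in the cluster state.
        PersistentTaskUpdateModel clusterState = new PersistentTaskUpdateModel(
                Collections.singletonMap("job-migration-old-cluster-open-job", 8L));
        clusterState.setTaskState("job-migration-old-cluster-open-job", 7L, "opened"); // throws
    }
}
```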

I worry that this is a real bug that would affect customers on 6.7. Immediately before the first attempt at starting the job we have:

[2019-01-30T22:06:24,641][INFO ][o.e.x.m.MlConfigMigrator ] [node-0] ml job configurations migrated: [migration-old-cluster-closed-job, migration-old-cluster-open-job]
[2019-01-30T22:06:24,641][INFO ][o.e.x.m.MlConfigMigrator ] [node-0] ml datafeed configurations migrated: [migration-old-cluster-started-datafeed, migration-old-cluster-stopped-datafeed]
[2019-01-30T22:06:24,695][WARN ][o.e.x.w.WatcherService   ] [node-0] not starting watcher, upgrade API run required: .watches[false], .triggered_watches[false]
[2019-01-30T22:06:24,775][WARN ][o.e.x.w.WatcherService   ] [node-0] not starting watcher, upgrade API run required: .watches[false], .triggered_watches[false]
[2019-01-30T22:06:24,831][WARN ][o.e.x.w.WatcherService   ] [node-0] not starting watcher, upgrade API run required: .watches[false], .triggered_watches[false]
[2019-01-30T22:06:24,833][WARN ][o.e.x.m.MlAssignmentNotifier] [node-0] [migration-old-cluster-open-job] No node found to start datafeed [migration-old-cluster-started-datafeed]. Reasons []
[2019-01-30T22:06:24,878][WARN ][o.e.x.w.WatcherService   ] [node-0] not starting watcher, upgrade API run required: .watches[false], .triggered_watches[false]
[2019-01-30T22:06:24,895][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Opening job [migration-old-cluster-open-job]

We rewrite the persistent tasks in that migration step, and there could be a race there.

Thankfully the step of migrating configs for open jobs is new in 6.7, so not released yet. We need to scrutinise in detail the code that rewrites the persistent tasks when migrating the open job configs before 6.7 is released.

The race is probably not that likely: the BWC tests had iterated through the 11 upgrade source versions from 5.6.0 to 5.6.10 before hitting this problem, and there's no reason to think it's unique to upgrading from 5.6.10. That suggests something like a 1-in-11 (roughly 9%) chance of the race occurring on any given upgrade run.

cbuescher commented 5 years ago

@droberts195 I see several others of these over the weekend on 6.x and master as well, e.g.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/394/console

org.junit.ComparisonFailure: expected:<[open]ed> but was:<[clos]ed>
    at __randomizedtesting.SeedInfo.seed([37F2A222C9FBA52A:F44EA2DE869FBF49]:0)
    at org.junit.Assert.assertEquals(Assert.java:115)
    at org.junit.Assert.assertEquals(Assert.java:144)
    at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.lambda$waitForJobToBeAssigned$0(MlMigrationFullClusterRestartIT.java:181)
    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:848)
    at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.waitForJobToBeAssigned(MlMigrationFullClusterRestartIT.java:172)

and also:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1796/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/389/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/388/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=sles-12/216/console

Should we mute this or are you close to a fix?

droberts195 commented 5 years ago

@cbuescher please can you mute it in master and 6.x? I think it's going to be at least a day before I can have a fix ready to merge.

cbuescher commented 5 years ago

Muted on master and 6.x.

droberts195 commented 5 years ago

I found the problem, and thankfully it is not in the code that we expect production users will run. The problem is that FullClusterRestartIT.testSnapshotRestore restores a snapshot with include_global_state=true. Depending on the order the tests run, this can wipe out the persistent tasks in the cluster state, or cause them to be different to the ones the PersistentTasksNodeService has recorded as currently assigned to a node, thus causing a reassignment.

(The fact that restoring global cluster state resets persistent tasks to whatever state they were in at the time the snapshot was taken has been the case ever since persistent tasks were introduced. So this is an effect that production users can experience, but ML config migration doesn't change it one way or the other.)
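
To illustrate the mechanism (this is not the actual FullClusterRestartIT code, and the repository and snapshot names are made up), a restore issued through the low-level REST client with `include_global_state` set to true looks roughly like this; the restored global state replaces whatever persistent tasks are currently in the cluster state:

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public final class RestoreWithGlobalStateExample {

    // Restore a snapshot including the global cluster state. The persistent tasks that
    // existed when the snapshot was taken replace the current ones, which can unassign
    // or reassign tasks that other tests are busy-waiting on.
    static Response restoreWithGlobalState(RestClient client) throws Exception {
        Request restore = new Request("POST", "/_snapshot/test_repo/test_snapshot/_restore");
        restore.addParameter("wait_for_completion", "true");
        restore.setJsonEntity("{ \"include_global_state\": true }");
        return client.performRequest(restore);
    }
}
```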

The tests don't run in parallel, so one might expect them not to interact in this way. But MlMigrationFullClusterRestartIT.testMigration relies on behaviour that is performed automatically on any cluster state update, so its assertions are not checking the result of actions the test itself takes; they are checking the effects of the cluster starting up on the newer version and doing anything that results in a cluster state update.
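
For context, the wait that fails is a busy-wait of roughly this shape: a sketch assuming a test class extending ESRestTestCase and using the 6.x ML job stats endpoint; the actual assertions in MlMigrationFullClusterRestartIT also check node assignment and differ in detail. Once testSnapshotRestore has replaced the persistent tasks, the job is reported as closed and the assertion can never succeed:

```java
// Assumed imports: org.elasticsearch.client.Request, org.elasticsearch.client.Response,
// java.util.List, java.util.Map, java.util.concurrent.TimeUnit.
private void waitForJobToBeOpened(String jobId) throws Exception {
    assertBusy(() -> {
        Response response = client().performRequest(
                new Request("GET", "/_xpack/ml/anomaly_detectors/" + jobId + "/_stats"));
        Map<String, Object> stats = entityAsMap(response);
        List<?> jobs = (List<?>) stats.get("jobs");
        Map<?, ?> job = (Map<?, ?>) jobs.get(0);
        // If another test wiped the job's persistent task, the job reports "closed" here
        // and assertBusy eventually fails with expected:<[open]ed> but was:<[clos]ed>.
        assertEquals("opened", job.get("state"));
    }, 30, TimeUnit.SECONDS);
}
```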

The possible solutions I can think of are:

  1. Make sure FullClusterRestartIT.testSnapshotRestore always runs first in the old cluster and last in the upgraded cluster, so that it does not interfere with the state of persistent tasks that other tests rely on (or at least not during the period when those other tests are making assertions that rely on that state)
  2. Move FullClusterRestartIT.testSnapshotRestore into a separate test suite
  3. Move MlMigrationFullClusterRestartIT.testMigration into a separate test suite *
  4. Exclude FullClusterRestartIT.testSnapshotRestore from the X-Pack full cluster restart tests - it's one of the tests that's already run in the OSS full cluster restart tests and duplicated in X-Pack to test with security (so this would mean testing snapshots only without security)
  5. Split include_global_state into more fine-grained options, so that it's possible to restore just cluster settings, or just index templates, or just persistent tasks

* I don't think this is a good option, as meddling with persistent tasks could be affecting other tests too. For example, #32773 might be caused by the same side effects of FullClusterRestartIT.testSnapshotRestore.

droberts195 commented 5 years ago

For future reference, the log message that alerted me to what was happening was `Can't read metadata from store, will not reuse local files during restore`. I'm not sure whether that message is expected; if it isn't, and that bug gets fixed, then I don't think there would have been any sign of the root cause of the ML migration test failure in the logs...

polyfractal commented 5 years ago

For test purposes, option 2 seems the easiest/safest. Since we know it can drastically alter other tests, keeping it isolated seems best. Options 1 and 4 would both work but seem more fragile?

Option 5 would probably be a nice feature to have in general; I could see people wanting to selectively restore different parts of the cluster state. But it would be a bigger-impact change, and it wouldn't help retroactively.

jkakavas commented 5 years ago

~@droberts195 testMappingsUpgrade now fails reproducibly for me on master for a number of seeds.~

~Would it be worth muting this? I am unsure how other people's PR runs are passing.~

Apologies, this was a red herring. The real culprit was unrelated to this issue:

  > Warnings: [[types removal] The response format of get index template requests will change in the next major version. Please start using the `include_type_name` parameter set to `false` in the request to move to the new, typeless response format that will be the default in 7.0.]

which causes all tests that extend AbstractUpgradeTestCase to fail.

elasticsearchmachine commented 11 months ago

Pinging @elastic/ml-core (Team:ML)