tvernum opened 5 years ago
Pinging @elastic/ml-core
Muted in 06b175d
I have this one failing on 6.6 and a few times on 6.x (build stats: `&_a=(columns:!(_source),filters:!(),index:b646ed00-7efc-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:MlMigrationFullClusterRestartIT),sort:!(process.time-start,desc)))`).
I wonder if we should backport this change?
./gradlew :x-pack:qa:full-cluster-restart:with-system-key:v6.6.0#upgradedClusterTestRunner \
-Dtests.seed=D791D0C47DC7EAF4 \
-Dtests.class=org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT \
-Dtests.method="testMigration" \
-Dtests.security.manager=true \
-Dtests.locale=mt \
-Dtests.timezone=America/Nipigon \
-Dcompiler.java=11 \
-Druntime.java=8
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+bwc-tests/125/console
I backported #37077 to 6.6 to fix the reproducible failure that occurs when testing upgrade from 6.6.0 to 6.6.1.
The 6.x failure must be something different as that already had #37077.
One of the 6.x failures is https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/376/consoleText
The error is:
> Throwable #1: org.junit.ComparisonFailure: expected:<[open]ed> but was:<[clos]ed>
> at __randomizedtesting.SeedInfo.seed([3C4D9B41A53DB36A:FFF19BBDEA59A909]:0)
> at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.lambda$waitForJobToBeAssigned$0(MlMigrationFullClusterRestartIT.java:181)
> at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:848)
> at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.waitForJobToBeAssigned(MlMigrationFullClusterRestartIT.java:172)
> at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.upgradedClusterTests(MlMigrationFullClusterRestartIT.java:146)
> at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.testMigration(MlMigrationFullClusterRestartIT.java:87)
> at java.lang.Thread.run(Thread.java:748)
The root cause seems to be a race. From `v5.6.10#upgradedClusterTestCluster node0/elasticsearch-6.7.0-SNAPSHOT/logs/full-cluster-restart.log`:
[2019-01-30T22:06:25,714][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [migration-old-cluster-open-job] [autodetect/21203] [CResourceMonitor.cc@68] Setting model memory limit to 4096 MB
...
[2019-01-30T22:06:27,641][INFO ][o.e.x.m.j.p.a.ProcessContext] [node-0] Killing job [migration-old-cluster-open-job], because [task has been removed, cancelling locally]
[2019-01-30T22:06:27,649][WARN ][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [node-0] [migration-old-cluster-open-job] some results not processed due to the process being killed
[2019-01-30T22:06:27,653][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [controller/20292] [CDetachedProcessSpawner.cc@177] Child process with PID 21203 was terminated by signal 15
[2019-01-30T22:06:27,653][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [node-0] Received ban for the parent [lCoBrFUwS42T5jSfuPOm4A:890] on the node [lCoBrFUwS42T5jSfuPOm4A], reason: [task has been removed, cancelling locally]
...
[2019-01-30T22:06:29,935][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [migration-old-cluster-open-job] [autodetect/21319] [CResourceMonitor.cc@68] Setting model memory limit to 4096 MB
...
[2019-01-30T22:06:30,137][ERROR][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Could not set job state to [opened] for job [migration-old-cluster-open-job]
org.elasticsearch.ResourceNotFoundException: the task with id job-migration-old-cluster-open-job and allocation id 7 doesn't exist
at org.elasticsearch.persistent.PersistentTasksClusterService$4.execute(PersistentTasksClusterService.java:235) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:642) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:271) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:201) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:136) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Or in plain English:
1. The node sees it has been assigned the persistent task but has no autodetect process for `migration-old-cluster-open-job` running on it, so starts a process for `migration-old-cluster-open-job`.
2. The node sees that either there is no longer a persistent task for `migration-old-cluster-open-job` or the persistent task is not assigned to this node, so kills the process started in 1.
3. The node sees it has again been assigned the persistent task but has no process for `migration-old-cluster-open-job` running on it, so starts another process for `migration-old-cluster-open-job`.
I worry that this is a real bug that would affect customers on 6.7. Immediately before the first attempt at starting the job we have:
[2019-01-30T22:06:24,641][INFO ][o.e.x.m.MlConfigMigrator ] [node-0] ml job configurations migrated: [migration-old-cluster-closed-job, migration-old-cluster-open-job]
[2019-01-30T22:06:24,641][INFO ][o.e.x.m.MlConfigMigrator ] [node-0] ml datafeed configurations migrated: [migration-old-cluster-started-datafeed, migration-old-cluster-stopped-datafeed]
[2019-01-30T22:06:24,695][WARN ][o.e.x.w.WatcherService ] [node-0] not starting watcher, upgrade API run required: .watches[false], .triggered_watches[false]
[2019-01-30T22:06:24,775][WARN ][o.e.x.w.WatcherService ] [node-0] not starting watcher, upgrade API run required: .watches[false], .triggered_watches[false]
[2019-01-30T22:06:24,831][WARN ][o.e.x.w.WatcherService ] [node-0] not starting watcher, upgrade API run required: .watches[false], .triggered_watches[false]
[2019-01-30T22:06:24,833][WARN ][o.e.x.m.MlAssignmentNotifier] [node-0] [migration-old-cluster-open-job] No node found to start datafeed [migration-old-cluster-started-datafeed]. Reasons []
[2019-01-30T22:06:24,878][WARN ][o.e.x.w.WatcherService ] [node-0] not starting watcher, upgrade API run required: .watches[false], .triggered_watches[false]
[2019-01-30T22:06:24,895][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Opening job [migration-old-cluster-open-job]
We rewrite the persistent tasks in that migration step, and there could be a race there.
Thankfully the step of migrating configs for open jobs is new in 6.7, so not released yet. We need to scrutinise in detail the code that rewrites the persistent tasks when migrating the open job configs before 6.7 is released.
The race is probably not that likely, as the BWC tests had iterated from 5.6.0 to 5.6.10 before hitting this problem, and there's no reason to think it's unique to upgrading from 5.6.10. So that suggests something like a 9% chance of the race occurring (1 in 11).
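When digging into races like this, it helps to dump the persistent tasks straight from the cluster state while the test runs. A small sketch, assuming the low-level REST client and the 6.x layout where the tasks sit under `metadata.persistent_tasks` (host and port are placeholders):

```java
// Debugging aid, not test code: dump the persistent tasks from the cluster state.
// Assumptions: low-level REST client on the classpath, a local node on :9200, and
// the 6.x response layout where tasks appear under metadata.persistent_tasks.
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class PersistentTasksDump {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request req = new Request("GET", "/_cluster/state/metadata");
            // filter_path keeps only the persistent task entries (task id, allocation id,
            // assignment), which is exactly what config migration rewrites.
            req.addParameter("filter_path", "metadata.persistent_tasks");
            Response resp = client.performRequest(req);
            System.out.println(EntityUtils.toString(resp.getEntity()));
        }
    }
}
```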
@droberts195 I see several others of these over the weekend on 6.x and master as well, e.g.
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/394/console
org.junit.ComparisonFailure: expected:<[open]ed> but was:<[clos]ed>
at __randomizedtesting.SeedInfo.seed([37F2A222C9FBA52A:F44EA2DE869FBF49]:0)
at org.junit.Assert.assertEquals(Assert.java:115)
at org.junit.Assert.assertEquals(Assert.java:144)
at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.lambda$waitForJobToBeAssigned$0(MlMigrationFullClusterRestartIT.java:181)
at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:848)
at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.waitForJobToBeAssigned(MlMigrationFullClusterRestartIT.java:172)
and also:
- https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1796/console
- https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/389/console
- https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/388/console
- https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=sles-12/216/console
Should we mute this or are you close to a fix?
@cbuescher please can you mute it in master and 6.x? I think it's going to be at least a day before I can have a fix ready to merge.
Muted on master and 6.x.
I found the problem, and thankfully it is not in the code that we expect production users will run. The problem is that `FullClusterRestartIT.testSnapshotRestore` restores a snapshot with `include_global_state=true`. Depending on the order the tests run in, this can wipe out the persistent tasks in the cluster state, or cause them to be different to the ones the `PersistentTasksNodeService` has recorded as currently assigned to a node, thus causing a reassignment.
(The fact that restoring global cluster state resets persistent tasks to whatever state they were in at the time the snapshot was taken has been the case ever since persistent tasks were introduced. So this is an effect that production users can experience, but ML config migration doesn't change it one way or the other.)
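For context, the restore in question boils down to a request like the following. This is a minimal sketch using the low-level REST client, not the actual `FullClusterRestartIT` code; the repository and snapshot names are placeholders:

```java
// Minimal sketch of a snapshot restore that brings the global cluster state back,
// using the low-level REST client. Repository and snapshot names are placeholders.
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class RestoreGlobalStateSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request restore = new Request("POST", "/_snapshot/test_repo/test_snapshot/_restore");
            restore.addParameter("wait_for_completion", "true");
            // include_global_state=true restores templates, cluster settings AND the
            // persistent tasks exactly as they were when the snapshot was taken,
            // replacing whatever persistent tasks the running cluster currently has.
            restore.setJsonEntity("{\"include_global_state\": true}");
            Response response = client.performRequest(restore);
            System.out.println(response.getStatusLine());
        }
    }
}
```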
The tests don't run in parallel, so you may think that they should not interact in this way. But `MlMigrationFullClusterRestartIT.testMigration` is relying on behaviour that is performed automatically on any cluster state update. So the assertions that `MlMigrationFullClusterRestartIT.testMigration` makes are not looking for the result of actions taken by the test itself, but looking for effects of the cluster starting up on the newer version and doing anything that results in a cluster state update.
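Concretely, the failing check is a polling assertion roughly along these lines. This is a rough sketch assuming the 6.x ML endpoints and the `ESRestTestCase` helpers, not the actual test source:

```java
// Rough sketch of the waitForJobToBeAssigned-style check, assuming the 6.x ML
// endpoints and ESRestTestCase helpers. It is not the actual test source.
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.test.rest.ESRestTestCase;

public class MlMigrationAssertionSketch extends ESRestTestCase {

    @SuppressWarnings("unchecked")
    protected void waitForJobToBeAssigned(String jobId) throws Exception {
        assertBusy(() -> {
            Response stats = client().performRequest(
                new Request("GET", "/_xpack/ml/anomaly_detectors/" + jobId + "/_stats"));
            Map<String, Object> job =
                (Map<String, Object>) ((List<?>) entityAsMap(stats).get("jobs")).get(0);
            // This is the assertion that fails with expected:<[open]ed> but was:<[clos]ed>
            // when the restored global state removed the persistent task and the job closed.
            assertEquals("opened", job.get("state"));
            assertNotNull("job should be assigned to a node", job.get("node"));
        }, 30, TimeUnit.SECONDS);
    }
}
```

Nothing in that loop triggers the assignment it waits for; it only observes whatever the latest cluster state update produced, which is why an unrelated restore of global state can flip the result.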
The possible solutions I can think of are:

1. Ensure `FullClusterRestartIT.testSnapshotRestore` always runs first in the old cluster and last in the upgraded cluster, so that it does not interfere with the state of persistent tasks that other tests rely on (or at least not during the period when those other tests are making assertions that rely on that state)
2. Move `FullClusterRestartIT.testSnapshotRestore` into a separate test suite
3. Move `MlMigrationFullClusterRestartIT.testMigration` into a separate test suite \*
4. Remove `FullClusterRestartIT.testSnapshotRestore` from the X-Pack full cluster restart tests - it's one of the tests that's already run in the OSS full cluster restart tests and duplicated in X-Pack to test with security (so this would mean testing snapshots only without security)
5. Split `include_global_state` into more fine-grained options, so that it's possible to restore just cluster settings, or just index templates, or just persistent tasks

\* I don't think this is a good option, as meddling with persistent tasks could be affecting other tests too. For example, #32773 might be caused by the same side effects of `FullClusterRestartIT.testSnapshotRestore`.
For future reference, the log message that alerted me to what was happening was `Can't read metadata from store, will not reuse local files during restore` - I'm not sure if this is expected, and if it's not and that bug is fixed then I don't think there would have been any sign of the root cause of the ML migration test failure in the logs...
For test purposes, option 2 seems the easiest/safest. Since we know `FullClusterRestartIT.testSnapshotRestore` could drastically alter other tests, keeping it isolated seems best. Options 1 and 4 would both work but seem more fragile?
Option 5 would probably be a nice feature to have in general; I could see people wanting to selectively restore different parts of the cluster state. But it's probably a bigger-impact change, and it wouldn't help retroactively.
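To make that concrete, here is a purely hypothetical sketch of what finer-grained restore flags might look like; none of the `restore_*` fields below exist in the real snapshot restore API, they are invented for illustration only:

```java
// Purely hypothetical: none of the restore_* fields below exist in the real
// restore API. They only illustrate the shape a fine-grained replacement for
// include_global_state could take, letting tests restore templates without
// touching persistent tasks.
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class FineGrainedRestoreIdea {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request restore = new Request("POST", "/_snapshot/test_repo/test_snapshot/_restore");
            restore.setJsonEntity(
                "{"
                + "\"include_global_state\": false,"       // keep the coarse flag off
                + "\"restore_cluster_settings\": true,"    // hypothetical fine-grained flags
                + "\"restore_index_templates\": true,"
                + "\"restore_persistent_tasks\": false"
                + "}");
            // A real cluster would reject the unknown fields; this only shows the idea.
            client.performRequest(restore);
        }
    }
}
```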
~@droberts195 `testMappingsUpgrade` now fails reproducibly for me on master for a number of seeds.~

~Would it be worth muting this? I am unsure how other people's PR runs are passing.~
Apologies, this was a red herring. The real culprit was unrelated to this issue:
> Warnings: [[types removal] The response format of get index template requests will change in the next major version. Please start using the `include_type_name` parameter set to `false` in the request to move to the new, typeless response format that will be the default in 7.0.]
which causes all tests that implement `AbstractUpgradeTestCase` to fail.
Pinging @elastic/ml-core (Team:ML)
This issue has been closed because it has been open for too long with no activity.
Any muted tests that were associated with this issue have been unmuted.
If the tests begin failing again, a new issue will be opened, and they may be muted again.
This issue is getting re-opened because there are still AwaitsFix mutes for the given test. It will likely be closed again in the future.
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/867/console
Reproduces for me with a variety of seeds.
I'm going to mute the test.