imotov closed this issue 4 years ago.
Pinging @elastic/ml-core (:ml)
@williamrandolph ping
Just verified locally that #54816 is causing the log spam. It MIGHT also be causing the intermittent failure.
I have noticed some weird behavior in the past with Fielddata breaker checks and ingest pipelines. Node stats and memory allocations differ slightly with pipelines. I really only noticed this when pipelines were simulated in the test. I am not sure if the basic plugins being enabled are causing this intermittent behavior. My prime suspect is monitoring, but I am not sure.
History:
The `IngestPipeline.Factory` is loaded up on ML nodes. It attempts to parse pipelines out and check if they are inference pipelines. If they are, it keeps a local tally of them. This is so we can have a setting that puts a sane upper limit on these processors.
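The tallying described above can be sketched roughly as follows. This is an illustrative sketch, not the actual Elasticsearch implementation: the class name, method names, and the `MAX_INFERENCE_PROCESSORS` constant are all hypothetical stand-ins for the real factory and setting.

```java
import java.util.List;
import java.util.Map;

public class InferenceProcessorTally {
    // Illustrative ceiling; the real value would come from a cluster setting.
    static final int MAX_INFERENCE_PROCESSORS = 50;

    // Each pipeline is a list of processor definitions; each definition is a
    // single-key map whose key is the processor type (e.g. "inference", "set").
    static int countInferenceProcessors(List<List<Map<String, Object>>> pipelines) {
        int count = 0;
        for (List<Map<String, Object>> processors : pipelines) {
            for (Map<String, Object> processor : processors) {
                if (processor.containsKey("inference")) {
                    count++;
                }
            }
        }
        return count;
    }

    // Reject pipeline creation once the tally exceeds the configured limit.
    static void checkLimit(int currentTally) {
        if (currentTally > MAX_INFERENCE_PROCESSORS) {
            throw new IllegalStateException(
                "Too many inference processors; limit is " + MAX_INFERENCE_PROCESSORS);
        }
    }
}
```

The key point for this issue is that parsing pipelines on ML nodes means the factory touches processor types whose plugins may not be installed there, which is where the log spam comes from.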
I think to fix this, we could enable all these ingest plugins/modules for the ML integration tests. That will get rid of the log spam at least.
@benwtrent Apologies for the log spam. I will try to fix it next week.
> I think to fix this, we could enable all these ingest plugins/modules for the ML integration tests. That will get rid of the log spam at least.
I'm not sure which plugins and modules are not already enabled. Do we need to configure the ML nodes to be ingest nodes?
I am not sure offhand why the `script` processor is loaded, but I think we should make sure it and the other processors referenced by the monitoring plugin are loaded in the ML test cluster.
I don't know how to do this, but I can also take a look next week.
Ah, I just noticed that `xpack.monitoring.elasticsearch.collection.enabled` has a default value of `true`. I could try setting this to `false` for all the tests that formerly had `xpack.monitoring.enabled` set to `false`.
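For the test cluster that would amount to something like the following node setting (a sketch, assuming this setting covers what the old `xpack.monitoring.enabled` flag was covering in these tests):

```yaml
# elasticsearch.yml for the ML integration test nodes
xpack.monitoring.elasticsearch.collection.enabled: false
```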
We have had this "Accounting breaker not reset" error before in integration tests and the cause was that indices were being recreated simultaneously with the cleanup method that deletes all the indices in between tests.
I had a look at the server side logs for one of these failures, https://gradle-enterprise.elastic.co/s/r7ybldwftir2s/tests/zcquf3hc3eoc6-7mybzxk4lp3n2, to see which index it was that was created during cleanup.
In that test run the failing test was `CategorizationIT.testCategorizationWithFilters`, and it finished at 2020-04-19T00:53:15,933:
[2020-04-19T00:53:15,933][INFO ][o.e.x.m.i.CategorizationIT] [testCategorizationWithFilters] after test
Looking at the server side logs for what was going on with indices around that time we see this:
[2020-04-18T19:53:15,704][INFO ][o.e.x.i.IndexLifecycleTransition] [integTest-2] moving index [ilm-history-2-000001] from [{"phase":"new","action":"complete","name":"complete"}] to [{"phase":"hot","action":"unfollow","name":"wait-for-indexing-complete"}] in policy [ilm-history-ilm-policy]
[2020-04-18T19:53:15,819][INFO ][o.e.x.i.IndexLifecycleTransition] [integTest-2] moving index [ilm-history-2-000001] from [{"phase":"hot","action":"unfollow","name":"wait-for-indexing-complete"}] to [{"phase":"hot","action":"unfollow","name":"wait-for-follow-shard-tasks"}] in policy [ilm-history-ilm-policy]
The timezone for those messages is 5 hours behind the client timezone, so in reality they are just about 100-200ms before the test finished. The accounting breaker assertion tripped on node 1, whereas those messages are from node 2, so it's reasonable to think that node 1 gained visibility of the creation of the ILM history index at the time the assertion tripped during the post-test cleanup.
Async creation of internal indices has always been a problem for integration tests that try to remove all indices in between tests. Something similar is happening for templates in the same server-side log file:
[2020-04-18T19:53:14,242][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] removing template [.ml-anomalies-]
[2020-04-18T19:53:14,394][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] adding template [.ml-anomalies-] for index patterns [.ml-anomalies-*]
[2020-04-18T19:53:14,453][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] removing template [.ml-notifications-000001]
[2020-04-18T19:53:14,544][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] adding template [.ml-notifications-000001] for index patterns [.ml-notifications-000001]
[2020-04-18T19:53:14,598][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] removing template [.ml-inference-000002]
[2020-04-18T19:53:14,683][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] adding template [.ml-inference-000002] for index patterns [.ml-inference-000002]
[2020-04-18T19:53:14,736][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] removing template [.ml-state]
[2020-04-18T19:53:14,822][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] adding template [.ml-state] for index patterns [.ml-state*]
[2020-04-18T19:53:14,875][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] removing template [.ml-meta]
[2020-04-18T19:53:14,958][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] adding template [.ml-meta] for index patterns [.ml-meta]
[2020-04-18T19:53:15,014][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] removing template [random_index_template]
[2020-04-18T19:53:15,068][INFO ][o.e.x.i.IndexLifecycleTransition] [integTest-2] moving index [ilm-history-2-000001] from [null] to [{"phase":"new","action":"complete","name":"complete"}] in policy [ilm-history-ilm-policy]
[2020-04-18T19:53:15,122][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] removing template [.ml-stats]
[2020-04-18T19:53:15,227][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] adding template [.ml-stats] for index patterns [.ml-stats-*]
[2020-04-18T19:53:15,288][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] removing template [.ml-config]
[2020-04-18T19:53:15,370][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] adding template [.ml-config] for index patterns [.ml-config]
[2020-04-18T19:53:15,426][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] removing template [.slm-history]
[2020-04-18T19:53:15,511][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] adding template [.slm-history] for index patterns [.slm-history-2*]
[2020-04-18T19:53:15,565][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] removing template [ilm-history]
[2020-04-18T19:53:15,647][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-2] adding template [ilm-history] for index patterns [ilm-history-2*]
In between the tests the post-test cleanup in the client removes all the templates, and then the template auto-creation functionality in the server immediately fights back to recreate the same templates. 🤦
So I would guess that the reason #54816 has aggravated this is that some component that uses ILM is now enabled by default (it's not ILM itself as that was always enabled by default). I cannot work out what, as monitoring still seems to be disabled in the ML native multi-node tests, unless something overrides https://github.com/elastic/elasticsearch/blob/92c8a73348055e5784fc82a1c5e1d346f3167b1a/x-pack/plugin/ml/qa/native-multi-node-tests/src/test/java/org/elasticsearch/xpack/ml/integration/MlNativeIntegTestCase.java#L135
But when monitoring eventually is enabled, that will make the problem of test cleanup fighting async actions even worse. Therefore I think we need a new approach to cleanup in between the ML native multi-node integration tests: we should only wipe ML indices and data indices. We should leave the internal indices of other ES functionality like ILM, monitoring and security alone between tests. Also, there's no need for us to remove the ML index templates between the native multi-node tests, because we're not testing upgrade, so the templates will be the same for all the tests.
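The proposed selective cleanup could look something like the sketch below: decide per index whether it belongs to the test (ML or user data) or to other ES functionality, and only wipe the former. The class, method names, and the exact list of preserved prefixes are illustrative assumptions, not the actual test-framework code.

```java
import java.util.List;
import java.util.stream.Collectors;

public class SelectiveCleanup {
    // Internal indices of other ES functionality that we deliberately leave
    // alone between tests (illustrative list: ILM/SLM history, monitoring,
    // security).
    private static final List<String> PRESERVED_PREFIXES =
        List.of("ilm-history-", ".slm-history-", ".monitoring-", ".security");

    // An index is wiped unless it belongs to a preserved subsystem.
    static boolean shouldDelete(String indexName) {
        return PRESERVED_PREFIXES.stream().noneMatch(indexName::startsWith);
    }

    // Filter the full index list down to the ones the post-test cleanup may delete.
    static List<String> indicesToWipe(List<String> allIndices) {
        return allIndices.stream()
            .filter(SelectiveCleanup::shouldDelete)
            .collect(Collectors.toList());
    }
}
```

With this approach the cleanup no longer races against ILM, monitoring, or security recreating their own indices, which is exactly the race implicated in the accounting-breaker trips above.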
Unfortunately this change to cleanup logic will need to be done in two places, because the native multi-node tests are a mixture of REST tests and Java client tests - see #49582. So the (completely different) cleanup code used for both these types of ML native multi-node tests will need to be changed to be consistent 😬.
The problem with the script plugin not being available generating lots of log spam since #54816 still needs to be fixed, but that is separate to the problem of cleanup in between tests.
another one: https://gradle-enterprise.elastic.co/s/xhmbqjq2psh5c
(This time `ScheduledEventsIT.testScheduledEventWithInterimResults`.)
I've opened a PR that I hope will at least get rid of the log spam in the ML multinode tests: https://github.com/elastic/elasticsearch/pull/55461
Related build failure: https://gradle-enterprise.elastic.co/s/cyx2laofbzs72/tests/zcquf3hc3eoc6-iu7t2hi7meoko
Hopefully #55459 will fix this. Given how frequently it was failing before, we should know within half a day or so.
I haven't seen any failures of this type since #55459 passed the intake build, so it looks like it's fixed.
A couple of latest examples:
The error looks like this:
One common part is that the logs are filled with the following errors:
@benwtrent pointed out that it could be caused by #54816, but we still need to verify that. These messages might also be a red herring.