elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[CI] ForecastIT.testOverflowToDisk failing on EAR builds #58806

Closed ywelsch closed 1 year ago

ywelsch commented 4 years ago

This test has failed three times in the past 2 weeks

https://gradle-enterprise.elastic.co/s/mp3vxqny35j4e

Output:

org.elasticsearch.xpack.ml.integration.ForecastIT > testOverflowToDisk FAILED
    org.elasticsearch.ElasticsearchStatusException: Test likely fails due to insufficient disk space on test machine, please free up space.
        at __randomizedtesting.SeedInfo.seed([8263E7B83540FE6E:CDB359E25C3EF7DE]:0)
        at org.elasticsearch.xpack.ml.integration.ForecastIT.testOverflowToDisk(ForecastIT.java:244)

        Caused by:
        org.elasticsearch.ElasticsearchStatusException: Cannot run forecast: Forecast cannot be executed as models exceed internal memory limit and available disk space is insufficient Minimum disk space required: [200mb]
            at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.badRequestException(ExceptionsHelper.java:75)
            at org.elasticsearch.xpack.ml.action.TransportForecastJobAction.lambda$getForecastRequestStats$2(TransportForecastJobAction.java:125)
            at org.elasticsearch.xpack.ml.job.persistence.JobResultsProvider.lambda$getForecastRequestStats$40(JobResultsProvider.java:1368)
            at org.elasticsearch.xpack.ml.job.persistence.JobResultsProvider.lambda$searchSingleResult$33(JobResultsProvider.java:1165)
            at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)
            at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)
            at org.elasticsearch.client.node.NodeClient.lambda$executeLocally$0(NodeClient.java:91)
            at org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:175)
            at org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:169)
            at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)
            at org.elasticsearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:545)
            at org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:117)
            at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:350)
            at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:344)
            at org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:231)
            at org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$1(FetchSearchPhase.java:119)
            at org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:125)
            at org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:95)
            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
            at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:706)
            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
            at java.lang.Thread.run(Thread.java:834)

I'm not too familiar with this part of the system, but I wonder whether we truly have less than 200mb available on disk when running the test (in that case we could possibly raise it as an infrastructure issue), or whether we should skip the test when these conditions are not met.

elasticmachine commented 4 years ago

Pinging @elastic/ml-core (:ml)

pgomulka commented 4 years ago

This happened again on master: https://gradle-enterprise.elastic.co/s/2zk4lmgz2gy6i

It looks like during that run there was indeed less than 200mb available: https://infra-stats.elastic.co/app/kibana#/discover?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:'2020-07-03T07:00:00.000Z',to:'2020-07-03T07:30:00.000Z'))&_a=(columns:!(_source),index:'metricbeat-*',interval:auto,query:(language:kuery,query:elasticsearch-ci-immutable-u-1804-pkg-large-1593757207748855610),sort:!('@timestamp',desc))

pgomulka commented 4 years ago

@hendrikmuhs do you think we could add more debug information for this test? How much disk space was available, which partition was being checked, etc.?

droberts195 commented 4 years ago

All the failures are on "EAR" builds: encryption at rest.

This may mean that disk space is not the real problem, but rather that something related to the encryption is being misreported as lack of disk space.

If you look at the C++ code that detects lack of disk space it returns false (on line 495) if it cannot obtain the amount of free disk space: https://github.com/elastic/ml-cpp/blob/edafba5165656ef617074e332268d7bdaba23922/lib/api/CForecastRunner.cc#L488-L499

This will get reported to the Java in the same way as actual lack of disk space, hence the misleading error message.

If there's a problem with the Boost.Filesystem library and the way EAR is implemented, then we need to add a known issue stating that ML forecasting cannot be used with EAR.

droberts195 commented 4 years ago

I looked for "Failed to retrieve disk information for" in the server side logs of https://gradle-enterprise.elastic.co/s/2zk4lmgz2gy6i (obtained from https://storage.cloud.google.com/elasticsearch-ci-artifacts/jobs/elastic+elasticsearch+master+EAR/build/75.tar.bz2) and didn't see it. So either forecasting is looking for space on a partition that has very little available, or else the Boost.Filesystem library "succeeds" in getting free disk space but gets an incorrect amount.

I opened elastic/ml-cpp#1380 to add more logging to find out exactly which partition we're checking and how much space Boost.Filesystem thinks it has.

droberts195 commented 4 years ago

This finally failed again in https://gradle-enterprise.elastic.co/s/l3x63bdggnqw6 and the new piece of debug reveals this:

[2020-07-15T19:25:50,227][WARN ][o.e.x.m.p.l.CppLogMessageHandler] [integTest-0] [forecast-it-test-overflow-to-disk] [autodetect/347069] [CForecastRunner.cc@499] Checked disk space for "/mnt/secret/elastic+elasticsearch+master+EAR/x-pack/plugin/ml/qa/native-multi-node-tests/build/testclusters/integTest-0/data/ml-local-data/tmp/job-forecast-it-test-overflow-to-disk" - required: 4294967296, available: 3305226240
[2020-07-15T19:25:50,227][ERROR][o.e.x.m.p.l.CppLogMessageHandler] [integTest-0] [forecast-it-test-overflow-to-disk] [autodetect/347069] [CForecastRunner.cc@518] Forecast cannot be executed as models exceed internal memory limit and available disk space is insufficient

4294967296 is exactly 2^32, i.e. 4GB.

So something funny has happened when calculating the space required...

The full server side logs can be downloaded from https://storage.cloud.google.com/elasticsearch-ci-artifacts/jobs/elastic+elasticsearch+master+EAR/build/100.tar.bz2

droberts195 commented 4 years ago

We found the problem. There is a limit of 4GB in the C++ code that dates back to when the Java limit was hardcoded to 5GB.

Now that the Java limit is configurable via a setting, we need to pass it through to the C++ in the forecast request so that the C++ can do a check that makes sense given what was configured. #57254 and elastic/ml-cpp#1238 are examples of how to add an extra piece of information to the forecast request. So we need a similar change that passes the value of xpack.ml.min_disk_space_off_heap at the time the forecast is requested through in the forecast request. For consistency with the original ratio, the C++ should then check that 80% of that disk space is available instead of its hardcoded 4GB limit.

cbuescher commented 4 years ago

Just fyi: another one today https://gradle-enterprise.elastic.co/s/qknibt7pplgzi

probakowski commented 4 years ago

And another one https://gradle-enterprise.elastic.co/s/5tbhtrr6d4b44

droberts195 commented 4 years ago

#60040 is an attempt to mute this test in all EAR builds but leave it running in non-EAR builds.

droberts195 commented 3 years ago

This has started happening in non-EAR builds occasionally now, for example https://gradle-enterprise.elastic.co/s/2rqpqaxoad4cm

Presumably something else is using up more space on the RAM disk that is used for the ES data directory in CI, and as a result there isn't always 4GB free as was expected.

I opened elastic/ml-cpp#1554 to move along the idea outlined in https://github.com/elastic/elasticsearch/issues/58806#issuecomment-659393024.

droberts195 commented 3 years ago

After making the changes of #64720 this test has not failed on master.

However, on 7.x it's now consistently failing in EAR builds due to the error from this bit of the C++ code:

        if (totalMemoryUsage >= forecastJob.s_MaxForecastModelMemory &&
            forecastJob.s_TemporaryFolder.empty()) {
            this->sendErrorMessage(
                forecastJob, "Forecast cannot be executed as forecast memory usage is predicted to exceed " +
                                 std::to_string(forecastJob.s_MaxForecastModelMemory) +
                                 " bytes while disk space is exceeded");
            return false;
        }

The message there is highly misleading, because it's not that disk space was exceeded but that no directory for overflow to disk was specified.

So something is happening in EAR builds on 7.x that is stopping an overflow directory being specified correctly.

The answer to why this isn't also happening on master is that for the last 9 days every single EAR build on master has failed due to lack of disk space before this test gets run - I opened #64763 for this.

droberts195 commented 3 years ago

So something is happening in EAR builds on 7.x that is stopping an overflow directory being specified correctly.

It turns out this is most likely because the Java side of forecasting decided there wasn't enough disk space. I have opened #64766 which will provide extra logging to confirm this next time the test fails.

danielmitterdorfer commented 3 years ago

This has failed again in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+EAR/336/consoleFull (build scan). We can see now the additional log lines added in https://github.com/elastic/elasticsearch/pull/64766:

03:03:16 » ERROR][o.e.x.m.p.l.CppLogMessageHandler] [javaRestTest-2] [categorization-stop-on-warn] [autodetect/525518] [CAnomalyJob.cc@1457] Cannot report memory usage because last finalized bucket end time (0) is smaller than bucket span (3600)
03:03:16 » WARN ][o.e.x.m.p.NativeStorageProvider] [javaRestTest-2] Failed to find native storage for [job-forecast-it-test-memory-status], returning null
03:03:16 » ERROR][o.e.x.m.p.l.CppLogMessageHandler] [javaRestTest-2] [forecast-it-test-memory-status] [autodetect/525623] [CForecastRunner.cc@487] Forecast cannot be executed as model memory status is not OK
03:03:16 » WARN ][o.e.x.m.p.NativeStorageProvider] [javaRestTest-2] Failed to find native storage for [job-forecast-it-test-overflow-to-disk], returning null
03:03:16 » ERROR][o.e.x.m.p.l.CppLogMessageHandler] [javaRestTest-2] [forecast-it-test-overflow-to-disk] [autodetect/525642] [CForecastRunner.cc@487] Forecast cannot be executed as forecast memory usage is predicted to exceed 20971520 bytes while disk space is exceeded
03:03:16 » WARN ][o.e.x.m.p.NativeStorageProvider] [javaRestTest-2] Failed to find native storage for [job-forecast-it-test-delete], returning null
03:03:16 »   ↑ repeated 4 times ↑
03:03:16 » WARN ][o.e.x.m.p.NativeStorageProvider] [javaRestTest-2] Failed to find native storage for [job-forecast-it-test-failed-on-force-stop], returning null
03:03:16 » WARN ][o.e.x.m.p.NativeStorageProvider] [javaRestTest-2] Failed to find native storage for [job-forecast-it-test-single-series], returning null
03:03:16 »   ↑ repeated 4 times ↑
03:03:16 » WARN ][o.e.x.m.p.NativeStorageProvider] [javaRestTest-2] Failed to find native storage for [job-forecast-it-test-delete-wildcard], returning null
03:03:16 »   ↑ repeated 4 times ↑
03:03:16 » WARN ][o.e.x.m.p.NativeStorageProvider] [javaRestTest-2] Failed to find native storage for [job-forecast-it-test-no-data], returning null
03:03:16 » ERROR][o.e.x.m.p.l.CppLogMessageHandler] [javaRestTest-2] [forecast-it-test-no-data] [autodetect/526484] [CForecastRunner.cc@487] Forecast cannot be executed as job requires data to have been processed and modeled

@droberts195 can you please have a look?

droberts195 commented 3 years ago

Thanks for pasting the extra log messages.

I re-muted this test on 7.x in 7bc77837d799406cfb0285b969076640eda7d870

The problem is that the machines that run the EAR builds don't make enough disk space available in the partitions that CI needs it in, i.e. the problem of #64763. On master the EAR builds fail every time due to lack of disk space before reaching this test. On 7.x they often get as far as this test, which then causes noise (when it's unmuted) rather than the silent failures of master. But in both cases the fundamental problem is that there isn't enough disk space to run the ES test suites with encryption at rest.

martijnvg commented 3 years ago

This test also failed today on 7.x: https://gradle-enterprise.elastic.co/s/c2ccv6hst7hhs/tests/:x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest/org.elasticsearch.xpack.ml.integration.ForecastIT/testOverflowToDisk#1

But the error is now different from the one reported here:


java.lang.AssertionError:
    Expected: one of {<finished>}
         but: was <scheduled>

        at __randomizedtesting.SeedInfo.seed([6FE70EAB878C0D9B:2037B0F1EEF2042B]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.ml.integration.MlNativeAutodetectIntegTestCase.lambda$waitForecastStatus$1(MlNativeAutodetectIntegTestCase.java:301)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1016)
        at org.elasticsearch.xpack.ml.integration.MlNativeAutodetectIntegTestCase.waitForecastStatus(MlNativeAutodetectIntegTestCase.java:298)
        at org.elasticsearch.xpack.ml.integration.MlNativeAutodetectIntegTestCase.waitForecastToFinish(MlNativeAutodetectIntegTestCase.java:292)

Does this different value need a new issue?

droberts195 commented 3 years ago

Does this different value need a new issue?

It's not the same as the original one. The original problem was that some workers have very little disk space, and that was making the test that uses disk space fail.

This new one is that a forecast took more than 1 minute to complete when we expect it to complete within 1 minute. The reason for that will be that 16 gradle tasks were running at the same time, several of which were heavyweight integration test suites - see https://gradle-enterprise.elastic.co/s/c2ccv6hst7hhs/timeline. Maybe because these tests were also using the FIPS JVM they were even slower than usual. But it's a different problem to the disk space issue.

droberts195 commented 3 years ago

The issue for this test failing in the FIPS JVM due to slowness is #63793. Let's keep this one for the problem of not enough disk space.

droberts195 commented 1 year ago

I'm going to close this because the change made in https://github.com/elastic/elasticsearch/commit/7bc77837d799406cfb0285b969076640eda7d870 to not run forecasting overflow to disk on EAR builds is as good a fix as this is going to get.