Closed: ywelsch closed this issue 1 year ago
Pinging @elastic/ml-core (:ml)
Happened again on master: https://gradle-enterprise.elastic.co/s/2zk4lmgz2gy6i
It looks like during that run there was indeed less than 200MB available: https://infra-stats.elastic.co/app/kibana#/discover?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:'2020-07-03T07:00:00.000Z',to:'2020-07-03T07:30:00.000Z'))&_a=(columns:!(_source),index:'metricbeat-*',interval:auto,query:(language:kuery,query:elasticsearch-ci-immutable-u-1804-pkg-large-1593757207748855610),sort:!('@timestamp',desc))
@hendrikmuhs do you think we could add more debug information for this test? How much disk space was available, which partition was it checking, etc.?
All the failures are on "EAR" builds: encryption at rest.
It may mean that disk space is not the problem, but rather that something related to the encryption is being misreported as a lack of disk space.
If you look at the C++ code that detects lack of disk space, it returns false (on line 495) if it cannot obtain the amount of free disk space: https://github.com/elastic/ml-cpp/blob/edafba5165656ef617074e332268d7bdaba23922/lib/api/CForecastRunner.cc#L488-L499
This gets reported to the Java side in the same way as an actual lack of disk space, hence the misleading error message.
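To make the conflation concrete, here is a minimal sketch (my own illustration, not the actual CForecastRunner code, and the function name is made up) of a check built on boost::filesystem::space() that returns false both when the query itself fails and when there is genuinely too little space:

#include <boost/filesystem.hpp>

#include <cstdint>
#include <iostream>
#include <string>

// Hypothetical helper, not the real ml-cpp code: the caller cannot tell a
// failed space query apart from a genuine lack of space, because both paths
// return false.
bool sufficientDiskSpaceForForecast(const std::string& path, std::uintmax_t required) {
    boost::system::error_code errorCode;
    boost::filesystem::space_info info = boost::filesystem::space(path, errorCode);
    if (errorCode) {
        std::cerr << "Failed to retrieve disk information for " << path << ": "
                  << errorCode.message() << std::endl;
        return false; // looks identical to "not enough space" to the caller
    }
    return info.available >= required;
}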
If there's a problem with the Boost.FileSystem library and the way EAR is implemented then we need to add a known issue that ML forecasting cannot be used if you are using EAR.
I looked for "Failed to retrieve disk information for" in the server-side logs of https://gradle-enterprise.elastic.co/s/2zk4lmgz2gy6i (obtained from https://storage.cloud.google.com/elasticsearch-ci-artifacts/jobs/elastic+elasticsearch+master+EAR/build/75.tar.bz2) and didn't see it. So either forecasting is looking for space on a partition that has very little available, or the Boost.FileSystem library "succeeds" in getting the free disk space but gets an incorrect amount.
I opened elastic/ml-cpp#1380 to add more logging to find out exactly which partition we're checking and how much space Boost.FileSystem thinks it has.
This finally failed again in https://gradle-enterprise.elastic.co/s/l3x63bdggnqw6 and the new debug logging reveals this:
[2020-07-15T19:25:50,227][WARN ][o.e.x.m.p.l.CppLogMessageHandler] [integTest-0] [forecast-it-test-overflow-to-disk] [autodetect/347069] [CForecastRunner.cc@499] Checked disk space for "/mnt/secret/elastic+elasticsearch+master+EAR/x-pack/plugin/ml/qa/native-multi-node-tests/build/testclusters/integTest-0/data/ml-local-data/tmp/job-forecast-it-test-overflow-to-disk" - required: 4294967296, available: 3305226240
[2020-07-15T19:25:50,227][ERROR][o.e.x.m.p.l.CppLogMessageHandler] [integTest-0] [forecast-it-test-overflow-to-disk] [autodetect/347069] [CForecastRunner.cc@518] Forecast cannot be executed as models exceed internal memory limit and available disk space is insufficient
4294967296 is exactly 2^32
So something funny has happened when calculating the space required...
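For reference, my own arithmetic (not taken from the logs): that "required" figure is exactly 4GiB expressed in bytes.

#include <cstdint>
#include <iostream>

int main() {
    // 4GiB in bytes: 4 * 1024 * 1024 * 1024 == 2^32 == 4294967296, exactly
    // the "required" value in the log line above.
    constexpr std::uint64_t fourGiB = 4ULL * 1024 * 1024 * 1024;
    std::cout << fourGiB << '\n'; // prints 4294967296
    return 0;
}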
The full server side logs can be downloaded from https://storage.cloud.google.com/elasticsearch-ci-artifacts/jobs/elastic+elasticsearch+master+EAR/build/100.tar.bz2
We found the problem. There is a limit of 4GB in the C++ code that dates back to when the Java limit was hardcoded to 5GB.
Now that the Java limit is configurable via a setting, we need to pass it through to the C++ in the forecast request so that the C++ can do a check that makes sense given what was configured. #57254 and elastic/ml-cpp#1238 are an example of how to add an extra piece of information to the forecast request. So we need a similar change that passes through the value of xpack.ml.min_disk_space_off_heap at the time the forecast is requested. For consistency with the original ratio, the C++ should then check that 80% of that disk space is available instead of applying its hardcoded 4GB limit.
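A rough sketch of what that check could look like, assuming the configured xpack.ml.min_disk_space_off_heap value arrives in the forecast request (the parameter name minDiskSpaceOffHeap below is my own, not the actual field added in ml-cpp):

#include <boost/filesystem.hpp>

#include <cstdint>
#include <string>

// Sketch only, not the actual ml-cpp change: require 80% of the configured
// xpack.ml.min_disk_space_off_heap value to be available in the overflow
// directory, instead of the old hardcoded 4GB limit.
bool enoughDiskForForecastOverflow(const std::string& temporaryFolder,
                                   std::uintmax_t minDiskSpaceOffHeap) {
    boost::system::error_code errorCode;
    boost::filesystem::space_info info =
        boost::filesystem::space(temporaryFolder, errorCode);
    if (errorCode) {
        return false;
    }
    // Mirror the original 4GB-out-of-5GB ratio by requiring 80% of the setting.
    auto required = static_cast<std::uintmax_t>(
        0.8 * static_cast<double>(minDiskSpaceOffHeap));
    return info.available >= required;
}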
Just fyi: another one today https://gradle-enterprise.elastic.co/s/qknibt7pplgzi
And another one https://gradle-enterprise.elastic.co/s/5tbhtrr6d4b44
This has started happening in non-EAR builds occasionally now, for example https://gradle-enterprise.elastic.co/s/2rqpqaxoad4cm
Presumably something else is using up more space on the RAM disk that is used for the ES data directory in CI, and as a result there isn't always 4GB free as was expected.
I opened elastic/ml-cpp#1554 to move along the idea outlined in https://github.com/elastic/elasticsearch/issues/58806#issuecomment-659393024.
After making the changes of #64720 this test has not failed on master.
However, on 7.x it's now consistently failing in EAR builds due to the error from this bit of the C++ code:
if (totalMemoryUsage >= forecastJob.s_MaxForecastModelMemory &&
    forecastJob.s_TemporaryFolder.empty()) {
    this->sendErrorMessage(
        forecastJob, "Forecast cannot be executed as forecast memory usage is predicted to exceed " +
                         std::to_string(forecastJob.s_MaxForecastModelMemory) +
                         " bytes while disk space is exceeded");
    return false;
}
The message there is highly misleading, because it's not that disk space was exceeded but that no directory for overflow to disk was specified.
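Purely to illustrate why the wording is confusing (this is not the fix that was made), the branch could report the actual condition it tests, reusing the names from the snippet above:

// Illustrative sketch only: say "no overflow directory was provided" rather
// than implying that disk space was exceeded.
if (totalMemoryUsage >= forecastJob.s_MaxForecastModelMemory &&
    forecastJob.s_TemporaryFolder.empty()) {
    this->sendErrorMessage(
        forecastJob, "Forecast memory usage is predicted to exceed " +
                         std::to_string(forecastJob.s_MaxForecastModelMemory) +
                         " bytes and no directory for overflow to disk was provided");
    return false;
}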
So something is happening in EAR builds on 7.x that is stopping an overflow directory from being specified correctly.
The answer to why this isn't also happening on master is that for the last 9 days every single EAR build on master has failed due to lack of disk space before this test gets run - I opened #64763 for this.
As for why an overflow directory is not being specified correctly in the 7.x EAR builds: it turns out this is most likely because the Java side of forecasting decided there wasn't enough disk space. I have opened #64766, which will add extra logging to confirm this the next time the test fails.
This has failed again in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+EAR/336/consoleFull (build scan). We can now see the additional log lines added in https://github.com/elastic/elasticsearch/pull/64766:
03:03:16 » ERROR][o.e.x.m.p.l.CppLogMessageHandler] [javaRestTest-2] [categorization-stop-on-warn] [autodetect/525518] [CAnomalyJob.cc@1457] Cannot report memory usage because last finalized bucket end time (0) is smaller than bucket span (3600)
03:03:16 » WARN ][o.e.x.m.p.NativeStorageProvider] [javaRestTest-2] Failed to find native storage for [job-forecast-it-test-memory-status], returning null
03:03:16 » ERROR][o.e.x.m.p.l.CppLogMessageHandler] [javaRestTest-2] [forecast-it-test-memory-status] [autodetect/525623] [CForecastRunner.cc@487] Forecast cannot be executed as model memory status is not OK
03:03:16 » WARN ][o.e.x.m.p.NativeStorageProvider] [javaRestTest-2] Failed to find native storage for [job-forecast-it-test-overflow-to-disk], returning null
03:03:16 » ERROR][o.e.x.m.p.l.CppLogMessageHandler] [javaRestTest-2] [forecast-it-test-overflow-to-disk] [autodetect/525642] [CForecastRunner.cc@487] Forecast cannot be executed as forecast memory usage is predicted to exceed 20971520 bytes while disk space is exceeded
03:03:16 » WARN ][o.e.x.m.p.NativeStorageProvider] [javaRestTest-2] Failed to find native storage for [job-forecast-it-test-delete], returning null
03:03:16 » ↑ repeated 4 times ↑
03:03:16 » WARN ][o.e.x.m.p.NativeStorageProvider] [javaRestTest-2] Failed to find native storage for [job-forecast-it-test-failed-on-force-stop], returning null
03:03:16 » WARN ][o.e.x.m.p.NativeStorageProvider] [javaRestTest-2] Failed to find native storage for [job-forecast-it-test-single-series], returning null
03:03:16 » ↑ repeated 4 times ↑
03:03:16 » WARN ][o.e.x.m.p.NativeStorageProvider] [javaRestTest-2] Failed to find native storage for [job-forecast-it-test-delete-wildcard], returning null
03:03:16 » ↑ repeated 4 times ↑
03:03:16 » WARN ][o.e.x.m.p.NativeStorageProvider] [javaRestTest-2] Failed to find native storage for [job-forecast-it-test-no-data], returning null
03:03:16 » ERROR][o.e.x.m.p.l.CppLogMessageHandler] [javaRestTest-2] [forecast-it-test-no-data] [autodetect/526484] [CForecastRunner.cc@487] Forecast cannot be executed as job requires data to have been processed and modeled
@droberts195 can you please have a look?
Thanks for pasting the extra log messages.
I re-muted this test on 7.x in 7bc77837d799406cfb0285b969076640eda7d870
The problem is that the machines that run the EAR builds don't make enough disk space available in the partitions that CI needs it in, i.e. the problem of #64763. On master the EAR builds fail every time due to lack of disk space before reaching this test. On 7.x they often get as far as this test, which then causes noise (when it's unmuted) rather than the silent failures of master. But in both cases the fundamental problem is that there isn't enough disk space to run the ES test suites with encryption at rest.
This test also failed today on 7.x: https://gradle-enterprise.elastic.co/s/c2ccv6hst7hhs/tests/:x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest/org.elasticsearch.xpack.ml.integration.ForecastIT/testOverflowToDisk#1
But the error is now different from what has been reported here:
java.lang.AssertionError:
Expected: one of {<finished>}
     but: was <scheduled>
    at __randomizedtesting.SeedInfo.seed([6FE70EAB878C0D9B:2037B0F1EEF2042B]:0)
    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
    at org.junit.Assert.assertThat(Assert.java:956)
    at org.junit.Assert.assertThat(Assert.java:923)
    at org.elasticsearch.xpack.ml.integration.MlNativeAutodetectIntegTestCase.lambda$waitForecastStatus$1(MlNativeAutodetectIntegTestCase.java:301)
    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1016)
    at org.elasticsearch.xpack.ml.integration.MlNativeAutodetectIntegTestCase.waitForecastStatus(MlNativeAutodetectIntegTestCase.java:298)
    at org.elasticsearch.xpack.ml.integration.MlNativeAutodetectIntegTestCase.waitForecastToFinish(MlNativeAutodetectIntegTestCase.java:292)
Does this different value need a new issue?
To answer whether this different failure needs a new issue: it's not the same as the original one. The original problem was that some workers have very little disk space, and that was making the test that uses disk space fail.
This new one is that a forecast took more than 1 minute to complete when we expect it to complete within 1 minute. The reason will be that 16 Gradle tasks were running at the same time, several of which were heavyweight integration test suites - see https://gradle-enterprise.elastic.co/s/c2ccv6hst7hhs/timeline. These tests were also using the FIPS JVM, so they may have been even slower than usual. But it's a different problem to the disk space issue.
The issue for this test failing in the FIPS JVM due to slowness is #63793. Let's keep this one for the problem of not enough disk space.
I'm going to close this because the change that was made in https://github.com/elastic/elasticsearch/commit/7bc77837d799406cfb0285b969076640eda7d870 to not run forecasting overflow to disk on EAR builds is as good as a fix is going to get.
This test has failed three times in the past 2 weeks
https://gradle-enterprise.elastic.co/s/mp3vxqny35j4e
Output:
I'm not too familiar with this part of the system, but I wonder whether we truly have less than 200MB available on disk when running the test (in that case we could possibly raise it as an infrastructure issue), or whether we should perhaps have the test skipped when these conditions are not met.