Closed droberts195 closed 2 years ago
Pinging @elastic/ml-core (Team:ML)
This happened on an Ubuntu 22.04 worker. It almost certainly means the system call filter in the ML native processes needs adjusting for a new kernel version.
/cc @bytebilly please don't add Ubuntu 22.04 to the Elasticsearch support matrix until this issue is fixed. It seems that until this is fixed ML is completely broken on this distribution. I will aim for 7.17.5/8.2.2/8.3.0.
@droberts195 are we now good to add Ubuntu 22.04 to the general rotation as well as the testing matrix for 7.17?
The ML native processes will now work on Ubuntu 22.04 starting with 8.3.0, 8.2.2 and 7.17.5. But they'll never work for older versions. This is going to be problematic with the BWC tests. Any build that runs the X-Pack BWC tests against versions older than 8.3.0/8.2.2/7.17.5 on Ubuntu 22.04 is going to fail, and since we can't re-release those old versions that's going to be a problem forever.
Therefore we should probably do two things:
- Disable all the ML BWC tests if we detect the old version is before 8.3.0/8.2.2/7.17.5 and the glibc version (which can be got from `ldconfig --version`) is 2.35 or above
- Don't use Ubuntu 22.04 for PR builds, because otherwise ML BWC breakages will creep through into the periodic builds
It's interesting that this has come about because our system call filtering (which was added to improve security/reduce attack surface in the event of a breach) has also defeated the Linux developers' BWC efforts. You'd expect a recent version of Linux to run all the software that older versions from the previous few years could run, and usually this would be the case with Ubuntu 22.04 and Ubuntu 20.04, but our system call filter prevents it. If we keep the system call filter then this is going to happen again in the future.
@droberts195 are we now ok to add Ubuntu 22.04 to the support matrix of supported operating systems for 8.3/7.17?
> are we now ok to add Ubuntu 22.04 to the support matrix of supported operating systems for 8.3/7.17?
8.3 is fine. 7.17 needs to specifically say 7.17.5 and above. 7.17.0-7.17.4 will never work.
The matrix doesn't have this granularity, so I added a footnote to mention that.
> - Disable all the ML BWC tests if we detect the old version is before 8.3.0/8.2.2/7.17.5 and the glibc version (which can be got from `ldconfig --version`) is 2.35 or above
What's the best way to do this? Can we do this in the tests themselves with assertions?
> - Don't use Ubuntu 22.04 for PR builds, because otherwise ML BWC breakages will creep through into the periodic builds
This actually isn't a problem for PR builds since we only test snapshot versions there and those will include the fix. Only the periodic BWC builds are an issue.
> What's the best way to do this? Can we do this in the tests themselves with assertions?
Most of the BWC tests are YAML tests. I think the best way to skip those would be to conditionally add an entry to `tests.rest.blacklist` that is `*/*_ml_*/*` if the glibc version is 2.34 or above and the old version being upgraded from is < 7.17.5 or >= 8.0.0 and <= 8.2.2.
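The skip condition described above can be sketched as a small predicate. This is only an illustration: the class and method names are hypothetical, the bounds are copied literally from the comment above (glibc 2.34 or above; old version < 7.17.5, or >= 8.0.0 and <= 8.2.2), and it assumes plain numeric version strings with no `-SNAPSHOT` or similar suffix.

```java
import java.util.Arrays;

/** Hypothetical helper: decides whether the ML BWC YAML tests should be blacklisted. */
final class MlBwcSkipRule {

    /** Parses "major.minor[.patch]" into three ints; missing parts count as 0. */
    static int[] parse(String version) {
        String[] parts = version.trim().split("\\.");
        int[] out = new int[3];
        for (int i = 0; i < out.length && i < parts.length; i++) {
            out[i] = Integer.parseInt(parts[i]);
        }
        return out;
    }

    /** Lexicographic comparison of the numeric parts (negative, zero or positive). */
    static int compare(String a, String b) {
        return Arrays.compare(parse(a), parse(b));
    }

    /** True when the old cluster's ML processes are expected not to start on this glibc. */
    static boolean shouldSkipMlTests(String glibcVersion, String oldClusterVersion) {
        // Bounds as stated in the comment above; adjust them if the fixed versions differ.
        boolean newGlibc = compare(glibcVersion, "2.34") >= 0;
        boolean old7x = compare(oldClusterVersion, "7.17.5") < 0;
        boolean old8x = compare(oldClusterVersion, "8.0.0") >= 0
            && compare(oldClusterVersion, "8.2.2") <= 0;
        return newGlibc && (old7x || old8x);
    }
}
```

When the predicate returns true, the build would add `*/*_ml_*/*` to `tests.rest.blacklist`; how the two version strings reach this code is left open here.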
So to do that we'd somehow need to get Gradle to know the glibc version. It can be done by running `ldd --version | grep '^ldd' | sed 's/.* \([1-9]\.[0-9]*\).*/\1/'` on Linux. Or, if it's easier, just `ldd --version` can be run as an external command and the text processing can be done in the Gradle script.
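As a rough illustration of doing the text processing outside the shell, the following sketch extracts the version from `ldd --version` output with a regular expression that mirrors the `grep`/`sed` pipeline above. The class and method names are hypothetical, and it assumes the first line of output looks like `ldd (GNU libc) 2.17` or `ldd (Ubuntu GLIBC 2.35-0ubuntu3) 2.35`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that extracts the glibc version from `ldd --version` output. */
final class GlibcVersionParser {

    // A line starting with "ldd" whose last token is "major.minor".
    private static final Pattern LDD_LINE =
        Pattern.compile("^ldd.*[ \\t](\\d+\\.\\d+)[ \\t]*$", Pattern.MULTILINE);

    /** Returns e.g. "2.35", or empty if no matching line is found. */
    static Optional<String> fromLddOutput(String output) {
        Matcher m = LDD_LINE.matcher(output);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Runs `ldd --version`; throws IOException where the command does not exist. */
    static Optional<String> detect() throws IOException, InterruptedException {
        Process p = new ProcessBuilder("ldd", "--version").redirectErrorStream(true).start();
        String out = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        p.waitFor();
        return fromLddOutput(out);
    }
}
```

Keeping the parsing in a pure function separate from the process invocation means the same logic works whether the output comes from a build script, a test, or a value captured elsewhere.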
Is it possible to make Gradle run an external command during the configuration phase rather than as a task?
Then there are also a few BWC tests that are written in Java rather than YAML. Like you say, those can `assumeFalse` on the glibc version if it can be made available to them. So maybe we just have Gradle set a system property that contains it to pass it through.
I don't think it will be too hard if you could just recommend the best way to get Gradle to run `ldd --version` early enough that the configuration of the test tasks can know the answer.
> Is it possible to make Gradle run an external command during the configuration phase rather than as a task?
It is, but it's highly discouraged since it's expensive to do so and adds overhead to every build invocation. That was my thought behind doing this in the test itself, since we'd only do it when attempting to execute the test. I'm wondering if we could implement such a filter in JUnit, even for the YAML tests. I'll have a look at this.
Alternatively, since this only applies to the BWC jobs, maybe we could inject the glibc version as an environment variable or something so we don't have to shell out to `ldd` during build configuration.
> maybe we could inject the glibc version as an environment variable
Yes, that's a good idea. We could potentially add it to the per-worker Jenkins configuration for Linux workers. Then both the `build.gradle` for the YAML tests and the Java test classes would be able to access it.
Another thing we could potentially do is have the early bootstrap of the Java code (before installing system call filters) call this function using JNA and store the result in a variable that's available to other code later on. That would work nicely for the Java tests. But for the YAML tests we'd need to implement a new type of `skip` rule that could consider both glibc version and old cluster version. And that is problematic because all the client test harnesses have to understand the YAML syntax.
So, overall, adding a worker-specific environment variable is probably best.
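A minimal sketch of how a Java test might consume such a variable is shown below. The variable name `GLIBC_VERSION` is an assumption for illustration only (the thread does not name one), as are the class and method names. A missing or malformed value is treated as "not affected", so workers without the variable behave as before.

```java
import java.util.Map;

/** Hypothetical helper that reads the glibc version from a worker-provided environment variable. */
final class GlibcEnvCheck {

    // Assumed variable name; the real name would be whatever the CI configuration defines.
    static final String ENV_VAR = "GLIBC_VERSION";

    /** True if the variable is set to a version that is at least major.minor. */
    static boolean glibcAtLeast(Map<String, String> env, int major, int minor) {
        String raw = env.get(ENV_VAR);
        if (raw == null || raw.isBlank()) {
            return false; // variable not set, e.g. on non-Linux workers
        }
        String[] parts = raw.trim().split("\\.");
        try {
            int maj = Integer.parseInt(parts[0]);
            int min = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
            return maj > major || (maj == major && min >= minor);
        } catch (NumberFormatException e) {
            return false; // malformed value: do not skip anything
        }
    }
}
```

A Java BWC test could then pass `GlibcEnvCheck.glibcAtLeast(System.getenv(), 2, 35)`, combined with a check on the old cluster version, to `assumeFalse`. Taking the environment as a `Map` keeps the check testable without modifying the real process environment.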
@droberts195 Do we have an exhaustive list of all the tests we should mute in this scenario? I notice that not all ML tests fail: https://gradle-enterprise.elastic.co/s/36sbiahfwjjqc/tests/overview?class=org.elasticsearch.xpack.test.rest.XPackRestIT&test=test%20%7Bp0%3Dml/*
Should we blanket-skip all ML tests in BWC scenarios or only individual ones? I'm leaning towards the former so we don't find ourselves in a whack-a-mole situation.
@mark-vieira yes, I agree we should mute all the ML BWC YAML tests when we detect the OS is too new for the old version to work. Otherwise, like you say, almost every newly added test is likely to need another iteration of observing failures, opening issues and adding to the list of tests to mute.
The ones that work currently will be the ones that don't use any ML C++ functionality. But those ones are unlikely to fail in platform-specific ways, so there's not much point adding extra complexity to test them on a distribution where the rest of ML doesn't work.
The problem is not related to any particular test. The `autodetect` process couldn't run due to permission denied starting a thread. The same thing happened every time it was run in this test suite:
Build scan: https://gradle-enterprise.elastic.co/s/36sbiahfwjjqc/tests/:x-pack:plugin:yamlRestTest/org.elasticsearch.xpack.test.rest.XPackRestIT/test%20%7Bp0=ml%2Fjobs_crud%2FTest%20reopen%20job%20resets%20the%20finished%20time%7D
Reproduction line:
./gradlew ':x-pack:plugin:yamlRestTest' --tests "org.elasticsearch.xpack.test.rest.XPackRestIT.test {p0=ml/jobs_crud/Test reopen job resets the finished time}" -Dtests.seed=37660F157587F87B -Dtests.locale=da -Dtests.timezone=Asia/Thimphu -Druntime.java=18
Applicable branches: 8.2
Reproduces locally?: No
Failure history: https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.test.rest.XPackRestIT&tests.test=test%20%7Bp0%3Dml/jobs_crud/Test%20reopen%20job%20resets%20the%20finished%20time%7D
Failure excerpt: