elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Elasticsearch fails to startup with valid fsize and virtual memory limits #113705

Open nathan-maves opened 1 week ago

nathan-maves commented 1 week ago

Elasticsearch Version

8.15

Installed Plugins

No response

Java Version

bundled

OS Version

Linux

Problem Description

We have found a few issues with the Bootstrap checks on Linux/Unix machines. The first is that a value of -1 should be accepted along with unlimited and infinity based on this documentation.

All items support the values -1, unlimited or infinity indicating no limit, except for priority, nice, and nonewprivs. If nofile is to be set to one of these values, it will be set to the contents of /proc/sys/fs/nr_open instead (see setrlimit(3)).
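The rule quoted above can be sketched as a small parser. This is only an illustration, assuming a hypothetical `LimitValueParser` class and an `UNLIMITED` sentinel of `Long.MAX_VALUE`; neither name is taken from the Elasticsearch code base.

```java
import java.util.Locale;

/** Hypothetical helper (not Elasticsearch code): parses one limits.conf value. */
final class LimitValueParser {
    /** Sentinel meaning "no limit"; the concrete value is an assumption of this sketch. */
    static final long UNLIMITED = Long.MAX_VALUE;

    private LimitValueParser() {}

    /** Maps -1, unlimited and infinity to UNLIMITED; anything else must be a non-negative number. */
    static long parse(String raw) {
        String value = raw.trim().toLowerCase(Locale.ROOT);
        switch (value) {
            case "-1":
            case "unlimited":
            case "infinity":
                return UNLIMITED;
            default:
                // Long.parseLong throws NumberFormatException for non-numeric input.
                long parsed = Long.parseLong(value);
                if (parsed < 0) {
                    throw new IllegalArgumentException("negative limit other than -1: " + raw);
                }
                return parsed;
        }
    }
}
```

With this shape, `LimitValueParser.parse("-1")`, `parse("unlimited")` and `parse("infinity")` all yield the same sentinel, so a later "is this unlimited?" check never has to know which spelling was used.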

The second issue is that this code appears to be incorrect.

It should be checking the max file size, NOT the max virtual memory size.

long getMaxFileSize() {
    return NativeAccess.instance().getProcessLimits().maxFileSize();
}
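A unit test with two deliberately different values would catch this kind of accessor mix-up. The sketch below is an assumption-laden illustration: `FakeProcessLimits` and `FileSizeLimitReader` are hypothetical stand-ins, and only the accessor names `maxFileSize` and `maxVirtualMemorySize` come from this issue.

```java
/** Hypothetical stand-in for the limits holder; not the real Elasticsearch type. */
final class FakeProcessLimits {
    private final long maxVirtualMemorySize;
    private final long maxFileSize;

    FakeProcessLimits(long maxVirtualMemorySize, long maxFileSize) {
        this.maxVirtualMemorySize = maxVirtualMemorySize;
        this.maxFileSize = maxFileSize;
    }

    long maxVirtualMemorySize() {
        return maxVirtualMemorySize;
    }

    long maxFileSize() {
        return maxFileSize;
    }
}

/** Hypothetical reader showing the intended accessor for the file size check. */
final class FileSizeLimitReader {
    private FileSizeLimitReader() {}

    static long getMaxFileSize(FakeProcessLimits limits) {
        // The file size check must read maxFileSize(), not maxVirtualMemorySize().
        return limits.maxFileSize();
    }
}
```

Feeding the reader a limits object whose two fields differ (for example 111 and 222) makes a swapped accessor fail immediately, whereas a machine where both limits are unlimited would hide the problem.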

Steps to Reproduce

Set the fsize value to -1 in the /etc/security/limits.conf file then start up Elastic 8.15.x.

Logs (if relevant)

[2024-09-25T17:00:39,398][ERROR][o.e.b.Elasticsearch ] [node-f41036c7-370b-4665-a0d0-679e2bedef84] node validation exception
[2] bootstrap checks failed. You must address the points described in the following [2] lines before starting Elasticsearch. For more information see [https://www.elastic.co/guide/en/elasticsearch/reference/8.15/bootstrap-checks.html]
bootstrap check failure [1] of [2]: max size virtual memory [-1] for user [###] is too low, increase to [unlimited]; for more information see [https://www.elastic.co/guide/en/elasticsearch/reference/8.15/max-size-virtual-memory-check.html]
bootstrap check failure [2] of [2]: max file size [-1] for user [###] is too low, increase to [unlimited]; for more information see [https://www.elastic.co/guide/en/elasticsearch/reference/8.15/_max_file_size_check.html]

elasticsearchmachine commented 1 week ago

Pinging @elastic/es-core-infra (Team:Core/Infra)

prdoyle commented 1 week ago

For the second part... maxVirtualMemorySize certainly looks like a bug. @rjernst - do you agree it should be calling maxFileSize? The code came from here.

prdoyle commented 1 week ago

For the first part, I think we want to change this Long.MIN_VALUE to just -1 like it is elsewhere.

prdoyle commented 1 week ago

I've changed it to accept both Long.MIN_VALUE and -1. A comment in the unit test seems to suggest that this value can be MIN_VALUE if the size is "not available".

nathan-maves commented 1 week ago

I think you might want to add the -1 to this check as well.

https://github.com/elastic/elasticsearch/blob/ee24f84df0265f5b8bc0baa3dadde6516cf3c073/server/src/main/java/org/elasticsearch/bootstrap/BootstrapChecks.java#L384

There could be others too.

prdoyle commented 1 week ago

I believe @rjernst is also looking at this now.

prdoyle commented 1 week ago

My original -1 fix was incorrect. The code is already supposed to be turning that -1 into ProcessLimits.UNLIMITED here (where constants.RLIMIT_INFINITY is defined to be -1L by the first parameter here).

prdoyle commented 1 week ago

@nathan-maves - what happens if you try to use -1?

nathan-maves commented 1 week ago

You can see in the logs I added to the issue we already have the system set to -1. So the code is reading in the value of -1 and telling us that it is too low.

max file size [-1] for user [####] is too low

rjernst commented 1 week ago

Are there any other log messages like "unable to retrieve max size virtual memory"? We translate the RLIMIT_INFINITY value for each system into our own (which is represented by MAX_INT). I suspect what is happening here is that the rlimit call failed, which then stores our own UNKNOWN (-1), but the bootstrap checks aren't currently specializing the error message for that case, so it looks as if -1 was not handled.
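One way to specialize the message for the "unknown" case might look like the sketch below. It is not the actual bootstrap check: `LimitCheckMessages` is a hypothetical class, `UNKNOWN = -1` follows the comment above, and `UNLIMITED = Long.MAX_VALUE` is an assumption standing in for whatever sentinel the real code uses.

```java
import java.util.Optional;

/** Hypothetical sketch (not Elasticsearch code) of a limit check that reports "unknown" separately. */
final class LimitCheckMessages {
    /** Assumed sentinel stored when the rlimit lookup fails. */
    static final long UNKNOWN = -1L;
    /** Assumed sentinel for "no limit". */
    static final long UNLIMITED = Long.MAX_VALUE;

    private LimitCheckMessages() {}

    /** Returns an empty Optional when the check passes, otherwise a failure message. */
    static Optional<String> failureMessage(String limitName, long value) {
        if (value == UNLIMITED) {
            return Optional.empty();
        }
        if (value == UNKNOWN) {
            // A failed lookup is reported as such instead of being printed as a limit of -1.
            return Optional.of("could not determine " + limitName + "; the rlimit lookup may have failed");
        }
        return Optional.of(limitName + " [" + value + "] is too low, increase to [unlimited]");
    }
}
```

With this split, a failed lookup no longer produces a message that reads as if a configured value of -1 had been rejected.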

rjernst commented 1 week ago

The issue description mentions "Unix" as the OS. Do you mean Linux, and if so, what distribution? We do not support any Unix distributions, and I can see how that might be an issue (our rlimit calls are probably not set up right for Unix).

nathan-maves commented 1 week ago

That was my bad. I am pretty sure we support RHEL and Rocky Linux.

prdoyle commented 1 week ago

The merged PR only fixes one of the reported problems. The "-1 problem" still exists.

nathan-maves commented 6 days ago

Is there anything your team needs from me?

A member of my team tried both "unlimited" and "-1", and ES 8.15 would not start up on Debian Linux. This might stem from the issue you fixed, as the code is not reading the correct setting value. Is there any chance we can get a build with the fix to test things out?

prdoyle commented 1 hour ago

Hey @nathan-maves - I think we have everything we need. I'll try to reproduce today and reach out if I'm unable to do so.

prdoyle commented 48 minutes ago

Actually @nathan-maves - can you please confirm that you have no files in /etc/security/limits.d?
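For anyone wanting to check this programmatically, here is a minimal sketch that lists drop-in files in that directory. It assumes that only `*.conf` files there are relevant and that they can override entries in /etc/security/limits.conf; `LimitsDropIns` is a hypothetical name, not part of Elasticsearch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: lists *.conf drop-in files that may override limits.conf. */
final class LimitsDropIns {
    private LimitsDropIns() {}

    /** Returns the *.conf files directly inside dir, sorted by path; empty if dir is not a directory. */
    static List<Path> confFiles(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return Collections.emptyList();
        }
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                .filter(p -> p.getFileName().toString().endsWith(".conf"))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Prints one path per drop-in file; prints nothing if there are none.
        for (Path p : confFiles(Paths.get("/etc/security/limits.d"))) {
            System.out.println(p);
        }
    }
}
```

An empty result suggests that limits.conf alone is in effect; any file listed is worth inspecting for fsize or as entries.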