Closed sabi0 closed 10 months ago
This must be something on your system preventing socket opening. No idea what though.
I've checked the repro line and seed, works for me. On a second glance, it looks like the security policy of your Java is restricting socket accept. Is it possible that you have such a policy in place (some corporate setup, perhaps)?
I have a vanilla Amazon Corretto JDK 17. And it has this:
// allows anyone to listen on dynamic ports
permission java.net.SocketPermission "localhost:0", "listen";
As far as I can see this is a common practice. Zulu JDK 11 also has the exact same permission. As well as JDK 8.
There is also gradle\testing\randomization\policies\tests.policy
in the project itself that opens it a bit more:
// TestLockFactoriesMultiJVM opens a random port on 127.0.0.1 (port 0 = ephemeral port range):
permission java.net.SocketPermission "127.0.0.1:0", "accept,listen,resolve";
After changing the project's tests.policy
to
permission java.net.SocketPermission "127.0.0.1:1024-", "accept,listen,resolve";
the tests pass on my side.
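The effect of widening the port spec can be checked without running the test suite. The following is only a sketch: the class name WidenedPolicyCheck and the helper allowsAccept are made up for illustration, and 20022 is an arbitrary port above 1024.

```java
import java.net.SocketPermission;

public class WidenedPolicyCheck {
    /** Returns true if a policy entry with the given port spec covers accepting from 127.0.0.1:remotePort. */
    static boolean allowsAccept(String policyPorts, int remotePort) {
        SocketPermission policy =
            new SocketPermission("127.0.0.1:" + policyPorts, "accept,listen,resolve");
        SocketPermission request =
            new SocketPermission("127.0.0.1:" + remotePort, "accept,resolve");
        return policy.implies(request);
    }

    public static void main(String[] args) {
        // "1024-" means port 1024 and above, so this does not depend on the ephemeral range.
        System.out.println(allowsAccept("1024-", 20022));
        // With port spec "0" the answer depends on what the JDK considers the ephemeral range.
        System.out.println(allowsAccept("0", 20022));
    }
}
```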
The special port value 0 refers to the entire ephemeral port range. This is a fixed range of ports a system may use to allocate dynamic ports from. The actual range may be system dependent.
I guess the "ephemeral range" on my system does not include 200xx ports?
Shall I open a PR with this change?
Though I see the server also uses a dynamic port:
s.bind(new InetSocketAddress(hostname, 0));
Maybe the "ephemeral range" in the server JVM is different from the ranges in the clients' JVMs?
You're the first person among many (including CIs) to have experienced this problem, so I'd look at what exactly is causing this first - is it the JDK distribution, is it something else? Port "0" indicates any available port so it should work fine in my opinion - I'm not a network guru though.
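For reference, a minimal sketch of what binding to port 0 does: the OS picks a free port and the socket reports it afterwards. The class and method names are hypothetical and not part of the Lucene test.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class DynamicPortDemo {
    /** Binds to port 0 on the loopback address and returns the port the OS picked. */
    static int bindToDynamicPort() {
        try (ServerSocket s = new ServerSocket()) {
            s.bind(new InetSocketAddress("127.0.0.1", 0));
            return s.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The printed value varies per run and per OS configuration.
        System.out.println("OS assigned port " + bindToDynamicPort());
    }
}
```

Which range the port comes from is decided by the OS network stack, not by the JVM.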
Hi, please let's not change this without understanding what the problem is. We have not seen this issue anywhere (not even on Solr where this is used for almost every test). Can you check with another non-Corretto JDK? I have the feeling that maybe Corretto applied some changes to the permissions. If that's the case, report it to them.
Though I see the server also uses a dynamic port:
s.bind(new InetSocketAddress(hostname, 0));
Maybe the "ephemeral range" in the server JVM is different from the ranges in the clients' JVMs?
The client and the server are the same JVM version with same options.
I've downloaded Corretto (Windows 10):
>java -version
openjdk version "11.0.21" 2023-10-17 LTS
OpenJDK Runtime Environment Corretto-11.0.21.9.1 (build 11.0.21+9-LTS)
OpenJDK 64-Bit Server VM Corretto-11.0.21.9.1 (build 11.0.21+9-LTS, mixed mode)
and I ran the repro line:
gradlew test --tests TestStressLockFactories.testNativeFSLockFactory -Dtests.seed=2D42F3FDF1FAF153 -Dtests.locale=sg -Dtests.timezone=Australia/Lindeman -Dtests.asserts=true -Dtests.file.encoding=UTF-8
Works for me. It's got to be something other than the JDK, I guess?
How did you run it with Java 11? When I try that I get
ERROR: java version must be between 17 and 21, your version: 11
I ran the test with Oracle's Java 17 and it failed in the same way:
1> Listening on /127.0.0.1:12778...
> java.security.AccessControlException: access denied ("java.net.SocketPermission" "127.0.0.1:12779" "accept,resolve")
> at __randomizedtesting.SeedInfo.seed([2D42F3FDF1FAF153:31931CEF68004D20]:0)
> at java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:485)
How did you run it with Java 11? When I try that I get
ERROR: java version must be between 17 and 21, your version: 11
Hmm... I might have been on branch_9x - didn't check, sorry.
I think the problem you're getting is not due to the JDK but to something preventing processes from binding to local ports. I suspect a firewall rule, perhaps? Nobody else is getting this exception... Can you try it on a different system, perhaps?
I checked again, this time making sure it's Java 21 and the main branch:
I think the problem you're getting is not due to the JDK but to something preventing processes from binding to local ports. I suspect a firewall rule, perhaps?
Then changing gradle\testing\randomization\policies\tests.policy
from
permission java.net.SocketPermission "127.0.0.1:0", "accept,listen,resolve";
to
permission java.net.SocketPermission "127.0.0.1:1024-", "accept,listen,resolve";
wouldn't have helped, I suppose. But it did.
What do you think of catching this AccessControlException
and wrapping it with AssumptionViolatedException ?
Hi,
Sorry no. The test is fine. This test and many more like it have existed for years. There's no need to change them or the policy file. Passing 0 as the port number in the policy file is correct, because we want to prevent anybody from writing a test with a fixed port number. All ports must be ephemeral.
Unless you give a clear explanation why it fails for you and there is no workaround, we won't change this test. This is definitely a problem in your setup. This test does not fail with any JDK out there.
Thanks, Uwe
You can always work around it by running the tests without security manager. Read the gradle documentation about the responsible system properties, e.g. -Ptests.useSecurityManager=false.
My understanding of the situation is the following:
Dynamic / ephemeral is only applicable to a local port. Thus permission 127.0.0.1:0/listen
allows to bind to a dynamic local port.
But when accepting a connection from some remote port local system's "ephemeral port range" is not applicable. And the permission 127.0.0.1:0/accept
does not work.
I have no idea why this only happens to me. And I understand your position of not wanting to change the test or the policy.
Just in case this might help someone else the tests also pass with the following permissions:
permission java.net.SocketPermission "127.0.0.1:0", "listen,resolve";
permission java.net.SocketPermission "127.0.0.1:*", "accept,resolve";
I'd like to understand why your system is different than mine (or Uwe's)... It's great that you've found a workaround but it doesn't explain what's happening and - as Uwe mentioned - it's been working fine for everyone for years - there's something different in your setup that requires this workaround and it'd be interesting to figure out what it is!
Do you use multiple network interfaces? Are these normal network adapters or something else? It's really unfortunate that it doesn't work for you out of the box. Strange!
My assumption was wrong. When the permission has port 0 the remote port number is validated against the local system's "ephemeral port range":
if (policyLow == 0 && policyHigh == 0) {
// ephemeral range only
return targetLow >= ephemeralLow && targetHigh <= ephemeralHigh;
}
The range itself is defined by jdk.net.ephemeralPortRange.low
/ jdk.net.ephemeralPortRange.high
system properties.
And when those are not set the range defaults to 49152 - 65535:
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/net/SocketPermission.java#L1228
So on my system this prints "false":
SocketPermission policy = new SocketPermission("127.0.0.1:0", "accept,listen");
SocketPermission request = new SocketPermission("127.0.0.1:20022", "accept");
System.out.println(policy.implies(request));
and this prints "true":
SocketPermission policy = new SocketPermission("127.0.0.1:0", "accept,listen");
SocketPermission request = new SocketPermission("127.0.0.1:50123", "accept");
System.out.println(policy.implies(request));
Probably the "ephemeral port range" in the network stack and in the SocketPermission are somehow out of sync?
I found this snippet in DNSDatagramSocketFactory.open()
javadoc:
if binding a socket to port 0 binds it to a random port) then the underlying OS implementation is used. Otherwise, this method will allocate and bind a socket on a randomly selected ephemeral port in the dynamic range.
So when OS allocates a random port it does not necessarily fall in the JVM's ephemeral port range?
This does not break 127.0.0.1:0/listen
because the permission is checked before binding (when the actual port number is still not known). But 127.0.0.1:0/accept
is out of luck.
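To see which range a given JVM actually applies to a port-0 permission, one can probe implies() over all ports. This is only a sketch with hypothetical names. The result depends on the platform and on the jdk.net.ephemeralPortRange.* system properties, so no expected output is given.

```java
import java.net.SocketPermission;

public class EphemeralRangeProbe {
    /** Returns {low, high} of the ports covered by "127.0.0.1:0"/accept, or null if none are. */
    static int[] probe() {
        SocketPermission policy = new SocketPermission("127.0.0.1:0", "accept,listen");
        int low = -1;
        int high = -1;
        for (int port = 1; port <= 65535; port++) {
            SocketPermission request = new SocketPermission("127.0.0.1:" + port, "accept");
            if (policy.implies(request)) {
                if (low < 0) {
                    low = port; // first port the policy covers
                }
                high = port;    // last port seen so far that the policy covers
            }
        }
        return low < 0 ? null : new int[] {low, high};
    }

    public static void main(String[] args) {
        int[] range = probe();
        System.out.println(range == null
            ? "no port is covered"
            : "port 0 policy covers " + range[0] + "-" + range[1]);
    }
}
```

Comparing the printed range with the ports shown in the "Listening on" lines of the test output shows whether the two are out of sync.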
Note: I corrected this answer.
Hi, it looks like your linux kernel has an extended ephemeral port range. The RFC defines it to be 49152-65535 (see RFC 6335).
The range itself is defined by
jdk.net.ephemeralPortRange.low
/jdk.net.ephemeralPortRange.high
system properties. And when those are not set the range defaults to 49152 - 65535: https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/net/SocketPermission.java#L1228
This is not fully true. If the system properties are undefined (which is the default), it uses 49152-65535 only on Windows (as Windows adheres to the standard). The default range is defined in the PortConfig class, which has several implementations depending on the operating system. For Linux it uses this class: https://github.com/openjdk/jdk/blob/28c82bf18d85be00bea45daf81c6a9d665ac676f/src/java.base/unix/classes/sun/net/PortConfig.java#L36; for Windows it uses: https://github.com/openjdk/jdk/blob/28c82bf18d85be00bea45daf81c6a9d665ac676f/src/java.base/windows/classes/sun/net/PortConfig.java#L33
In short, the default range is platform dependent. On Linux, the JDK starts with a hardcoded default range and later also reads the platform's settings (see below):
case LINUX:
defaultLower = 32768;
defaultUpper = 61000;
This matches the defaults in Linux kernel variables (see also Linux source code):
# sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768 60999
Because one can change those variables with the sysctl command or via /etc/sysctl.conf or /etc/sysctl.d, the JDK also reads the current values in native code from the /proc filesystem: https://github.com/openjdk/jdk/blob/28c82bf18d85be00bea45daf81c6a9d665ac676f/src/java.base/unix/native/libnet/portconfig.c#L50-L62
But this could fail, if for example the /proc/sys/net
path is not available/readable (due to selinux, firewall, virus scanner/...). It then returns -1 and then the code falls back to the hardcoded default 32768-61000. So make sure that Java has enough access rights to read the /proc filesystem. If you use Docker images or similar you are on your own.
So please print what your current kernel uses by executing sysctl net.ipv4.ip_local_port_range
and if it does not adhere to the default please fix your config. Alternatively set the system properties.
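The read-then-fall-back behaviour described above can be sketched in plain Java. The JDK does this in native code, so this is only an illustration. The class and method names are hypothetical. The path is the /proc entry behind the net.ipv4.ip_local_port_range sysctl.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LinuxPortRange {
    static final Path PROC = Path.of("/proc/sys/net/ipv4/ip_local_port_range");

    /** Parses the "low high" text found in the /proc entry (whitespace separated). */
    static int[] parse(String text) {
        String[] parts = text.trim().split("\\s+");
        return new int[] {Integer.parseInt(parts[0]), Integer.parseInt(parts[1])};
    }

    /** Reads the kernel setting, falling back to the hardcoded 32768-61000 if unreadable. */
    static int[] readOrDefault() {
        try {
            return parse(Files.readString(PROC));
        } catch (IOException | RuntimeException e) {
            // Missing file (non-Linux), no read access, or unexpected content.
            return new int[] {32768, 61000};
        }
    }

    public static void main(String[] args) {
        int[] range = readOrDefault();
        System.out.println(range[0] + "-" + range[1]);
    }
}
```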
Oh I see you have Windows. On Windows it uses a hardcoded range. It is not dynamic, so it looks like your windows system has changed it away from the defaults!
We can't help with that. Please fix your Windows installation or open a bug report with OpenJDK asking them to make the Windows range dynamic.
On Windows the defaults can also be changed, but OpenJDK does not read those settings. Here is my Windows example (Windows 10):
> netsh int ipv4 show dynamicport tcp
Protocol tcp dynamic port range
---------------------------------
Start port : 49152
Number of ports : 16384
So either fix your Windows network stack to use the defaults or open a bug report to fix the hard coded range in OpenJDK.
Here's how to change the settings (you may need to persist them): https://learn.microsoft.com/en-us/troubleshoot/windows-server/networking/default-dynamic-port-range-tcpip-chang
Indeed, my machine has this:
Protocol tcp Dynamic Port Range
---------------------------------
Start Port : 1024
Number of Ports : 20977
I do not know why this was changed by our corp. IT. I guess they had a reason to.
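A quick arithmetic check of the two netsh outputs, as a sketch with hypothetical names: the last port of a range is start + count - 1, and the two ranges can then be compared with the port from the AccessControlException (12779).

```java
public class WindowsRangeCheck {
    /** Last port of a netsh-style range given its start port and number of ports. */
    static int lastPort(int start, int count) {
        return start + count - 1;
    }

    /** True if port lies within [low, high], both inclusive. */
    static boolean inRange(int port, int low, int high) {
        return port >= low && port <= high;
    }

    public static void main(String[] args) {
        int osLow = 1024;
        int osHigh = lastPort(1024, 20977);    // 22000, the range configured on this machine
        int jdkLow = 49152;
        int jdkHigh = lastPort(49152, 16384);  // 65535, the Windows default quoted in this thread

        System.out.println("OS range:  " + osLow + "-" + osHigh);
        System.out.println("JDK range: " + jdkLow + "-" + jdkHigh);
        // The two ranges do not overlap, so an OS-assigned port never matches the port-0 policy.
        System.out.println("overlap: " + (osLow <= jdkHigh && jdkLow <= osHigh));
        System.out.println("12779 in OS range:  " + inRange(12779, osLow, osHigh));
        System.out.println("12779 in JDK range: " + inRange(12779, jdkLow, jdkHigh));
    }
}
```

With these numbers every port the OS hands out (1024-22000) lies outside the range the JDK assumes (49152-65535), which is consistent with the accept check failing.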
I agree that OpenJDK not reading the OS settings is the root cause. But suggesting everyone to "fix their Windows" or wait for some Java 27 to fix this or otherwise run the tests with the SecurityManager completely off is unnecessarily rigid IMO.
I believe this is a perfect fit for AssumptionViolatedException. The test assumes the ports are allocated within the JDK's ephemeral port range. Now we know for a fact that this might not be the case on Windows. That is an assumption violation, not a test failure. I can open a PR for that if you want. Otherwise I suggest we all move on.
P.S. Another similar case I ran into recently: creating a symlink on Windows with UAC requires elevated permissions causing Elasticsearch test to fail.
Sorry, your computer does not conform to the standard. Please report this to your organisation.
There is no reason to change anything in Lucene.
Please also run the Apache Solr tests. They use the same config. Applying your proposed fix would disable all integration tests. This is not the correct way to fix this.
So an "assume" here isn't the correct way to fix it.
P.S.: in Lucene we won't throw assumption violation exceptions. We have LTC#assumeTrue for this.
sorry, i'm late to the party. yes, the entire purpose of this is to ensure tests only use ephemeral ports when binding. otherwise there will be port conflicts. so we should not be lenient about it.
seems like any issue here is in the JDK not respecting the operating system's configuration: not in lucene.
Thanks Robert.
To clarify: This test is so important for data safety in Lucene that silently disabling it on highly incompetent sysadmin's decisions is a No-Go.
Please don't open any more issues or PRs about this. Thanks.
But suggesting everyone to "fix their Windows" or wait for some Java 27 to fix this or otherwise run the tests with the SecurityManager completely off is unnecessarily rigid IMO.
Thank you for a thorough investigation into the cause of the failure - it is really enlightening. I agree with the others that making exceptions for broken system setups is probably not the right way to go. What happened to you has never been reported before, so feel unique. :) The Lucene test case setup is quite strict but this strictness has a purpose - find the problems early. This issue is a testament to how weird the real world systems can be and that the test infrastructure is actually doing its job quite well!
If you need a more permanent workaround, you can turn off the security manager in your locally generated gradle.properties - sure, you won't be running the full test suite but any PR will do it anyway, so it seems fine. Thanks again for your time spent on this.
Description
testNativeFSLockFactory
testSimpleFSLockFactory
Gradle command to reproduce
gradlew test --tests TestStressLockFactories.testNativeFSLockFactory -Dtests.seed=2D42F3FDF1FAF153 -Dtests.locale=sg -Dtests.timezone=Australia/Lindeman -Dtests.asserts=true -Dtests.file.encoding=UTF-8
gradlew test --tests TestStressLockFactories.testSimpleFSLockFactory -Dtests.seed=2D42F3FDF1FAF153 -Dtests.locale=sg -Dtests.timezone=Australia/Lindeman -Dtests.asserts=true -Dtests.file.encoding=UTF-8