adoptium / infrastructure

This repo contains all information about machine maintenance.

Multiple linux machines resolve invalid host names #984

Closed adam-thorpe closed 4 years ago

adam-thorpe commented 4 years ago

When creating an InetSocketAddress(String host, int port) object, the constructor passes the host name to InetAddress to see if it is a valid address. If the host name cannot be found, the address is marked as unresolved (its InetAddress is set to null), which can be tested via the isUnresolved() method. It would seem that addresses are not being marked correctly on the x64 Linux machines. This is consistent on both OpenJ9 and HotSpot.
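
A minimal sketch of the check described above, using only the standard java.net API. The class name UnresolvedCheck, the helper method, and the default host name are made up for illustration; this is not the code from the failing test.

```java
import java.net.InetSocketAddress;

class UnresolvedCheck {
    // Builds the socket address the same way the issue describes and
    // reports whether the JDK flagged it as unresolved.
    static boolean isUnresolved(String host, int port) {
        InetSocketAddress address = new InetSocketAddress(host, port);
        return address.isUnresolved();
    }

    public static void main(String[] args) {
        // Hypothetical default; pass any name you expect NOT to exist.
        String host = args.length > 0 ? args[0] : "randomhostnamestring";
        InetSocketAddress address = new InetSocketAddress(host, 80);
        // getAddress() returns null when the address is unresolved.
        System.out.println(host + " unresolved=" + address.isUnresolved()
                + " address=" + address.getAddress());
    }
}
```

For a name that does not exist, the expectation is unresolved=true with a null address; the behaviour reported here is that the affected machines hand back an IP instead.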

Test: java/nio/channels/SocketChannel/ExceptionTranslation.java. This test attempts to connect to an invalid host address and checks that an UnknownHostException is thrown. However, the connect() method hangs and then throws:

10:47:39  STDERR:
10:47:39  java.net.SocketTimeoutException
10:47:39    at java.base/sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:129)
10:47:39    at ExceptionTranslation.main(ExceptionTranslation.java:40)
10:47:39    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
10:47:39    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
10:47:39    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
10:47:39    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
10:47:39    at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:298)
10:47:39    at java.base/java.lang.Thread.run(Thread.java:832)
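
For context, a rough sketch of the behaviour the test appears to rely on: connecting a SocketChannel's socket adaptor to an unresolved address with a non-zero timeout should fail fast with an UnknownHostException rather than time out. This is not the actual jtreg test; the class name, helper, host name and timeout are all assumptions, and the address is forced to be unresolved with createUnresolved() so the sketch does not depend on DNS.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.nio.channels.SocketChannel;

class ConnectTranslationSketch {
    // Returns the IOException raised by connect(), or null if the
    // connect unexpectedly succeeded.
    static IOException connectFailure(InetSocketAddress target, int timeoutMillis) {
        try (SocketChannel channel = SocketChannel.open()) {
            // Same call shape as the stack trace: SocketAdaptor.connect(...)
            channel.socket().connect(target, timeoutMillis);
            return null;
        } catch (IOException e) {
            return e;
        }
    }

    public static void main(String[] args) {
        // "nosuchhost" is a placeholder; createUnresolved() skips the lookup.
        InetSocketAddress target = InetSocketAddress.createUnresolved("nosuchhost", 80);
        IOException failure = connectFailure(target, 1000);
        if (failure instanceof UnknownHostException) {
            System.out.println("Got UnknownHostException for unresolved address");
        } else {
            System.out.println("Unexpected result: " + failure);
        }
    }
}
```

If the resolver wrongly returns an IP for the bogus name, the address is no longer unresolved, so connect() tries a real connection and can end in the SocketTimeoutException shown in the trace above.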

adam-thorpe commented 4 years ago

Each machine seems to find the same few IP addresses, so the godaddy c7 machine and softlayer rhel machines I tested on output the same IP, but godaddy ubuntu was different (I won't put the IPs here in case they are important). These don't seem to be linked to the hostname, as changing that didn't affect the results.

sxa commented 4 years ago

IPs are all in https://github.com/AdoptOpenJDK/openjdk-infrastructure/blob/master/ansible/inventory.yml so not especially sensitive - can you provide a sample java app (just a few lines of real code) with the names that are producing undesirable results to aid debugging of this please?

adam-thorpe commented 4 years ago

TestIP.zip example output: https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/1049

adam-thorpe commented 4 years ago

Other architectures do seem to be failing this test, however only a couple machines seem to be affected. test-marist-ubuntu1604-s390x-1 is an example of a passing z linux box

sxa commented 4 years ago

Like the docker issues we've seen, this appears to be something specific to the godaddy hosting infrastructure:

adoptopenjdk@test-godaddy-ubuntu1604-x64-4:~$ ping -c 1 randomhostnamestring
PING randomhostnamestring.dc1.corp.gd (185.53.178.6) 56(84) bytes of data.
64 bytes from 185.53.178.6: icmp_seq=1 ttl=46 time=17.8 ms

--- randomhostnamestring.dc1.corp.gd ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 17.847/17.847/17.847/0.000 ms
adoptopenjdk@test-godaddy-ubuntu1604-x64-4:~$ ping -c 1 completely.different.host.name
PING completely.different.host.name.dc1.corp.gd (185.53.178.6) 56(84) bytes of data.
64 bytes from 185.53.178.6: icmp_seq=1 ttl=46 time=17.6 ms

--- completely.different.host.name.dc1.corp.gd ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 17.695/17.695/17.695/0.000 ms
adoptopenjdk@test-godaddy-ubuntu1604-x64-4:~$ 

Not sure if there's a lot we can do about it unless we try to switch the DNS away from the ones they're configured with or possibly remove dc1.corp.gd from the DNS search list.
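
The same check can be done from Java rather than with ping: look up a freshly generated name that should not exist and see whether the resolver returns an address anyway. This is only a sketch; WildcardDnsProbe and its helper names are hypothetical, and a positive result only says that something (a search domain or a wildcard record) is answering for arbitrary names, not which one.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.UUID;

class WildcardDnsProbe {
    // Generates a single-label host name that is very unlikely to exist.
    // "probe-" plus 32 hex characters stays under the 63-character label limit.
    static String randomHostName() {
        return "probe-" + UUID.randomUUID().toString().replace("-", "");
    }

    // Returns the resolved address, or null when the lookup fails
    // (which is the healthy outcome for a bogus name).
    static InetAddress resolveOrNull(String host) {
        try {
            return InetAddress.getByName(host);
        } catch (UnknownHostException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String host = randomHostName();
        InetAddress address = resolveOrNull(host);
        if (address == null) {
            System.out.println(host + " did not resolve (expected)");
        } else {
            System.out.println(host + " resolved to " + address.getHostAddress()
                    + " (resolver is answering for arbitrary names)");
        }
    }
}
```

Running it before and after changing the DNS search list should show whether the change took effect on a given machine.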

sxa commented 4 years ago

As per https://adoptopenjdk.slack.com/archives/C53GHCXL4/p1573832895014600 Demetrius seems happy with us modifying the DNS configuration, so I will look at doing that at the start of next week, which will hopefully resolve this.

sxa commented 4 years ago

Have removed dc1.corp.gd and hosting.cop.hd from /etc/resolv.conf on the four GoDaddy Ubuntu 16.04 machines. This won't be permanent, but it will hopefully let us see if the test passes tonight.

adam-thorpe commented 4 years ago

Test passed: https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/1099/

gdams commented 4 years ago

Not sure that this can be closed yet, I assume that @sxa555 will need to make a permanent change to the boxes

adam-thorpe commented 4 years ago

Yes there are other boxes that still fail this test and have not had the same change implemented on them

sxa commented 4 years ago

@adam-thorpe Which machines? Are you seeing it on the non-Ubuntu GoDaddy ones too?

adam-thorpe commented 4 years ago

@sxa555 Yes, I retested this on a GoDaddy debian8 machine which still fails: https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/1100/ I'm fairly sure there were a couple more that aren't even GoDaddy machines, like the Softlayer Rhel ones. I can try to gather a list if you'd like, however I'm pretty sure a large number of boxes are affected

sxa commented 4 years ago

OK thanks - looks like the SL RHEL ones are seeing the issue because the adoptopenjdk.net domain similarly resolves any DNS request underneath it ...

sxa commented 4 years ago

Edited /etc/network/interfaces.d/hfs* to remove dc1.corp.gd and hosting.cop.hd from the dns-search line on the ubuntu 2-4 machines. My credentials for the adoptopenjdk user don't seem to work on the -1 ubuntu machine though. @gdams is the password on that one different? If so please send me the new one somehow as these machines don't have the admin team's ssh keys installed

sxa commented 4 years ago

We need to look at how to resolve this for the adoptopenjdk.net domain since many other machines are configured with that as their default domain and will experience the same symptoms

karianna commented 4 years ago

@gdams and I will look at it, should be a *.domainname problem.

karianna commented 4 years ago

Fixed. LMK if that works.

sxa commented 4 years ago

A quick test on the machines previously affected shows that the problem no longer exists (possibly except for the entries on the godaddy-1 ubuntu machine which I can't access) so I think we're good almost everywhere now.

adam-thorpe commented 4 years ago

I'll un-exclude the test then and see if it starts passing in the nightlies. This may have affected a bunch of tests which would be nice