adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0

Test machines naming with '-XJ' don't have required module installed #808

Closed sophia-guo closed 4 years ago

sophia-guo commented 5 years ago

Test running on https://ci.adoptopenjdk.net/computer/test-macincloud-macos1010-3-XJ/ failed with message:

06:31:54  Can't locate Text/CSV.pm in @INC (you may need to install the Text::CSV module) (@INC contains: ./makeGenTool /Library/Perl/5.18/darwin-thread-multi-2level /Library/Perl/5.18 /Network/Library/Perl/5.18/darwin-thread-multi-2level /Network/Library/Perl/5.18 /Library/Perl/Updates/5.18.2 /System/Library/Perl/5.18/darwin-thread-multi-2level /System/Library/Perl/5.18 /System/Library/Perl/Extras/5.18/darwin-thread-multi-2level /System/Library/Perl/Extras/5.18 .) at makeGenTool/parseFiles.pl line 27.
06:31:54  BEGIN failed--compilation aborted at makeGenTool/parseFiles.pl line 27.
06:31:54  Compilation failed in require at makeGenTool/mkgen.pl line 93.
06:31:54  Using projectRootDir: /Users/jenkins/workspace/openjdk8_j9_openjdktest_x86-64_macos/openjdk-tests/TestConfig/scripts/testKitGen/../../..
06:31:54  Getting modes data from modes.xml and ottawa.csv...
06:31:54  settings.mk:54: /Users/jenkins/workspace/openjdk8_j9_openjdktest_x86-64_macos/openjdk-tests/TestConfig/../TestConfig/utils.mk: No such file or directory
06:31:54  makefile:39: count.mk: No such file or directory
06:31:54  make: *** No rule to make target `count.mk'.  Stop.

https://ci.adoptopenjdk.net/view/Test_openjdk/job/openjdk8_j9_openjdktest_x86-64_macos/217/consoleFull

Text/CSV.pm is required for running Testkitgen

sophia-guo commented 5 years ago

Same issue for tests on s390x test-marist-ubuntu1604-s390x-2-XJ.

sxa commented 5 years ago

The s390x box should be ok now - let me know if there are any further issues.

sxa commented 5 years ago

Have now also installed Text::CSV, XML::Parser and JSON onto the mac system
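A minimal sketch of how a machine could be checked for these modules before tests are scheduled on it. The function name missing_perl_modules is hypothetical and not part of the infrastructure repo; it only relies on perl -M to try loading each module, and prints the names of those that fail to load.

```shell
# Print the name of every Perl module (given as arguments) that fails to load.
# missing_perl_modules is a hypothetical helper name used for illustration.
missing_perl_modules() {
    for mod in "$@"; do
        # "perl -MFoo::Bar -e 1" exits non-zero when Foo::Bar is not in @INC.
        if ! perl -M"$mod" -e 1 >/dev/null 2>&1; then
            printf '%s\n' "$mod"
        fi
    done
}
```

For example, missing_perl_modules Text::CSV XML::Parser JSON prints nothing when all three modules named in this issue are installed.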

sophia-guo commented 5 years ago

antlib is also missing on the mac system: https://ci.adoptopenjdk.net/view/Test_openjdk/job/openjdk11_hs_openjdktest_x86-64_macos/225/console

andrew-m-leonard commented 5 years ago

disabled node: test-macincloud-macos1010-3-XJ

sxa commented 5 years ago

Ran brew install ant-contrib and linked /usr/local/Cellar/ant-contrib/1.0b3/share/ant/ant-contrib-1.0b3.jar to /usr/local/Cellar/ant/1.10.1/lib/ant-contrib.jar
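A rough sketch of the linking step, assuming a symbolic link was used (the comment above only says "linked", so that is a guess). The helper name link_jar is hypothetical; it refuses to create a link when the source jar is missing.

```shell
# Symlink a jar into another directory under a new name.
# link_jar is a hypothetical helper; using a symlink (ln -s) is an assumption.
link_jar() {
    src=$1
    dest=$2
    if [ ! -f "$src" ]; then
        echo "source jar not found: $src" >&2
        return 1
    fi
    # -f replaces an existing link so the step can be re-run safely.
    ln -sf "$src" "$dest"
}
```

With the paths quoted above, the call would be link_jar /usr/local/Cellar/ant-contrib/1.0b3/share/ant/ant-contrib-1.0b3.jar /usr/local/Cellar/ant/1.10.1/lib/ant-contrib.jar.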

sxa commented 5 years ago

https://ci.adoptopenjdk.net/job/openjdk8_hs_openjdktest_x86-64_macos/373/ has passed the problematic section so the above appears to have worked.

sophia-guo commented 5 years ago

@sxa555 I see that https://ci.adoptopenjdk.net/computer/test-macincloud-macos1010-3-XJ/ is still offline and jobs are waiting for available machines. Could you re-enable it? Thanks!

sophia-guo commented 5 years ago

@sxa555 both jdk and system test jobs on https://ci.adoptopenjdk.net/computer/test-marist-ubuntu1604-s390x-2-XJ/ are taking unexpectedly long and producing extra errors.

System tests: the run on https://ci.adoptopenjdk.net/computer/test-marist-ubuntu1604-s390x-1/ passed in 2 hours; the run on https://ci.adoptopenjdk.net/computer/test-marist-ubuntu1604-s390x-2-XJ/ failed after 4 hours. Job: https://ci.adoptopenjdk.net/view/Test_system/job/openjdk11_j9_systemtest_s390x_linux/

jdk tests: the run on https://ci.adoptopenjdk.net/computer/test-marist-ubuntu1604-s390x-1/ failed after around 1.5 hours; the run on https://ci.adoptopenjdk.net/computer/test-marist-ubuntu1604-s390x-2-XJ/ failed after 9 hours with a timeout. Job: https://ci.adoptopenjdk.net/view/Test_openjdk/job/openjdk11_j9_openjdktest_s390x_linux/

Similar issues occur for other versions and implementations.

Is there any configuration difference between those two machines?

sxa commented 5 years ago

Please ensure that if an issue still persists on a closed issue that you reopen it or any comments will likely not be actioned.

I've taken the macos box back online.

For the s390x box are all the errors network timeouts? (I'm basing that on run 260 of the job you mentioned)

sophia-guo commented 5 years ago

Unfortunately I don't have this repo's reopen permission :-(

Yes, most of the failures are timeouts, which make the job take much longer than on the other machine and cause the build to time out. I wondered if there is any configuration hidden issue?

sxa commented 5 years ago

My question was whether they were all network timeouts specifically - are they?

I'm not sure what "configuration hidden issue" is suggesting - if there's an issue we need to debug and identify it as I can't tell what's wrong at the moment :-)

sxa commented 5 years ago

We need to know what operations in particular are getting stuck to be able to debug this further

lumpfish commented 5 years ago

The following tests are failing on test-macincloud-macos1010-3-XJ but pass on test-macincloud-macos1010-1

java/util/prefs/AddNodeChangeListener.java.AddNodeChangeListener
java/util/prefs/CheckUserPrefsStorage.sh.CheckUserPrefsStorage
java/util/prefs/RemoveReadOnlyNode.java.RemoveReadOnlyNode
java/util/prefs/RemoveUnregedListener.java.RemoveUnregedListener

The prefs tests are the same ones as were failing here: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8079418. The underlying issue there was user permissions - but also that is now 'resolved'.

sxa commented 5 years ago

NOTE: mac machine has been renamed from test-macincloud-macos1010-3-XJ to test-macstadium-macos1010-1-XJ as the hosting provider was incorrect

gdams commented 5 years ago

@sxa555 can I close this as the machine has been deleted (https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/849)

sxa commented 5 years ago

@gdams No this should be kept open as this covers issues with more than just the macos machine (Thanks for not leaving this closed @karianna)

@sophia-guo @lumpfish As per earlier question are the failures on the s390x box all network timeouts? We need to get this understood and resolved as it seems to be the cause of a lot of zLinux slowness at the moment. Can someone who understands the test suite determine what specific operations are hanging on the machine?

sxa commented 5 years ago

I'm going to abort #13 on https://ci.adoptopenjdk.net/job/Test_openjdk8_j9_sanity.openjdk_s390x_linux/13/ for now so I can quiesce test-marist-ubuntu1604-s390x-2-XJ and see if there are any processes left around.

sxa commented 5 years ago

Answer: lots. Ref: jenkins.maristXJ.log.gz

Here is a sample snippet of the ps listing with the June 29th stuck processes - 19 of them, of which 16 were from a base openjdktest run:

sxa@x220t:~$ gzip -cd jenkins.maristXJ.log.gz | grep Jun29 | cut -c-200 | grep openjdktest_s
jenkins  48465  0.0  0.2 2089944 23408 ?       SLl  Jun29   0:40 /home/jenkins/workspace/openjdk8_j9_openjdktest_s390x_linux/openjdkbinary/j2sdk-image/jre/bin/java -Djava.security.policy=/home/jenkins
jenkins  48514  0.0  0.2 2089944 23292 ?       SLl  Jun29   0:41 /home/jenkins/workspace/openjdk8_j9_openjdktest_s390x_linux/openjdkbinary/j2sdk-image/jre/bin/java -Djava.security.policy=/home/jenkins
jenkins  50493  0.0  0.2 2089944 22380 ?       SLl  Jun29   0:41 /home/jenkins/workspace/openjdk8_j9_openjdktest_s390x_linux/openjdkbinary/j2sdk-image/jre/bin/java -Djava.security.policy=/home/jenkins
jenkins  51539  0.0  0.2 2089944 23072 ?       SLl  Jun29   0:42 /home/jenkins/workspace/openjdk8_j9_openjdktest_s390x_linux/openjdkbinary/j2sdk-image/jre/bin/java -Djava.security.policy=/home/jenkins
jenkins  52199  0.0  0.2 2089944 23136 ?       SLl  Jun29   0:40 /home/jenkins/workspace/openjdk8_j9_openjdktest_s390x_linux/openjdkbinary/j2sdk-image/jre/bin/java -Djava.security.policy=/home/jenkins
jenkins  53244  0.0  0.2 2089944 23664 ?       SLl  Jun29   0:42 /home/jenkins/workspace/openjdk8_j9_openjdktest_s390x_linux/openjdkbinary/j2sdk-image/jre/bin/java -Djava.security.policy=/home/jenkins
jenkins  54032  0.0  0.2 2089944 23556 ?       SLl  Jun29   0:41 /home/jenkins/workspace/openjdk8_j9_openjdktest_s390x_linux/openjdkbinary/j2sdk-image/jre/bin/java -Djava.security.policy=/home/jenkins
jenkins  55486  0.0  0.1 2090200 14856 ?       SLl  Jun29   0:41 /home/jenkins/workspace/openjdk8_j9_openjdktest_s390x_linux/openjdkbinary/j2sdk-image/jre/bin/java -Djava.security.policy=/home/jenkins
jenkins  56161  0.0  0.2 2089944 22812 ?       SLl  Jun29   0:42 /home/jenkins/workspace/openjdk8_j9_openjdktest_s390x_linux/openjdkbinary/j2sdk-image/jre/bin/java -Djava.security.policy=/home/jenkins
jenkins  57244  0.0  0.2 2089944 22720 ?       SLl  Jun29   0:37 /home/jenkins/workspace/openjdk8_j9_openjdktest_s390x_linux/openjdkbinary/j2sdk-image/jre/bin/java -Djava.security.policy=/home/jenkins
jenkins  57923  0.0  0.2 2089944 22852 ?       SLl  Jun29   0:41 /home/jenkins/workspace/openjdk8_j9_openjdktest_s390x_linux/openjdkbinary/j2sdk-image/jre/bin/java -Djava.security.policy=/home/jenkins
jenkins  59612  0.0  0.1 2089944 14888 ?       SLl  Jun29   0:38 /home/jenkins/workspace/openjdk8_j9_openjdktest_s390x_linux/openjdkbinary/j2sdk-image/jre/bin/java -Djava.security.policy=/home/jenkins
jenkins  61204  0.0  0.1 2089944 14600 ?       SLl  Jun29   0:40 /home/jenkins/workspace/openjdk8_j9_openjdktest_s390x_linux/openjdkbinary/j2sdk-image/jre/bin/java -Djava.security.policy=/home/jenkins
jenkins  62194  0.0  0.1 2089944 14836 ?       SLl  Jun29   0:39 /home/jenkins/workspace/openjdk8_j9_openjdktest_s390x_linux/openjdkbinary/j2sdk-image/jre/bin/java -Djava.security.policy=/home/jenkins
jenkins  63804  0.0  0.1 2090200 14640 ?       SLl  Jun29   0:41 /home/jenkins/workspace/openjdk8_j9_openjdktest_s390x_linux/openjdkbinary/j2sdk-image/jre/bin/java -Djava.security.policy=/home/jenkins
jenkins  64710  0.0  0.1 2089944 14556 ?       SLl  Jun29   0:42 /home/jenkins/workspace/openjdk8_j9_openjdktest_s390x_linux/openjdkbinary/j2sdk-image/jre/bin/java -Djava.security.policy=/home/jenkins

(For the record, these process listings are also created regularly and are visible at https://ci.adoptopenjdk.net/job/SXA-processCheck/label=test-marist-ubuntu1604-s390x-2-XJ/)
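A minimal sketch of how such a listing could be tallied by start date, assuming the log lines follow the ps aux column layout shown above, where the START value is the ninth whitespace-separated field. The function name count_started_on is hypothetical.

```shell
# Count the lines of a ps listing (read from stdin) whose START column
# matches the given value, e.g. Jun29.
# Assumes ps aux layout: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND.
count_started_on() {
    awk -v day="$1" '$9 == day { n++ } END { print n + 0 }'
}
```

It could be fed from the same pipeline, for example: gzip -cd jenkins.maristXJ.log.gz | grep openjdktest_s | count_started_on Jun29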

I have cleared out the processes (close to 100 of them) and re-enabled the executor. https://ci.adoptopenjdk.net/job/Test_openjdk13_j9_sanity.system_s390x_linux/7/ is the first job to get scheduled on it.

FYI @smlambert

smlambert commented 5 years ago

Hmm, processes from back on Jun29.

Possibly related: https://github.com/AdoptOpenJDK/openjdk-tests/issues/1071 https://github.com/AdoptOpenJDK/openjdk-tests/issues/1051

sxa commented 5 years ago

Jun29 was just me attempting to show a sample snapshot from a random day :-) Thanks for those two links - I figured you might have some other issues on this somewhere so great to have them all linked now. Not all of the hung processes were from Hotspot runs but they could have been the trigger for others failing.

sophia-guo commented 5 years ago

@sxa555 For JDK tests, yes: almost all failing tests (around 110) are in the rmi, nio and net groups. The error message is either 'timeout' or 'Cannot assign requested address' (that is, a failure to assign a network address). https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk8_hs_sanity.openjdk_s390x_linux/17/#showFailuresLink

sxa commented 4 years ago

Since the original Marist machines have now been decommissioned, both -XJ machines that this issue refers to are no longer in the test machine set, therefore closing.